/* stb_image_resize2 - v2.12 - public domain image resizing

   by Jeff Roberts (v2) and Jorge L Rodriguez
   http://github.com/nothings/stb

   Can be threaded with the extended API. SSE2, AVX, Neon and WASM SIMD support. Only
   scaling and translation are supported, no rotations or shears.

   COMPILING & LINKING
      In one C/C++ file that #includes this file, do this:
         #define STB_IMAGE_RESIZE_IMPLEMENTATION
      before the #include. That will create the implementation in that file.

   EASY API CALLS:
     Easy API downsamples w/Mitchell filter, upsamples w/cubic interpolation, clamps to edge.

     stbir_resize_uint8_srgb( input_pixels,  input_w,  input_h,  input_stride_in_bytes,
                              output_pixels, output_w, output_h, output_stride_in_bytes,
                              pixel_layout_enum )

     stbir_resize_uint8_linear( input_pixels,  input_w,  input_h,  input_stride_in_bytes,
                                output_pixels, output_w, output_h, output_stride_in_bytes,
                                pixel_layout_enum )

     stbir_resize_float_linear( input_pixels,  input_w,  input_h,  input_stride_in_bytes,
                                output_pixels, output_w, output_h, output_stride_in_bytes,
                                pixel_layout_enum )

     If you pass NULL or zero for the output_pixels, we will allocate the output buffer
     for you and return it from the function (free with free() or STBIR_FREE).
     As a special case, XX_stride_in_bytes of 0 means packed contiguously in memory.

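     As a rough illustration of the stride convention above, the sketch below expands
     a stride of 0 into the packed row size. The helper name packed_stride_in_bytes
     and its parameters are hypothetical and are not part of this library.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper (not a library function): returns the row pitch that a
// stride of 0 stands for, i.e. rows stored back to back with no padding.
static ptrdiff_t packed_stride_in_bytes( int stride_in_bytes, int width,
                                         int channels, size_t bytes_per_channel )
{
  assert( width > 0 && channels > 0 && bytes_per_channel > 0 );
  if ( stride_in_bytes != 0 )
    return stride_in_bytes;   // caller supplied an explicit stride, keep it
  return (ptrdiff_t) width * channels * (ptrdiff_t) bytes_per_channel;
}
```

     For a 640 pixel wide, 4 channel, 8-bit image this gives 2560 bytes per row.
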
   API LEVELS
      There are three levels of API - easy-to-use, medium-complexity and extended-complexity.

      See the "header file" section of the source for API documentation.

   ADDITIONAL DOCUMENTATION

      MEMORY ALLOCATION
         By default, we use malloc and free for memory allocation.  To override the
         memory allocation, before the implementation #include, add:

            #define STBIR_MALLOC(size,user_data) ...
            #define STBIR_FREE(ptr,user_data)   ...

         Each resize makes exactly one call to malloc/free (unless you use the
         extended API where you can do one allocation for many resizes). Under
         address sanitizer, we do separate allocations to find overread/writes.

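         A minimal sketch of such an override is shown below. Only the two macro
         names and their (size,user_data) and (ptr,user_data) parameters come from the
         text above; the counting functions and the my_alloc_stats struct are
         hypothetical, and the sketch assumes user_data may be null.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical bookkeeping struct, reached through the user_data pointer.
typedef struct { int allocs; int frees; } my_alloc_stats;

static void * my_counting_malloc( size_t size, void * user_data )
{
  my_alloc_stats * stats = (my_alloc_stats *) user_data;
  assert( size > 0 );
  if ( stats ) stats->allocs++;   // guard: user_data is assumed to be optional
  return malloc( size );
}

static void my_counting_free( void * ptr, void * user_data )
{
  my_alloc_stats * stats = (my_alloc_stats *) user_data;
  if ( stats ) stats->frees++;
  free( ptr );
}

// These two defines go before the implementation #include.
#define STBIR_MALLOC(size,user_data) my_counting_malloc( (size), (user_data) )
#define STBIR_FREE(ptr,user_data)    my_counting_free( (ptr), (user_data) )
```

         With this in place, a matched allocs/frees count after a resize is a cheap
         way to check the one-allocation-per-resize behavior described above.
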
      PERFORMANCE
         This library was written with an emphasis on performance. When testing
         stb_image_resize with RGBA, the fastest mode is STBIR_4CHANNEL with
         STBIR_TYPE_UINT8 pixels and CLAMPed edges (which is what many other resize
         libs do by default). Also, make sure SIMD is turned on of course (default
         for 64-bit targets). Avoid WRAP edge mode if you want the fastest speed.

         This library also comes with profiling built-in. If you define STBIR_PROFILE,
         you can use the advanced API and get low-level profiling information by
         calling stbir_resize_extended_profile_info() or stbir_resize_split_profile_info()
         after a resize.

      SIMD
         Most of the routines have optimized SSE2, AVX, NEON and WASM versions.

         On Microsoft compilers, we automatically turn on SIMD for 64-bit x64 and
         ARM; for 32-bit x86 and ARM, you select SIMD mode by defining STBIR_SSE2 or
         STBIR_NEON. For AVX and AVX2, we auto-select it by detecting the /arch:AVX
         or /arch:AVX2 switches. You can also always manually turn SSE2, AVX or AVX2
         support on by defining STBIR_SSE2, STBIR_AVX or STBIR_AVX2.

         On Linux, SSE2 and Neon are on by default for 64-bit x64 or ARM64. For 32-bit,
         we select x86 SIMD mode by whether you have -msse2, -mavx or -mavx2 enabled
         on the command line. For 32-bit ARM, you must pass -mfpu=neon-vfpv4 for both
         clang and GCC, but GCC also requires an additional -mfp16-format=ieee to
         automatically enable NEON.

         On x86 platforms, you can also define STBIR_FP16C to turn on FP16C instructions
         for converting back and forth to half-floats. This is autoselected when we
         are using AVX2. Clang and GCC also require the -mf16c switch. ARM always uses
         the built-in half float hardware NEON instructions.

         You can also tell us to use multiply-add instructions with STBIR_USE_FMA.
         Because x86 doesn't always have fma, we turn it off by default to maintain
         determinism across all platforms. If you don't care about non-FMA determinism
         and are willing to restrict yourself to more recent x86 CPUs (around the AVX
         timeframe), then fma will give you around a 15% speedup.

         You can force off SIMD in all cases by defining STBIR_NO_SIMD. You can turn
         off AVX or AVX2 specifically with STBIR_NO_AVX or STBIR_NO_AVX2. AVX is 10%
         to 40% faster, and AVX2 is generally another 12%.

      ALPHA CHANNEL
         Most of the resizing functions provide the ability to control how the alpha
         channel of an image is processed.

         When alpha represents transparency, it is important that when combining
         colors with filtering, the pixels should not be treated equally; they
         should use a weighted average based on their alpha values. For example,
         if a pixel is 1% opaque bright green and another pixel is 99% opaque
         black and you average them, the average will be 50% opaque, but the
         unweighted average will be a middling green color, while the weighted
         average will be nearly black. This means the unweighted version introduced
         green energy that didn't exist in the source image.

         (If you want to know why this makes sense, you can work out the math for
         the following: consider what happens if you alpha composite a source image
         over a fixed color and then average the output, vs. if you average the
         source image pixels and then composite that over the same fixed color.
         Only the weighted average produces the same result as the ground truth
         composite-then-average result.)

         Therefore, it is in general best to "alpha weight" the pixels when applying
         filters to them. This essentially means multiplying the colors by the alpha
         values before combining them, and then dividing by the alpha value at the
         end.

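         The arithmetic above can be checked with a small stand-alone sketch. It is
         not library code: the rgba_f type and all three function names are
         hypothetical, and it averages just two pixels with equal filter weights.

```c
#include <assert.h>

// Hypothetical pixel type for this sketch: non-premultiplied, values in 0..1.
typedef struct { float r, g, b, a; } rgba_f;

// Unweighted: every channel is averaged independently.
static rgba_f average_unweighted( rgba_f p, rgba_f q )
{
  rgba_f o;
  o.r = ( p.r + q.r ) * 0.5f;
  o.g = ( p.g + q.g ) * 0.5f;
  o.b = ( p.b + q.b ) * 0.5f;
  o.a = ( p.a + q.a ) * 0.5f;
  return o;
}

// Alpha weighted: multiply colors by alpha, average, then divide by the
// averaged alpha at the end.
static rgba_f average_alpha_weighted( rgba_f p, rgba_f q )
{
  rgba_f o;
  assert( p.a >= 0.0f && q.a >= 0.0f );
  o.a = ( p.a + q.a ) * 0.5f;
  o.r = ( p.r * p.a + q.r * q.a ) * 0.5f;
  o.g = ( p.g * p.a + q.g * q.a ) * 0.5f;
  o.b = ( p.b * p.a + q.b * q.a ) * 0.5f;
  if ( o.a > 0.0f ) { o.r /= o.a; o.g /= o.a; o.b /= o.a; }
  return o;   // with zero alpha the color channels are left at zero
}

// Composite one color channel over an opaque background value.
static float composite_over( float color, float alpha, float background )
{
  return color * alpha + background * ( 1.0f - alpha );
}
```

         For the 1% opaque green and 99% opaque black pixels, the unweighted green
         channel comes out at 0.5 while the alpha-weighted one is about 0.01.
         Composited over white, the weighted result gives about 0.505 in green, which
         matches compositing each pixel first and then averaging; the unweighted
         result gives 0.75.
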
         The computer graphics industry introduced a technique called "premultiplied
         alpha" or "associated alpha" in which image colors are stored in image files
         already multiplied by their alpha. This saves some math when compositing,
         and also avoids the need to divide by the alpha at the end (which is quite
         inefficient). However, while premultiplied alpha is common in the movie CGI
         industry, it is not commonplace in other industries like videogames, and most
         consumer file formats are generally expected to contain not-premultiplied
         colors. For example, Photoshop saves PNG files "unpremultiplied", and web
         browsers like Chrome and Firefox expect PNG images to be unpremultiplied.

         Note that there are three possibilities that might describe your image
         and resize expectation:

             1. images are not premultiplied, alpha weighting is desired
             2. images are not premultiplied, alpha weighting is not desired
             3. images are premultiplied

         Both case #2 and case #3 require the exact same math: no alpha weighting
         should be applied or removed. Only case #1 requires extra math operations;
         the other two cases can be handled identically.

         stb_image_resize expects case #1 by default, applying alpha weighting to
         images, expecting the input images to be unpremultiplied. This is what the
         COLOR+ALPHA buffer types tell the resizer to do.

         When you use the pixel layouts STBIR_RGBA, STBIR_BGRA, STBIR_ARGB,
         STBIR_ABGR, STBIR_RA, or STBIR_AR you are telling us that the pixels are
         non-premultiplied. In these cases, the resizer will alpha weight the colors
         (effectively creating the premultiplied image), do the filtering, and then
         convert back to non-premult on exit.

         When you use the pixel layouts STBIR_RGBA_PM, STBIR_BGRA_PM, STBIR_ARGB_PM,
         STBIR_ABGR_PM, STBIR_RA_PM or STBIR_AR_PM, you are telling us that the pixels
         ARE premultiplied. In this case, the resizer doesn't have to do the
         premultiplying - it can filter directly on the input. This is about twice as
         fast as the non-premultiplied case, so it's the right option if your data is
         already set up correctly.

         When you use the pixel layout STBIR_4CHANNEL or STBIR_2CHANNEL, you are
         telling us that there is no channel that represents transparency; it may be
         RGB and some unrelated fourth channel that has been stored in the alpha
         channel, but it is actually not alpha. No special processing will be
         performed.

         The difference between the generic 4 or 2 channel layouts and the
         specialized _PM versions is that with the _PM versions you are telling us
         that the data *is* alpha, just don't premultiply it. That's important when
         using SRGB pixel formats: we need to know where the alpha is, because
         it is converted linearly (rather than with the SRGB converters).

         Because alpha weighting produces the same effect as premultiplying, you
         even have the option with non-premultiplied inputs to let the resizer
         produce a premultiplied output. Because the initially computed alpha-weighted
         output image is effectively premultiplied, this is actually more performant
         than the normal path which un-premultiplies the output image as a final step.

         Finally, when converting both in and out of non-premultiplied space (for
         example, when using STBIR_RGBA), we go to somewhat heroic measures to
         ensure that areas with zero alpha value pixels get something reasonable
         in the RGB values. If you don't care about the RGB values of zero alpha
         pixels, you can call the stbir_set_non_pm_alpha_speed_over_quality()
         function - this runs a non-premultiplied resize about 25% faster. That said,
         when you really care about speed, using premultiplied pixels for both in
         and out (STBIR_RGBA_PM, etc) is much faster than both of these
         non-premultiplied options.

      PIXEL LAYOUT CONVERSION
         The resizer can convert from some pixel layouts to others. When using
         stbir_set_pixel_layouts(), you can, for example, specify STBIR_RGBA
         on input, and STBIR_ARGB on output, and it will re-organize the channels
         during the resize. Currently, you can only convert between two pixel
         layouts with the same number of channels.

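         To make concrete what re-organizing the channels means, here is a minimal
         stand-alone sketch of the per-pixel reordering for the STBIR_RGBA to
         STBIR_ARGB case on packed 8-bit data. The function name rgba_to_argb is
         hypothetical; the resizer does the equivalent as part of the resize rather
         than as a separate pass.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical illustration (not library code): reorder packed 8-bit RGBA
// pixels into ARGB order, with no resizing involved.
static void rgba_to_argb( const unsigned char * rgba, unsigned char * argb, size_t num_pixels )
{
  size_t i;
  assert( rgba && argb );
  for ( i = 0; i < num_pixels; i++ )
  {
    argb[ i*4 + 0 ] = rgba[ i*4 + 3 ];  // A moves to the front
    argb[ i*4 + 1 ] = rgba[ i*4 + 0 ];  // R
    argb[ i*4 + 2 ] = rgba[ i*4 + 1 ];  // G
    argb[ i*4 + 3 ] = rgba[ i*4 + 2 ];  // B
  }
}
```

         Both layouts have four channels, which is why this pair is convertible under
         the same-channel-count restriction.
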
      DETERMINISM
         We commit to being deterministic (from x64 to ARM to scalar to SIMD, etc).
         This requires compiling with fast-math off (using at least /fp:precise).
         Also, you must turn off fp-contracting (which turns mult+adds into fmas)!
         We attempt to do this with pragmas, but with Clang, you usually want to add
         -ffp-contract=off to the command line as well.

         For 32-bit x86, you must use SSE and SSE2 codegen for determinism. That is,
         if the scalar x87 unit gets used at all, we immediately lose determinism.
         On Microsoft Visual Studio 2008 and earlier, from what we can tell there is
         no way to be deterministic in 32-bit x86 (some x87 always leaks in, even
         with /fp:strict). On 32-bit x86 GCC, determinism requires both -msse2 and
         -mfpmath=sse.

         Note that we will not be deterministic with float data containing NaNs -
         the NaNs will propagate differently on different SIMD and platforms.

         If you turn on STBIR_USE_FMA, then we will be deterministic with other
         fma targets, but we will differ from non-fma targets (this is unavoidable,
         because a fma isn't simply an add with a mult - it also introduces a
         rounding difference compared to non-fma instruction sequences).

      FLOAT PIXEL FORMAT RANGE
         Any range of values can be used for the non-alpha float data that you pass
         in (0 to 1, -1 to 1, whatever). However, if you are inputting float values
         but *outputting* bytes or shorts, you must use a range of 0 to 1 so that we
         scale back properly. The alpha channel must also be 0 to 1 for any format
         that does premultiplication prior to resizing.

         Note also that with float output, using filters with negative lobes, the
         output filtered values might go slightly out of range. You can define
         STBIR_FLOAT_LOW_CLAMP and/or STBIR_FLOAT_HIGH_CLAMP to specify the range
         to clamp to on output, if that's important.

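         The overshoot from negative lobes can be reproduced with a stand-alone
         sketch. It uses the textbook Catmull-Rom kernel, which is assumed here to be
         representative of a filter with negative lobes; it is not taken from the
         library's implementation, and all three function names are hypothetical.

```c
#include <assert.h>

// Textbook Catmull-Rom kernel, centered at zero. Negative for 1 < |x| < 2.
static float catmull_rom( float x )
{
  if ( x < 0.0f ) x = -x;
  if ( x < 1.0f ) return 1.0f - x * x * ( 2.5f - 1.5f * x );
  if ( x < 2.0f ) return 2.0f - x * ( 4.0f + x * ( 0.5f * x - 2.5f ) );
  return 0.0f;
}

// Filter four equally spaced samples at fractional position t (0..1) between
// samples[1] and samples[2]. The samples sit at positions -1, 0, 1 and 2.
static float filter4( const float samples[4], float t )
{
  float sum = 0.0f;
  int i;
  for ( i = 0; i < 4; i++ )
    sum += samples[i] * catmull_rom( t - (float)( i - 1 ) );
  return sum;
}

// Hypothetical clamp helper showing the effect of a low/high clamp pair.
static float clamp_float( float v, float lo, float hi )
{
  return ( v < lo ) ? lo : ( v > hi ) ? hi : v;
}
```

         Filtering the in-range samples 0, 1, 1, 0 at the midpoint gives 1.125, which
         is above the input maximum of 1; clamping to the 0 to 1 range brings it back
         to 1.
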
      MAX/MIN SCALE FACTORS
         The input pixel resolutions are in integers, and we do the internal pointer
         resolution in size_t sized integers. However, the scale ratio from input
         resolution to output resolution is calculated in float form. This means
         the effective possible scale ratio is limited to 24 bits (or 16 million
         to 1). As you get close to the size of the float resolution (again, 16
         million pixels wide or high), you might start seeing float inaccuracy
         issues in general in the pipeline. If you have to do extreme resizes,
         you can usually do this in multiple stages (using float intermediate
         buffers).

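         The 24-bit figure corresponds to the 24-bit significand of an IEEE
         single-precision float. The sketch below, with a hypothetical helper name,
         shows where whole-pixel steps stop being representable.

```c
#include <assert.h>

// Returns 1 if adding one to v changes the stored float value, 0 otherwise.
static int float_can_step_by_one( float v )
{
  volatile float next = v + 1.0f;   // volatile forces rounding to float precision
  return next != v;
}
```

         At 16777215 (2^24 - 1) a step of one pixel is still distinguishable; at
         16777216 (2^24) it is not, because 16777217 has no float representation.
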
      FLIPPED IMAGES
         Stride is just the delta from one scanline to the next. This means you can
         use a negative stride to handle inverted images (point to the final
         scanline and use a negative stride). You can invert the input or output,
         using negative strides.

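         A minimal sketch of the pointer arithmetic for a flipped view follows. The
         helper flipped_view is hypothetical and not part of the library; it assumes
         an explicit positive stride for the top-down image.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper (not a library function): given a top-down image,
// produce the start pointer and stride that walk it bottom-up.
static const unsigned char * flipped_view( const unsigned char * pixels, int height,
                                           int stride_in_bytes, int * flipped_stride )
{
  assert( pixels && height > 0 && stride_in_bytes > 0 && flipped_stride );
  *flipped_stride = -stride_in_bytes;                             // step backwards per row
  return pixels + (ptrdiff_t)( height - 1 ) * stride_in_bytes;   // final scanline
}
```

         The returned pointer and the negative stride would then take the place of the
         pixel pointer and stride_in_bytes arguments of a resize call, for either the
         input or the output image.
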
      DEFAULT FILTERS
         For functions which don't provide explicit control over what filters to
         use, you can change the compile-time defaults with:

            #define STBIR_DEFAULT_FILTER_UPSAMPLE     STBIR_FILTER_something
            #define STBIR_DEFAULT_FILTER_DOWNSAMPLE   STBIR_FILTER_something

         See stbir_filter in the header-file section for the list of filters.

      NEW FILTERS
         A number of 1D filter kernels are supplied. For a list of supported
         filters, see the stbir_filter enum. You can install your own filters by
         using the stbir_set_filter_callbacks function.

      PROGRESS
         For interactive use with slow resize operations, you can use the
         scanline callbacks in the extended API. It would have to be a *very* large
         image resample to need progress though - we're very fast.

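         As a hedged sketch, a progress counter could be written as an output
         scanline callback. The function below matches the stbir_output_callback
         signature declared in the header section; the my_progress struct is
         hypothetical, and the sketch assumes the callback runs once per output
         scanline on a single thread and that context points at that struct.

```c
#include <assert.h>

// Hypothetical progress state, reached through the callback's context pointer.
typedef struct { int total_rows; int rows_done; int percent; } my_progress;

// Same parameter list as stbir_output_callback:
//   ( void const * output_ptr, int num_pixels, int y, void * context )
static void my_progress_output_cb( void const * output_ptr, int num_pixels, int y, void * context )
{
  my_progress * p = (my_progress *) context;
  (void) output_ptr; (void) num_pixels; (void) y;   // unused in this sketch
  assert( p && p->total_rows > 0 );
  p->rows_done++;
  p->percent = ( p->rows_done * 100 ) / p->total_rows;
}
```

         Such a function would be installed with stbir_set_pixel_callbacks() as the
         output callback. If a resize is split across threads, the counter would need
         to be made thread safe.
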
      CEIL and FLOOR
         In scalar mode, the only functions we use from math.h are ceilf and floorf,
         but if you have your own versions, you can define the STBIR_CEILF(v) and
         STBIR_FLOORF(v) macros and we'll use them instead. In SIMD, we just use
         our own versions.

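         A minimal sketch of such replacements is shown below. Only the two macro
         names come from the text above; my_floorf and my_ceilf are hypothetical, and
         they are only valid for values that fit in an int, which is an assumption of
         this sketch.

```c
#include <assert.h>

// Hypothetical math.h-free floor, valid only while v fits in an int.
static float my_floorf( float v )
{
  int i = (int) v;                    // truncates toward zero
  float f = (float) i;
  return ( f > v ) ? f - 1.0f : f;    // fix up negative non-integers
}

// Hypothetical math.h-free ceil, valid only while v fits in an int.
static float my_ceilf( float v )
{
  int i = (int) v;                    // truncates toward zero
  float f = (float) i;
  return ( f < v ) ? f + 1.0f : f;    // fix up positive non-integers
}

// Define these before the implementation #include.
#define STBIR_CEILF(v)  my_ceilf( v )
#define STBIR_FLOORF(v) my_floorf( v )
```

         The cast-based approach is chosen here only because it needs no library
         support; any functions with the same rounding behavior would do.
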
      ASSERT
         Define STBIR_ASSERT(boolval) to override assert() and not use assert.h

     PORTING FROM VERSION 1
        The API has changed. You can continue to use the old version of stb_image_resize.h,
        which is available in the "deprecated/" directory.

        If you're using the old simple-to-use API, porting is straightforward.
        (For more advanced APIs, read the documentation.)

          stbir_resize_uint8():
            - call `stbir_resize_uint8_linear`, cast channel count to `stbir_pixel_layout`

          stbir_resize_float():
            - call `stbir_resize_float_linear`, cast channel count to `stbir_pixel_layout`

          stbir_resize_uint8_srgb():
            - function name is unchanged
            - cast channel count to `stbir_pixel_layout`
            - above is sufficient unless your image has alpha and it's not RGBA/BGRA
              - in that case, follow the below instructions for stbir_resize_uint8_srgb_edgemode

          stbir_resize_uint8_srgb_edgemode()
            - switch to the "medium complexity" API
            - stbir_resize(), very similar API but a few more parameters:
              - pixel_layout: cast channel count to `stbir_pixel_layout`
              - data_type:    STBIR_TYPE_UINT8_SRGB
              - edge:         unchanged (STBIR_EDGE_WRAP, etc.)
              - filter:       STBIR_FILTER_DEFAULT
            - which channel is alpha is specified in stbir_pixel_layout, see enum for details

      FUTURE TODOS
         * For polyphase integral filters, we just memcpy the coeffs to dupe
           them, but we should indirect and use the same coeff memory.
         * Add pixel layout conversions for sensible different channel counts
           (maybe, 1->3/4, 3->4, 4->1, 3->1).
         * For SIMD encode and decode scanline routines, do any pre-aligning
           for bad input/output buffer alignments and pitch?
         * For very wide scanlines, we should do vertical strips to stay within
           L2 cache. Maybe do chunks of 1K pixels at a time. There would be
           some pixel reconversion, but probably dwarfed by things falling out
           of cache. Probably also something possible with alternating between
           scattering and gathering at high resize scales?
         * Rewrite the coefficient generator to do many at once.
         * AVX-512 vertical kernels - worried about downclocking here.
         * Convert the reincludes to macros when we know they aren't changing.
         * Experiment with pivoting the horizontal and always using the
           vertical filters (which are faster, but perhaps not enough to overcome
           the pivot cost and the extra memory touches). Need to buffer the whole
           image so have to balance memory use.
         * Most of our code is internally function pointers, should we compile
           all the SIMD stuff always and dynamically dispatch?

   CONTRIBUTORS
      Jeff Roberts: 2.0 implementation, optimizations, SIMD
      Martins Mozeiko: NEON simd, WASM simd, clang and GCC whisperer
      Fabian Giesen: half float and srgb converters
      Sean Barrett: API design, optimizations
      Jorge L Rodriguez: Original 1.0 implementation
      Aras Pranckevicius: bugfixes
      Nathan Reed: warning fixes for 1.0

   REVISIONS
      2.12 (2024-10-18) fix incorrect use of user_data with STBIR_FREE
      2.11 (2024-09-08) fix harmless asan warnings in 2-channel and 3-channel mode
                          with AVX-2, fix some weird scaling edge conditions with
                          point sample mode.
      2.10 (2024-07-27) fix the defines GCC and mingw for loop unroll control,
                          fix MSVC 32-bit arm half float routines.
      2.09 (2024-06-19) fix the defines for 32-bit ARM GCC builds (was selecting
                          hardware half floats).
      2.08 (2024-06-10) fix for RGB->BGR three channel flips and add SIMD (thanks
                          to Ryan Salsbury), fix for sub-rect resizes, use the
                          pragmas to control unrolling when they are available.
      2.07 (2024-05-24) fix for slow final split during threaded conversions of very
                          wide scanlines when downsampling (caused by extra input
                          converting), fix for wide scanline resamples with many
                          splits (int overflow), fix GCC warning.
      2.06 (2024-02-10) fix for identical width/height 3x or more down-scaling
                          undersampling a single row on rare resize ratios (about 1%).
      2.05 (2024-02-07) fix for 2 pixel to 1 pixel resizes with wrap (thanks Aras),
                          fix for output callback (thanks Julien Koenen).
      2.04 (2023-11-17) fix for rare AVX bug, shadowed symbol (thanks Nikola Smiljanic).
      2.03 (2023-11-01) ASAN and TSAN warnings fixed, minor tweaks.
      2.00 (2023-10-10) mostly new source: new api, optimizations, simd, vertical-first, etc
                          2x-5x faster without simd, 4x-12x faster with simd,
                          in some cases, 20x to 40x faster esp resizing large to very small.
      0.96 (2019-03-04) fixed warnings
      0.95 (2017-07-23) fixed warnings
      0.94 (2017-03-18) fixed warnings
      0.93 (2017-03-03) fixed bug with certain combinations of heights
      0.92 (2017-01-02) fix integer overflow on large (>2GB) images
      0.91 (2016-04-02) fix warnings; fix handling of subpixel regions
      0.90 (2014-09-17) first released version

   LICENSE
     See end of file for license information.
*/

#if !defined(STB_IMAGE_RESIZE_DO_HORIZONTALS) && !defined(STB_IMAGE_RESIZE_DO_VERTICALS) && !defined(STB_IMAGE_RESIZE_DO_CODERS)   // for internal re-includes

#ifndef STBIR_INCLUDE_STB_IMAGE_RESIZE2_H
#define STBIR_INCLUDE_STB_IMAGE_RESIZE2_H

#include <stddef.h>
#ifdef _MSC_VER
typedef unsigned char    stbir_uint8;
typedef unsigned short   stbir_uint16;
typedef unsigned int     stbir_uint32;
typedef unsigned __int64 stbir_uint64;
#else
#include <stdint.h>
typedef uint8_t  stbir_uint8;
typedef uint16_t stbir_uint16;
typedef uint32_t stbir_uint32;
typedef uint64_t stbir_uint64;
#endif

#ifdef _M_IX86_FP
#if ( _M_IX86_FP >= 1 )
#ifndef STBIR_SSE
#define STBIR_SSE
#endif
#endif
#endif

#if defined(_x86_64) || defined( __x86_64__ ) || defined( _M_X64 ) || defined(__x86_64) || defined(_M_AMD64) || defined(__SSE2__) || defined(STBIR_SSE) || defined(STBIR_SSE2)
  #ifndef STBIR_SSE2
    #define STBIR_SSE2
  #endif
  #if defined(__AVX__) || defined(STBIR_AVX2)
    #ifndef STBIR_AVX
      #ifndef STBIR_NO_AVX
        #define STBIR_AVX
      #endif
    #endif
  #endif
  #if defined(__AVX2__) || defined(STBIR_AVX2)
    #ifndef STBIR_NO_AVX2
      #ifndef STBIR_AVX2
        #define STBIR_AVX2
      #endif
      #if defined( _MSC_VER ) && !defined(__clang__)
        #ifndef STBIR_FP16C  // FP16C instructions are on all AVX2 cpus, so we can autoselect it here on microsoft - clang needs -mf16c
          #define STBIR_FP16C
        #endif
      #endif
    #endif
  #endif
  #ifdef __F16C__
    #ifndef STBIR_FP16C  // turn on FP16C instructions if the define is set (for clang and gcc)
      #define STBIR_FP16C
    #endif
  #endif
#endif

#if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || ((__ARM_NEON_FP & 4) != 0) || defined(__ARM_NEON__)
#ifndef STBIR_NEON
#define STBIR_NEON
#endif
#endif

#if defined(_M_ARM) || defined(__arm__)
#ifdef STBIR_USE_FMA
#undef STBIR_USE_FMA // no FMA for 32-bit arm on MSVC
#endif
#endif

#if defined(__wasm__) && defined(__wasm_simd128__)
#ifndef STBIR_WASM
#define STBIR_WASM
#endif
#endif

#ifndef STBIRDEF
#ifdef STB_IMAGE_RESIZE_STATIC
#define STBIRDEF static
#else
#ifdef __cplusplus
#define STBIRDEF extern "C"
#else
#define STBIRDEF extern
#endif
#endif
#endif

//////////////////////////////////////////////////////////////////////////////
////   start "header file" ///////////////////////////////////////////////////
//
// Easy-to-use API:
//
//     * stride is the offset between successive rows of image data
//        in memory, in bytes. specify 0 for packed contiguously in memory
//     * colorspace is linear or sRGB as specified by function name
//     * Uses the default filters
//     * Uses edge mode clamped
//     * returned result is the output pixel pointer on success, or 0 in case of an error.


// stbir_pixel_layout specifies:
//   number of channels
//   order of channels
//   whether color is premultiplied by alpha
// for back compatibility, you can cast the old channel count to an stbir_pixel_layout
typedef enum
{
  STBIR_1CHANNEL = 1,
  STBIR_2CHANNEL = 2,
  STBIR_RGB      = 3,               // 3-chan, with order specified (for channel flipping)
  STBIR_BGR      = 0,               // 3-chan, with order specified (for channel flipping)
  STBIR_4CHANNEL = 5,

  STBIR_RGBA = 4,                   // alpha formats, where alpha is NOT premultiplied into color channels
  STBIR_BGRA = 6,
  STBIR_ARGB = 7,
  STBIR_ABGR = 8,
  STBIR_RA   = 9,
  STBIR_AR   = 10,

  STBIR_RGBA_PM = 11,               // alpha formats, where alpha is premultiplied into color channels
  STBIR_BGRA_PM = 12,
  STBIR_ARGB_PM = 13,
  STBIR_ABGR_PM = 14,
  STBIR_RA_PM   = 15,
  STBIR_AR_PM   = 16,

  STBIR_RGBA_NO_AW = 11,            // alpha formats, where NO alpha weighting is applied at all!
  STBIR_BGRA_NO_AW = 12,            //   these are just synonyms for the _PM flags (which also do
  STBIR_ARGB_NO_AW = 13,            //   no alpha weighting). These names just make it more clear
  STBIR_ABGR_NO_AW = 14,            //   for some folks).
  STBIR_RA_NO_AW   = 15,
  STBIR_AR_NO_AW   = 16,

} stbir_pixel_layout;

//===============================================================
//  Simple-complexity API
//
//    If output_pixels is NULL (0), then we will allocate the buffer and return it to you.
//--------------------------------

STBIRDEF unsigned char * stbir_resize_uint8_srgb( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
                                                        unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
                                                        stbir_pixel_layout pixel_type );

STBIRDEF unsigned char * stbir_resize_uint8_linear( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
                                                          unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
                                                          stbir_pixel_layout pixel_type );

STBIRDEF float * stbir_resize_float_linear( const float *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
                                                  float *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
                                                  stbir_pixel_layout pixel_type );
//===============================================================

//===============================================================
// Medium-complexity API
//
// This extends the easy-to-use API as follows:
//
//     * Can specify the datatype - U8, U8_SRGB, U16, FLOAT, HALF_FLOAT
//     * Edge wrap can be selected explicitly
//     * Filter can be selected explicitly
//--------------------------------

typedef enum
{
  STBIR_EDGE_CLAMP   = 0,
  STBIR_EDGE_REFLECT = 1,
  STBIR_EDGE_WRAP    = 2,  // this edge mode is slower and uses more memory
  STBIR_EDGE_ZERO    = 3,
} stbir_edge;

typedef enum
{
  STBIR_FILTER_DEFAULT      = 0,  // use same filter type that easy-to-use API chooses
  STBIR_FILTER_BOX          = 1,  // A trapezoid w/1-pixel wide ramps, same result as box for integer scale ratios
  STBIR_FILTER_TRIANGLE     = 2,  // On upsampling, produces same results as bilinear texture filtering
  STBIR_FILTER_CUBICBSPLINE = 3,  // The cubic b-spline (aka Mitchell-Netravali with B=1,C=0), gaussian-esque
  STBIR_FILTER_CATMULLROM   = 4,  // An interpolating cubic spline
  STBIR_FILTER_MITCHELL     = 5,  // Mitchell-Netravali filter with B=1/3, C=1/3
  STBIR_FILTER_POINT_SAMPLE = 6,  // Simple point sampling
  STBIR_FILTER_OTHER        = 7,  // User callback specified
} stbir_filter;

typedef enum
{
  STBIR_TYPE_UINT8            = 0,
  STBIR_TYPE_UINT8_SRGB       = 1,
  STBIR_TYPE_UINT8_SRGB_ALPHA = 2,  // alpha channel, when present, should also be SRGB (this is very unusual)
  STBIR_TYPE_UINT16           = 3,
  STBIR_TYPE_FLOAT            = 4,
  STBIR_TYPE_HALF_FLOAT       = 5
} stbir_datatype;

// medium api
STBIRDEF void *  stbir_resize( const void *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
                                     void *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
                               stbir_pixel_layout pixel_layout, stbir_datatype data_type,
                               stbir_edge edge, stbir_filter filter );
//===============================================================



//===============================================================
// Extended-complexity API
//
// This API exposes all resize functionality.
//
//     * Separate filter types for each axis
//     * Separate edge modes for each axis
//     * Separate input and output data types
//     * Can specify regions with subpixel correctness
//     * Can specify alpha flags
//     * Can specify a memory callback
//     * Can specify a callback data type for pixel input and output
//     * Can be threaded for a single resize
//     * Can be used to resize many frames without recalculating the sampler info
//
//  Use this API as follows:
//     1) Call the stbir_resize_init function on a local STBIR_RESIZE structure
//     2) Call any of the stbir_set functions
//     3) Optionally call stbir_build_samplers() if you are going to resample multiple times
//        with the same input and output dimensions (like resizing video frames)
//     4) Resample by calling stbir_resize_extended().
//     5) Call stbir_free_samplers() if you called stbir_build_samplers()
//--------------------------------


// Types:

// INPUT CALLBACK: this callback is used for input scanlines
typedef void const * stbir_input_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context );

// OUTPUT CALLBACK: this callback is used for output scanlines
typedef void stbir_output_callback( void const * output_ptr, int num_pixels, int y, void * context );

// callbacks for user installed filters
typedef float stbir__kernel_callback( float x, float scale, void * user_data ); // centered at zero
typedef float stbir__support_callback( float scale, void * user_data );

// internal structure with precomputed scaling
typedef struct stbir__info stbir__info;

typedef struct STBIR_RESIZE  // use the stbir_resize_init and stbir_override functions to set these values for future compatibility
{
  void * user_data;
  void const * input_pixels;
  int input_w, input_h;
  double input_s0, input_t0, input_s1, input_t1;
  stbir_input_callback * input_cb;
  void * output_pixels;
  int output_w, output_h;
  int output_subx, output_suby, output_subw, output_subh;
  stbir_output_callback * output_cb;
  int input_stride_in_bytes;
  int output_stride_in_bytes;
  int splits;
  int fast_alpha;
  int needs_rebuild;
  int called_alloc;
  stbir_pixel_layout input_pixel_layout_public;
  stbir_pixel_layout output_pixel_layout_public;
  stbir_datatype input_data_type;
  stbir_datatype output_data_type;
  stbir_filter horizontal_filter, vertical_filter;
  stbir_edge horizontal_edge, vertical_edge;
  stbir__kernel_callback * horizontal_filter_kernel; stbir__support_callback * horizontal_filter_support;
  stbir__kernel_callback * vertical_filter_kernel; stbir__support_callback * vertical_filter_support;
  stbir__info * samplers;
} STBIR_RESIZE;

// extended complexity api


// First off, you must ALWAYS call stbir_resize_init on your resize structure before any of the other calls!
STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize,
                                 const void *input_pixels,  int input_w,  int input_h, int input_stride_in_bytes, // stride can be zero
                                       void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, // stride can be zero
                                 stbir_pixel_layout pixel_layout, stbir_datatype data_type );

//===============================================================
// You can update these parameters any time after resize_init and there is no cost
//--------------------------------

STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type );
STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb );   // no callbacks by default
644                                 stbir_pixel_layout pixel_layout, stbir_datatype data_type );
645
646//===============================================================
647// You can update these parameters any time after resize_init and there is no cost
648//--------------------------------
649
650STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type );
651STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb );   // no callbacks by default
652STBIRDEF void stbir_set_user_data( STBIR_RESIZE * resize, void * user_data );                                               // pass back STBIR_RESIZE* by default
653STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_pixels, int input_stride_in_bytes, void * output_pixels, int output_stride_in_bytes );
654
655//===============================================================
656
657
658//===============================================================
659// If you call any of these functions, you will trigger a sampler rebuild!
660//--------------------------------
661
662STBIRDEF int stbir_set_pixel_layouts( STBIR_RESIZE * resize, stbir_pixel_layout input_pixel_layout, stbir_pixel_layout output_pixel_layout );  // sets new buffer layouts
663STBIRDEF int stbir_set_edgemodes( STBIR_RESIZE * resize, stbir_edge horizontal_edge, stbir_edge vertical_edge );       // CLAMP by default
664
665STBIRDEF int stbir_set_filters( STBIR_RESIZE * resize, stbir_filter horizontal_filter, stbir_filter vertical_filter ); // STBIR_DEFAULT_FILTER_UPSAMPLE/DOWNSAMPLE by default
666STBIRDEF int stbir_set_filter_callbacks( STBIR_RESIZE * resize, stbir__kernel_callback * horizontal_filter, stbir__support_callback * horizontal_support, stbir__kernel_callback * vertical_filter, stbir__support_callback * vertical_support );
667
668STBIRDEF int stbir_set_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh );        // sets both sub-regions (full regions by default)
669STBIRDEF int stbir_set_input_subrect( STBIR_RESIZE * resize, double s0, double t0, double s1, double t1 );    // sets input sub-region (full region by default)
670STBIRDEF int stbir_set_output_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh ); // sets output sub-region (full region by default)
671
672// when inputting AND outputting non-premultiplied alpha pixels, we use a slower but higher quality technique
673//   that fills the zero alpha pixel's RGB values with something plausible.  If you don't care about areas of
674//   zero alpha, you can call this function to get about a 25% speed improvement for STBIR_RGBA to STBIR_RGBA
675//   types of resizes.
676STBIRDEF int stbir_set_non_pm_alpha_speed_over_quality( STBIR_RESIZE * resize, int non_pma_alpha_speed_over_quality );
677//===============================================================
678
679
680//===============================================================
681// You can call build_samplers to prebuild all the internal data we need to resample.
682//   Then, if you call resize_extended many times with the same resize, you only pay the
683//   cost once.
684// If you do call build_samplers, you MUST call free_samplers eventually.
685//--------------------------------
686
687// This builds the samplers and does one allocation
688STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize );
689
690// You MUST call this, if you call stbir_build_samplers or stbir_build_samplers_with_splits
691STBIRDEF void stbir_free_samplers( STBIR_RESIZE * resize );
692//===============================================================
693
694
695// And this is the main function to perform the resize synchronously on one thread.
696STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize );
697
698
//===============================================================
// Use these functions for multithreading.
//   1) You call stbir_build_samplers_with_splits first on the main thread
//   2) Then stbir_resize_extended_split on each thread
//   3) stbir_free_samplers when done on the main thread
//--------------------------------

// This will build samplers for threading.
//   You can pass in the number of threads you'd like to use (try_splits).
//   It returns the number of splits (threads) that you can call it with.
//   It might be less if the image resize can't be split up that many ways.

STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int try_splits );

// This function does a split of the resizing (you call this function for each
// split, on multiple threads). A split is a piece of the output resize pixel space.

// Note that you MUST call stbir_build_samplers_with_splits before stbir_resize_extended_split!

// Usually, you will always call stbir_resize_extended_split with split_start as the thread_index
//   and "1" for the split_count.
// But, if you have a weird situation where you MIGHT want 8 threads, but sometimes
//   only 4 threads, you can use 0,2,4,6 for the split_start's and use "2" for the
//   split_count each time to turn it into a 4 thread resize. (This is unusual).

STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start, int split_count );
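//
// A sketch of the three threading steps, with a plain loop standing in for your own
//   thread pool (each iteration is the work that one worker thread would run). The
//   resize structure is assumed to be initialized with stbir_resize_init already,
//   wanted_threads is a hypothetical name, and a zero return from the build call is
//   assumed to mean failure.
//
//        int i;
//        int splits = stbir_build_samplers_with_splits( &resize, wanted_threads ); // main thread
//
//        if ( splits )
//        {
//           // the returned count can be lower than wanted_threads, so loop on splits
//           for ( i = 0 ; i < splits ; i++ )
//              stbir_resize_extended_split( &resize, i, 1 );  // one call per worker thread
//
//           // wait for every worker to finish, then on the main thread:
//           stbir_free_samplers( &resize );
//        }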
725//===============================================================
726
727
728//===============================================================
// Pixel Callbacks info:
//--------------------------------

//   The input callback is super flexible - it calls you with the input address
//   (based on the stride and base pointer), it gives you an optional_output
//   pointer that you can fill, or you can just return your own pointer into
//   your own data.
//
//   You can also do conversion from non-supported data types if necessary - in
//   this case, you ignore the input_ptr and just use the x and y parameters to
//   calculate your own input_ptr based on the size of each non-supported pixel.
//   (Something like the third example below.)
//
//   You can also install just an input or just an output callback by setting the
//   callback that you don't want to zero.
//
//     First example, progress (a callback that lets you monitor the progress):
//        void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context )
//        {
//           percentage_done = ( 100 * y ) / input_height;  // multiply first, so integer division doesn't round to zero
//           return input_ptr;  // use buffer from call
//        }
//
//     Next example, copying (copy from some other buffer or stream):
//        void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context )
//        {
//           CopyOrStreamData( optional_output, other_data_src, num_pixels * pixel_width_in_bytes );
//           return optional_output;  // return the optional buffer that we filled
//        }
//
//     Third example, input another buffer without copying (zero-copy from other buffer):
//        void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context )
//        {
//           void * pixels = ( (char*) other_image_base ) + ( y * other_image_stride ) + ( x * other_pixel_width_in_bytes );
//           return pixels;       // return pointer to your data without copying
//        }
//
//
//   The output callback is considerably simpler - it just calls you so that you can dump
//   out each scanline. You could even directly copy out to disk if you have a simple format
//   like TGA or BMP. You can also convert to other output types here if you want.
//
//   Simple example (the signature matches the stbir_output_callback typedef, which returns void):
//        void my_output( void const * output_ptr, int num_pixels, int y, void * context )
//        {
//           percentage_done = ( 100 * y ) / output_height;
//           fwrite( output_ptr, pixel_width_in_bytes, num_pixels, output_file );
//        }
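//
//   Installing the callbacks (a sketch only: my_callback and my_output are the examples
//   above, my_context is a hypothetical pointer, and it is assumed that the user data
//   pointer is what arrives in each callback's context argument):
//        stbir_set_pixel_callbacks( &resize, my_callback, my_output );  // pass 0 to skip either one
//        stbir_set_user_data( &resize, my_context );
//        stbir_resize_extended( &resize );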
777//===============================================================
778
779
780
781
782//===============================================================
783// optional built-in profiling API
784//--------------------------------
785
786#ifdef STBIR_PROFILE
787
788typedef struct STBIR_PROFILE_INFO
789{
790  stbir_uint64 total_clocks;
791
  // how many clocks spent (of total_clocks) in the various resize routines, along with a string description
  //    there are "count" number of zones
794  stbir_uint64 clocks[ 8 ];
795  char const ** descriptions;
796
797  // count of clocks and descriptions
798  stbir_uint32 count;
799} STBIR_PROFILE_INFO;
800
801// use after calling stbir_resize_extended (or stbir_build_samplers or stbir_build_samplers_with_splits)
802STBIRDEF void stbir_resize_build_profile_info( STBIR_PROFILE_INFO * out_info, STBIR_RESIZE const * resize );
803
804// use after calling stbir_resize_extended
805STBIRDEF void stbir_resize_extended_profile_info( STBIR_PROFILE_INFO * out_info, STBIR_RESIZE const * resize );
806
807// use after calling stbir_resize_extended_split
808STBIRDEF void stbir_resize_split_profile_info( STBIR_PROFILE_INFO * out_info, STBIR_RESIZE const * resize, int split_start, int split_num );
809
810//===============================================================
811
812#endif
813
814
815////   end header file   /////////////////////////////////////////////////////
816#endif // STBIR_INCLUDE_STB_IMAGE_RESIZE2_H
817
818#if defined(STB_IMAGE_RESIZE_IMPLEMENTATION) || defined(STB_IMAGE_RESIZE2_IMPLEMENTATION)
819
820#ifndef STBIR_ASSERT
821#include <assert.h>
822#define STBIR_ASSERT(x) assert(x)
823#endif
824
825#ifndef STBIR_MALLOC
826#include <stdlib.h>
827#define STBIR_MALLOC(size,user_data) ((void)(user_data), malloc(size))
828#define STBIR_FREE(ptr,user_data)    ((void)(user_data), free(ptr))
829// (we used the comma operator to evaluate user_data, to avoid "unused parameter" warnings)
830#endif
831
832#ifdef _MSC_VER
833
834#define stbir__inline __forceinline
835
836#else
837
838#define stbir__inline __inline__
839
840// Clang address sanitizer
841#if defined(__has_feature)
842  #if __has_feature(address_sanitizer) || __has_feature(memory_sanitizer)
843    #ifndef STBIR__SEPARATE_ALLOCATIONS
844      #define STBIR__SEPARATE_ALLOCATIONS
845    #endif
846  #endif
847#endif
848
849#endif
850
851// GCC and MSVC
852#if defined(__SANITIZE_ADDRESS__)
853  #ifndef STBIR__SEPARATE_ALLOCATIONS
854    #define STBIR__SEPARATE_ALLOCATIONS
855  #endif
856#endif
857
858// Always turn off automatic FMA use - use STBIR_USE_FMA if you want.
859// Otherwise, this is a determinism disaster.
860#ifndef STBIR_DONT_CHANGE_FP_CONTRACT  // override in case you don't want this behavior
861#if defined(_MSC_VER) && !defined(__clang__)
862#if _MSC_VER > 1200
863#pragma fp_contract(off)
864#endif
865#elif defined(__GNUC__) &&  !defined(__clang__)
866#pragma GCC optimize("fp-contract=off")
867#else
868#pragma STDC FP_CONTRACT OFF
869#endif
870#endif
871
872#ifdef _MSC_VER
873#define STBIR__UNUSED(v)  (void)(v)
874#else
875#define STBIR__UNUSED(v)  (void)sizeof(v)
876#endif
877
878#define STBIR__ARRAY_SIZE(a) (sizeof((a))/sizeof((a)[0]))
879
880
881#ifndef STBIR_DEFAULT_FILTER_UPSAMPLE
882#define STBIR_DEFAULT_FILTER_UPSAMPLE    STBIR_FILTER_CATMULLROM
883#endif
884
885#ifndef STBIR_DEFAULT_FILTER_DOWNSAMPLE
886#define STBIR_DEFAULT_FILTER_DOWNSAMPLE  STBIR_FILTER_MITCHELL
887#endif
888
889
890#ifndef STBIR__HEADER_FILENAME
891#define STBIR__HEADER_FILENAME "stb_image_resize2.h"
892#endif
893
894// the internal pixel layout enums are in a different order, so we can easily do range comparisons of types
895//   the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible
896typedef enum
897{
898  STBIRI_1CHANNEL = 0,
899  STBIRI_2CHANNEL = 1,
900  STBIRI_RGB      = 2,
901  STBIRI_BGR      = 3,
902  STBIRI_4CHANNEL = 4,
903
904  STBIRI_RGBA = 5,
905  STBIRI_BGRA = 6,
906  STBIRI_ARGB = 7,
907  STBIRI_ABGR = 8,
908  STBIRI_RA   = 9,
909  STBIRI_AR   = 10,
910
911  STBIRI_RGBA_PM = 11,
912  STBIRI_BGRA_PM = 12,
913  STBIRI_ARGB_PM = 13,
914  STBIRI_ABGR_PM = 14,
915  STBIRI_RA_PM   = 15,
916  STBIRI_AR_PM   = 16,
917} stbir_internal_pixel_layout;
918
919// define the public pixel layouts to not compile inside the implementation (to avoid accidental use)
920#define STBIR_BGR bad_dont_use_in_implementation
921#define STBIR_1CHANNEL STBIR_BGR
922#define STBIR_2CHANNEL STBIR_BGR
923#define STBIR_RGB STBIR_BGR
924#define STBIR_RGBA STBIR_BGR
925#define STBIR_4CHANNEL STBIR_BGR
926#define STBIR_BGRA STBIR_BGR
927#define STBIR_ARGB STBIR_BGR
928#define STBIR_ABGR STBIR_BGR
929#define STBIR_RA STBIR_BGR
930#define STBIR_AR STBIR_BGR
931#define STBIR_RGBA_PM STBIR_BGR
932#define STBIR_BGRA_PM STBIR_BGR
933#define STBIR_ARGB_PM STBIR_BGR
934#define STBIR_ABGR_PM STBIR_BGR
935#define STBIR_RA_PM STBIR_BGR
936#define STBIR_AR_PM STBIR_BGR
937
938// must match stbir_datatype
939static unsigned char stbir__type_size[] = {
940  1,1,1,2,4,2 // STBIR_TYPE_UINT8,STBIR_TYPE_UINT8_SRGB,STBIR_TYPE_UINT8_SRGB_ALPHA,STBIR_TYPE_UINT16,STBIR_TYPE_FLOAT,STBIR_TYPE_HALF_FLOAT
941};
942
943// When gathering, the contributors are which source pixels contribute.
944// When scattering, the contributors are which destination pixels are contributed to.
945typedef struct
946{
947  int n0; // First contributing pixel
948  int n1; // Last contributing pixel
949} stbir__contributors;
950
951typedef struct
952{
953  int lowest;    // First sample index for whole filter
954  int highest;   // Last sample index for whole filter
955  int widest;    // widest single set of samples for an output
956} stbir__filter_extent_info;
957
958typedef struct
959{
960  int n0; // First pixel of decode buffer to write to
961  int n1; // Last pixel of decode that will be written to
962  int pixel_offset_for_input;  // Pixel offset into input_scanline
963} stbir__span;
964
965typedef struct stbir__scale_info
966{
967  int input_full_size;
968  int output_sub_size;
969  float scale;
970  float inv_scale;
971  float pixel_shift; // starting shift in output pixel space (in pixels)
972  int scale_is_rational;
973  stbir_uint32 scale_numerator, scale_denominator;
974} stbir__scale_info;
975
976typedef struct
977{
978  stbir__contributors * contributors;
979  float* coefficients;
980  stbir__contributors * gather_prescatter_contributors;
981  float * gather_prescatter_coefficients;
982  stbir__scale_info scale_info;
983  float support;
984  stbir_filter filter_enum;
985  stbir__kernel_callback * filter_kernel;
986  stbir__support_callback * filter_support;
987  stbir_edge edge;
988  int coefficient_width;
989  int filter_pixel_width;
990  int filter_pixel_margin;
991  int num_contributors;
992  int contributors_size;
993  int coefficients_size;
994  stbir__filter_extent_info extent_info;
995  int is_gather;  // 0 = scatter, 1 = gather with scale >= 1, 2 = gather with scale < 1
996  int gather_prescatter_num_contributors;
997  int gather_prescatter_coefficient_width;
998  int gather_prescatter_contributors_size;
999  int gather_prescatter_coefficients_size;
1000} stbir__sampler;
1001
1002typedef struct
1003{
1004  stbir__contributors conservative;
1005  int edge_sizes[2];    // this can be less than filter_pixel_margin, if the filter and scaling falls off
  stbir__span spans[2]; // can be two spans, if doing input subrect with edge mode WRAP
1007} stbir__extents;
1008
1009typedef struct
1010{
1011#ifdef STBIR_PROFILE
1012  union
1013  {
1014    struct { stbir_uint64 total, looping, vertical, horizontal, decode, encode, alpha, unalpha; } named;
1015    stbir_uint64 array[8];
1016  } profile;
1017  stbir_uint64 * current_zone_excluded_ptr;
1018#endif
1019  float* decode_buffer;
1020
1021  int ring_buffer_first_scanline;
1022  int ring_buffer_last_scanline;
1023  int ring_buffer_begin_index;    // first_scanline is at this index in the ring buffer
1024  int start_output_y, end_output_y;
1025  int start_input_y, end_input_y;  // used in scatter only
1026
1027  #ifdef STBIR__SEPARATE_ALLOCATIONS
1028    float** ring_buffers; // one pointer for each ring buffer
1029  #else
1030    float* ring_buffer;  // one big buffer that we index into
1031  #endif
1032
1033  float* vertical_buffer;
1034
1035  char no_cache_straddle[64];
1036} stbir__per_split_info;
1037
1038typedef void stbir__decode_pixels_func( float * decode, int width_times_channels, void const * input );
1039typedef void stbir__alpha_weight_func( float * decode_buffer, int width_times_channels );
1040typedef void stbir__horizontal_gather_channels_func( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer,
1041  stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width );
1042typedef void stbir__alpha_unweight_func(float * encode_buffer, int width_times_channels );
1043typedef void stbir__encode_pixels_func( void * output, int width_times_channels, float const * encode );
1044
1045struct stbir__info
1046{
1047#ifdef STBIR_PROFILE
1048  union
1049  {
1050    struct { stbir_uint64 total, build, alloc, horizontal, vertical, cleanup, pivot; } named;
1051    stbir_uint64 array[7];
1052  } profile;
1053  stbir_uint64 * current_zone_excluded_ptr;
1054#endif
1055  stbir__sampler horizontal;
1056  stbir__sampler vertical;
1057
1058  void const * input_data;
1059  void * output_data;
1060
1061  int input_stride_bytes;
1062  int output_stride_bytes;
1063  int ring_buffer_length_bytes;   // The length of an individual entry in the ring buffer. The total number of ring buffers is stbir__get_filter_pixel_width(filter)
1064  int ring_buffer_num_entries;    // Total number of entries in the ring buffer.
1065
1066  stbir_datatype input_type;
1067  stbir_datatype output_type;
1068
1069  stbir_input_callback * in_pixels_cb;
1070  void * user_data;
1071  stbir_output_callback * out_pixels_cb;
1072
1073  stbir__extents scanline_extents;
1074
1075  void * alloced_mem;
1076  stbir__per_split_info * split_info;  // by default 1, but there will be N of these allocated based on the thread init you did
1077
1078  stbir__decode_pixels_func * decode_pixels;
1079  stbir__alpha_weight_func * alpha_weight;
1080  stbir__horizontal_gather_channels_func * horizontal_gather_channels;
1081  stbir__alpha_unweight_func * alpha_unweight;
1082  stbir__encode_pixels_func * encode_pixels;
1083
1084  int alloc_ring_buffer_num_entries;    // Number of entries in the ring buffer that will be allocated
1085  int splits; // count of splits
1086
1087  stbir_internal_pixel_layout input_pixel_layout_internal;
1088  stbir_internal_pixel_layout output_pixel_layout_internal;
1089
1090  int input_color_and_type;
1091  int offset_x, offset_y; // offset within output_data
1092  int vertical_first;
1093  int channels;
1094  int effective_channels; // same as channels, except on RGBA/ARGB (7), or XA/AX (3)
1095  size_t alloced_total;
1096};
1097
1098
1099#define stbir__max_uint8_as_float             255.0f
1100#define stbir__max_uint16_as_float            65535.0f
1101#define stbir__max_uint8_as_float_inverted    (1.0f/255.0f)
1102#define stbir__max_uint16_as_float_inverted   (1.0f/65535.0f)
1103#define stbir__small_float ((float)1 / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20))
1104
1105// min/max friendly
1106#define STBIR_CLAMP(x, xmin, xmax) for(;;) { \
1107  if ( (x) < (xmin) ) (x) = (xmin);     \
1108  if ( (x) > (xmax) ) (x) = (xmax);     \
1109  break;                                \
1110}
1111
1112static stbir__inline int stbir__min(int a, int b)
1113{
1114  return a < b ? a : b;
1115}
1116
1117static stbir__inline int stbir__max(int a, int b)
1118{
1119  return a > b ? a : b;
1120}
1121
1122static float stbir__srgb_uchar_to_linear_float[256] = {
1123  0.000000f, 0.000304f, 0.000607f, 0.000911f, 0.001214f, 0.001518f, 0.001821f, 0.002125f, 0.002428f, 0.002732f, 0.003035f,
1124  0.003347f, 0.003677f, 0.004025f, 0.004391f, 0.004777f, 0.005182f, 0.005605f, 0.006049f, 0.006512f, 0.006995f, 0.007499f,
1125  0.008023f, 0.008568f, 0.009134f, 0.009721f, 0.010330f, 0.010960f, 0.011612f, 0.012286f, 0.012983f, 0.013702f, 0.014444f,
1126  0.015209f, 0.015996f, 0.016807f, 0.017642f, 0.018500f, 0.019382f, 0.020289f, 0.021219f, 0.022174f, 0.023153f, 0.024158f,
1127  0.025187f, 0.026241f, 0.027321f, 0.028426f, 0.029557f, 0.030713f, 0.031896f, 0.033105f, 0.034340f, 0.035601f, 0.036889f,
1128  0.038204f, 0.039546f, 0.040915f, 0.042311f, 0.043735f, 0.045186f, 0.046665f, 0.048172f, 0.049707f, 0.051269f, 0.052861f,
1129  0.054480f, 0.056128f, 0.057805f, 0.059511f, 0.061246f, 0.063010f, 0.064803f, 0.066626f, 0.068478f, 0.070360f, 0.072272f,
1130  0.074214f, 0.076185f, 0.078187f, 0.080220f, 0.082283f, 0.084376f, 0.086500f, 0.088656f, 0.090842f, 0.093059f, 0.095307f,
1131  0.097587f, 0.099899f, 0.102242f, 0.104616f, 0.107023f, 0.109462f, 0.111932f, 0.114435f, 0.116971f, 0.119538f, 0.122139f,
1132  0.124772f, 0.127438f, 0.130136f, 0.132868f, 0.135633f, 0.138432f, 0.141263f, 0.144128f, 0.147027f, 0.149960f, 0.152926f,
1133  0.155926f, 0.158961f, 0.162029f, 0.165132f, 0.168269f, 0.171441f, 0.174647f, 0.177888f, 0.181164f, 0.184475f, 0.187821f,
1134  0.191202f, 0.194618f, 0.198069f, 0.201556f, 0.205079f, 0.208637f, 0.212231f, 0.215861f, 0.219526f, 0.223228f, 0.226966f,
1135  0.230740f, 0.234551f, 0.238398f, 0.242281f, 0.246201f, 0.250158f, 0.254152f, 0.258183f, 0.262251f, 0.266356f, 0.270498f,
1136  0.274677f, 0.278894f, 0.283149f, 0.287441f, 0.291771f, 0.296138f, 0.300544f, 0.304987f, 0.309469f, 0.313989f, 0.318547f,
1137  0.323143f, 0.327778f, 0.332452f, 0.337164f, 0.341914f, 0.346704f, 0.351533f, 0.356400f, 0.361307f, 0.366253f, 0.371238f,
1138  0.376262f, 0.381326f, 0.386430f, 0.391573f, 0.396755f, 0.401978f, 0.407240f, 0.412543f, 0.417885f, 0.423268f, 0.428691f,
1139  0.434154f, 0.439657f, 0.445201f, 0.450786f, 0.456411f, 0.462077f, 0.467784f, 0.473532f, 0.479320f, 0.485150f, 0.491021f,
1140  0.496933f, 0.502887f, 0.508881f, 0.514918f, 0.520996f, 0.527115f, 0.533276f, 0.539480f, 0.545725f, 0.552011f, 0.558340f,
1141  0.564712f, 0.571125f, 0.577581f, 0.584078f, 0.590619f, 0.597202f, 0.603827f, 0.610496f, 0.617207f, 0.623960f, 0.630757f,
1142  0.637597f, 0.644480f, 0.651406f, 0.658375f, 0.665387f, 0.672443f, 0.679543f, 0.686685f, 0.693872f, 0.701102f, 0.708376f,
1143  0.715694f, 0.723055f, 0.730461f, 0.737911f, 0.745404f, 0.752942f, 0.760525f, 0.768151f, 0.775822f, 0.783538f, 0.791298f,
1144  0.799103f, 0.806952f, 0.814847f, 0.822786f, 0.830770f, 0.838799f, 0.846873f, 0.854993f, 0.863157f, 0.871367f, 0.879622f,
1145  0.887923f, 0.896269f, 0.904661f, 0.913099f, 0.921582f, 0.930111f, 0.938686f, 0.947307f, 0.955974f, 0.964686f, 0.973445f,
1146  0.982251f, 0.991102f, 1.0f
1147};
1148
1149typedef union
1150{
1151  unsigned int u;
1152  float f;
1153} stbir__FP32;
1154
1155// From https://gist.github.com/rygorous/2203834
1156
1157static const stbir_uint32 fp32_to_srgb8_tab4[104] = {
1158  0x0073000d, 0x007a000d, 0x0080000d, 0x0087000d, 0x008d000d, 0x0094000d, 0x009a000d, 0x00a1000d,
1159  0x00a7001a, 0x00b4001a, 0x00c1001a, 0x00ce001a, 0x00da001a, 0x00e7001a, 0x00f4001a, 0x0101001a,
1160  0x010e0033, 0x01280033, 0x01410033, 0x015b0033, 0x01750033, 0x018f0033, 0x01a80033, 0x01c20033,
1161  0x01dc0067, 0x020f0067, 0x02430067, 0x02760067, 0x02aa0067, 0x02dd0067, 0x03110067, 0x03440067,
1162  0x037800ce, 0x03df00ce, 0x044600ce, 0x04ad00ce, 0x051400ce, 0x057b00c5, 0x05dd00bc, 0x063b00b5,
1163  0x06970158, 0x07420142, 0x07e30130, 0x087b0120, 0x090b0112, 0x09940106, 0x0a1700fc, 0x0a9500f2,
1164  0x0b0f01cb, 0x0bf401ae, 0x0ccb0195, 0x0d950180, 0x0e56016e, 0x0f0d015e, 0x0fbc0150, 0x10630143,
1165  0x11070264, 0x1238023e, 0x1357021d, 0x14660201, 0x156601e9, 0x165a01d3, 0x174401c0, 0x182401af,
1166  0x18fe0331, 0x1a9602fe, 0x1c1502d2, 0x1d7e02ad, 0x1ed4028d, 0x201a0270, 0x21520256, 0x227d0240,
1167  0x239f0443, 0x25c003fe, 0x27bf03c4, 0x29a10392, 0x2b6a0367, 0x2d1d0341, 0x2ebe031f, 0x304d0300,
1168  0x31d105b0, 0x34a80555, 0x37520507, 0x39d504c5, 0x3c37048b, 0x3e7c0458, 0x40a8042a, 0x42bd0401,
1169  0x44c20798, 0x488e071e, 0x4c1c06b6, 0x4f76065d, 0x52a50610, 0x55ac05cc, 0x5892058f, 0x5b590559,
1170  0x5e0c0a23, 0x631c0980, 0x67db08f6, 0x6c55087f, 0x70940818, 0x74a007bd, 0x787d076c, 0x7c330723,
1171};
1172
1173static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
1174{
1175  static const stbir__FP32 almostone = { 0x3f7fffff }; // 1-eps
1176  static const stbir__FP32 minval = { (127-13) << 23 };
1177  stbir_uint32 tab,bias,scale,t;
1178  stbir__FP32 f;
1179
1180  // Clamp to [2^(-13), 1-eps]; these two values map to 0 and 1, respectively.
1181  // The tests are carefully written so that NaNs map to 0, same as in the reference
1182  // implementation.
1183  if (!(in > minval.f)) // written this way to catch NaNs
1184      return 0;
1185  if (in > almostone.f)
1186      return 255;
1187
1188  // Do the table lookup and unpack bias, scale
1189  f.f = in;
1190  tab = fp32_to_srgb8_tab4[(f.u - minval.u) >> 20];
1191  bias = (tab >> 16) << 9;
1192  scale = tab & 0xffff;
1193
1194  // Grab next-highest mantissa bits and perform linear interpolation
1195  t = (f.u >> 12) & 0xff;
1196  return (unsigned char) ((bias + scale*t) >> 16);
1197}
1198
1199#ifndef STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT
1200#define STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT 32 // when downsampling and <= 32 scanlines of buffering, use gather. gather used down to 1/8th scaling for 25% win.
1201#endif
1202
1203#ifndef STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS
1204#define STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS 4 // when threading, what is the minimum number of scanlines for a split?
1205#endif
1206
1207// restrict pointers for the output pointers, other loop and unroll control
1208#if defined( _MSC_VER ) && !defined(__clang__)
1209  #define STBIR_STREAMOUT_PTR( star ) star __restrict
1210  #define STBIR_NO_UNROLL( ptr ) __assume(ptr) // this oddly keeps msvc from unrolling a loop
1211  #if _MSC_VER >= 1900
1212    #define STBIR_NO_UNROLL_LOOP_START __pragma(loop( no_vector )) 
1213  #else
1214    #define STBIR_NO_UNROLL_LOOP_START 
1215  #endif
1216#elif defined( __clang__ )
1217  #define STBIR_STREAMOUT_PTR( star ) star __restrict__
1218  #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr)) 
1219  #if ( __clang_major__ >= 4 ) || ( ( __clang_major__ >= 3 ) && ( __clang_minor__ >= 5 ) )
1220    #define STBIR_NO_UNROLL_LOOP_START _Pragma("clang loop unroll(disable)") _Pragma("clang loop vectorize(disable)")
1221  #else
1222    #define STBIR_NO_UNROLL_LOOP_START
1223  #endif 
1224#elif defined( __GNUC__ )
1225  #define STBIR_STREAMOUT_PTR( star ) star __restrict__
1226  #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr))
1227  #if __GNUC__ >= 14
1228    #define STBIR_NO_UNROLL_LOOP_START _Pragma("GCC unroll 0") _Pragma("GCC novector")
1229  #else
1230    #define STBIR_NO_UNROLL_LOOP_START
1231  #endif
1232  #define STBIR_NO_UNROLL_LOOP_START_INF_FOR
1233#else
1234  #define STBIR_STREAMOUT_PTR( star ) star
1235  #define STBIR_NO_UNROLL( ptr )
1236  #define STBIR_NO_UNROLL_LOOP_START
1237#endif
1238
1239#ifndef STBIR_NO_UNROLL_LOOP_START_INF_FOR
1240#define STBIR_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START
1241#endif
1242
1243#ifdef STBIR_NO_SIMD // force simd off for whatever reason
1244
1245// force simd off overrides everything else, so clear it all
1246
1247#ifdef STBIR_SSE2
1248#undef STBIR_SSE2
1249#endif
1250
1251#ifdef STBIR_AVX
1252#undef STBIR_AVX
1253#endif
1254
1255#ifdef STBIR_NEON
1256#undef STBIR_NEON
1257#endif
1258
1259#ifdef STBIR_AVX2
1260#undef STBIR_AVX2
1261#endif
1262
1263#ifdef STBIR_FP16C
1264#undef STBIR_FP16C
1265#endif
1266
1267#ifdef STBIR_WASM
1268#undef STBIR_WASM
1269#endif
1270
1271#ifdef STBIR_SIMD
1272#undef STBIR_SIMD
1273#endif
1274
1275#else // STBIR_SIMD
1276
1277#ifdef STBIR_SSE2
1278  #include <emmintrin.h>
1279
1280  #define stbir__simdf __m128
1281  #define stbir__simdi __m128i
1282
1283  #define stbir_simdi_castf( reg ) _mm_castps_si128(reg)
1284  #define stbir_simdf_casti( reg ) _mm_castsi128_ps(reg)
1285
1286  #define stbir__simdf_load( reg, ptr ) (reg) = _mm_loadu_ps( (float const*)(ptr) )
1287  #define stbir__simdi_load( reg, ptr ) (reg) = _mm_loadu_si128 ( (stbir__simdi const*)(ptr) )
1288  #define stbir__simdf_load1( out, ptr ) (out) = _mm_load_ss( (float const*)(ptr) )  // top values can be random (not denormal or nan for perf)
1289  #define stbir__simdi_load1( out, ptr ) (out) = _mm_castps_si128( _mm_load_ss( (float const*)(ptr) ))
1290  #define stbir__simdf_load1z( out, ptr ) (out) = _mm_load_ss( (float const*)(ptr) )  // top values must be zero
1291  #define stbir__simdf_frep4( fvar ) _mm_set_ps1( fvar )
1292  #define stbir__simdf_load1frep4( out, fvar ) (out) = _mm_set_ps1( fvar )
1293  #define stbir__simdf_load2( out, ptr ) (out) = _mm_castsi128_ps( _mm_loadl_epi64( (__m128i*)(ptr)) ) // top values can be random (not denormal or nan for perf)
1294  #define stbir__simdf_load2z( out, ptr ) (out) = _mm_castsi128_ps( _mm_loadl_epi64( (__m128i*)(ptr)) ) // top values must be zero
1295  #define stbir__simdf_load2hmerge( out, reg, ptr ) (out) = _mm_castpd_ps(_mm_loadh_pd( _mm_castps_pd(reg), (double*)(ptr) ))
1296
1297  #define stbir__simdf_zeroP() _mm_setzero_ps()
1298  #define stbir__simdf_zero( reg ) (reg) = _mm_setzero_ps()
1299
1300  #define stbir__simdf_store( ptr, reg )  _mm_storeu_ps( (float*)(ptr), reg )
1301  #define stbir__simdf_store1( ptr, reg ) _mm_store_ss( (float*)(ptr), reg )
1302  #define stbir__simdf_store2( ptr, reg ) _mm_storel_epi64( (__m128i*)(ptr), _mm_castps_si128(reg) )
1303  #define stbir__simdf_store2h( ptr, reg ) _mm_storeh_pd( (double*)(ptr), _mm_castps_pd(reg) )
1304
1305  #define stbir__simdi_store( ptr, reg )  _mm_storeu_si128( (__m128i*)(ptr), reg )
1306  #define stbir__simdi_store1( ptr, reg ) _mm_store_ss( (float*)(ptr), _mm_castsi128_ps(reg) )
1307  #define stbir__simdi_store2( ptr, reg ) _mm_storel_epi64( (__m128i*)(ptr), (reg) )
1308
1309  #define stbir__prefetch( ptr ) _mm_prefetch((char*)(ptr), _MM_HINT_T0 )
1310
  #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \
  { \
    stbir__simdi zero = _mm_setzero_si128(); \
    out2 = _mm_unpacklo_epi8( ireg, zero ); \
    out3 = _mm_unpackhi_epi8( ireg, zero ); \
    out0 = _mm_unpacklo_epi16( out2, zero ); \
    out1 = _mm_unpackhi_epi16( out2, zero ); \
    out2 = _mm_unpacklo_epi16( out3, zero ); \
    out3 = _mm_unpackhi_epi16( out3, zero ); \
  }

  #define stbir__simdi_expand_u8_to_1u32(out,ireg) \
  { \
    stbir__simdi zero = _mm_setzero_si128(); \
    out = _mm_unpacklo_epi8( ireg, zero ); \
    out = _mm_unpacklo_epi16( out, zero ); \
  }

  #define stbir__simdi_expand_u16_to_u32(out0,out1,ireg) \
  { \
    stbir__simdi zero = _mm_setzero_si128(); \
    out0 = _mm_unpacklo_epi16( ireg, zero ); \
    out1 = _mm_unpackhi_epi16( ireg, zero ); \
  }
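
  // Illustrative sketch only (not part of stb_image_resize2): the expand macros above widen
  // unsigned lanes by interleaving them with a zero register. On a little-endian target,
  // pairing each byte with a zero byte is a zero-extension to 16 bits, and repeating the step
  // on 16-bit lanes reaches 32 bits. The helper name example_expand_u8_to_u32 is hypothetical,
  // and the block is compiled out so it cannot affect the library.
  #if 0
  static void example_expand_u8_to_u32( unsigned int out[4], unsigned char const in[4] )
  {
    int i;
    for ( i = 0; i < 4; i++ )
    {
      unsigned short w = (unsigned short)( in[i] | ( 0u << 8 ) );  // like unpack_epi8 with zero: the byte, then 0x00
      out[i] = (unsigned int)w | ( 0u << 16 );                     // like unpack_epi16 with zero: the word, then 0x0000
    }
  }
  #endif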

  #define stbir__simdf_convert_float_to_i32( i, f ) (i) = _mm_cvttps_epi32(f)
  #define stbir__simdf_convert_float_to_int( f ) _mm_cvtt_ss2si(f)
  #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)_mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(f,STBIR__CONSTF(STBIR_max_uint8_as_float)),_mm_setzero_ps()))))
  #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)_mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(f,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps()))))

  #define stbir__simdi_to_int( i ) _mm_cvtsi128_si32(i)
  #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = _mm_cvtepi32_ps( ireg )
  #define stbir__simdf_add( out, reg0, reg1 ) (out) = _mm_add_ps( reg0, reg1 )
  #define stbir__simdf_mult( out, reg0, reg1 ) (out) = _mm_mul_ps( reg0, reg1 )
  #define stbir__simdf_mult_mem( out, reg, ptr ) (out) = _mm_mul_ps( reg, _mm_loadu_ps( (float const*)(ptr) ) )
  #define stbir__simdf_mult1_mem( out, reg, ptr ) (out) = _mm_mul_ss( reg, _mm_load_ss( (float const*)(ptr) ) )
  #define stbir__simdf_add_mem( out, reg, ptr ) (out) = _mm_add_ps( reg, _mm_loadu_ps( (float const*)(ptr) ) )
  #define stbir__simdf_add1_mem( out, reg, ptr ) (out) = _mm_add_ss( reg, _mm_load_ss( (float const*)(ptr) ) )

  #ifdef STBIR_USE_FMA           // not on by default to maintain bit identical simd to non-simd
  #include <immintrin.h>
  #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = _mm_fmadd_ps( mul1, mul2, add )
  #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = _mm_fmadd_ss( mul1, mul2, add )
  #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = _mm_fmadd_ps( mul, _mm_loadu_ps( (float const*)(ptr) ), add )
  #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = _mm_fmadd_ss( mul, _mm_load_ss( (float const*)(ptr) ), add )
  #else
  #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = _mm_add_ps( add, _mm_mul_ps( mul1, mul2 ) )
  #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = _mm_add_ss( add, _mm_mul_ss( mul1, mul2 ) )
  #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = _mm_add_ps( add, _mm_mul_ps( mul, _mm_loadu_ps( (float const*)(ptr) ) ) )
  #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = _mm_add_ss( add, _mm_mul_ss( mul, _mm_load_ss( (float const*)(ptr) ) ) )
  #endif

  #define stbir__simdf_add1( out, reg0, reg1 ) (out) = _mm_add_ss( reg0, reg1 )
  #define stbir__simdf_mult1( out, reg0, reg1 ) (out) = _mm_mul_ss( reg0, reg1 )

  #define stbir__simdf_and( out, reg0, reg1 ) (out) = _mm_and_ps( reg0, reg1 )
  #define stbir__simdf_or( out, reg0, reg1 ) (out) = _mm_or_ps( reg0, reg1 )

  #define stbir__simdf_min( out, reg0, reg1 ) (out) = _mm_min_ps( reg0, reg1 )
  #define stbir__simdf_max( out, reg0, reg1 ) (out) = _mm_max_ps( reg0, reg1 )
  #define stbir__simdf_min1( out, reg0, reg1 ) (out) = _mm_min_ss( reg0, reg1 )
  #define stbir__simdf_max1( out, reg0, reg1 ) (out) = _mm_max_ss( reg0, reg1 )

  #define stbir__simdf_0123ABCDto3ABx( out, reg0, reg1 ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_shuffle_ps( reg1,reg0, (0<<0) + (1<<2) + (2<<4) + (3<<6) )), (3<<0) + (0<<2) + (1<<4) + (2<<6) ) )
  #define stbir__simdf_0123ABCDto23Ax( out, reg0, reg1 ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_shuffle_ps( reg1,reg0, (0<<0) + (1<<2) + (2<<4) + (3<<6) )), (2<<0) + (3<<2) + (0<<4) + (1<<6) ) )

  static const stbir__simdf STBIR_zeroones = { 0.0f,1.0f,0.0f,1.0f };
  static const stbir__simdf STBIR_onezeros = { 1.0f,0.0f,1.0f,0.0f };
  #define stbir__simdf_aaa1( out, alp, ones ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_movehl_ps( ones, alp ) ), (1<<0) + (1<<2) + (1<<4) + (2<<6) ) )
  #define stbir__simdf_1aaa( out, alp, ones ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_movelh_ps( ones, alp ) ), (0<<0) + (2<<2) + (2<<4) + (2<<6) ) )
  #define stbir__simdf_a1a1( out, alp, ones) (out) = _mm_or_ps( _mm_castsi128_ps( _mm_srli_epi64( _mm_castps_si128(alp), 32 ) ), STBIR_zeroones )
  #define stbir__simdf_1a1a( out, alp, ones) (out) = _mm_or_ps( _mm_castsi128_ps( _mm_slli_epi64( _mm_castps_si128(alp), 32 ) ), STBIR_onezeros )

  #define stbir__simdf_swiz( reg, one, two, three, four ) _mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( reg ), (one<<0) + (two<<2) + (three<<4) + (four<<6) ) )

  #define stbir__simdi_and( out, reg0, reg1 ) (out) = _mm_and_si128( reg0, reg1 )
  #define stbir__simdi_or( out, reg0, reg1 ) (out) = _mm_or_si128( reg0, reg1 )
  #define stbir__simdi_16madd( out, reg0, reg1 ) (out) = _mm_madd_epi16( reg0, reg1 )

  #define stbir__simdf_pack_to_8bytes(out,aa,bb) \
  { \
    stbir__simdf af,bf; \
    stbir__simdi a,b; \
    af = _mm_min_ps( aa, STBIR_max_uint8_as_float ); \
    bf = _mm_min_ps( bb, STBIR_max_uint8_as_float ); \
    af = _mm_max_ps( af, _mm_setzero_ps() ); \
    bf = _mm_max_ps( bf, _mm_setzero_ps() ); \
    a = _mm_cvttps_epi32( af ); \
    b = _mm_cvttps_epi32( bf ); \
    a = _mm_packs_epi32( a, b ); \
    out = _mm_packus_epi16( a, a ); \
  }

  #define stbir__simdf_load4_transposed( o0, o1, o2, o3, ptr ) \
      stbir__simdf_load( o0, (ptr) );    \
      stbir__simdf_load( o1, (ptr)+4 );  \
      stbir__simdf_load( o2, (ptr)+8 );  \
      stbir__simdf_load( o3, (ptr)+12 ); \
      {                                  \
        __m128 tmp0, tmp1, tmp2, tmp3;   \
        tmp0 = _mm_unpacklo_ps(o0, o1);  \
        tmp2 = _mm_unpacklo_ps(o2, o3);  \
        tmp1 = _mm_unpackhi_ps(o0, o1);  \
        tmp3 = _mm_unpackhi_ps(o2, o3);  \
        o0 = _mm_movelh_ps(tmp0, tmp2);  \
        o1 = _mm_movehl_ps(tmp2, tmp0);  \
        o2 = _mm_movelh_ps(tmp1, tmp3);  \
        o3 = _mm_movehl_ps(tmp3, tmp1);  \
      }

  #define stbir__interleave_pack_and_store_16_u8( ptr, r0, r1, r2, r3 ) \
      r0 = _mm_packs_epi32( r0, r1 ); \
      r2 = _mm_packs_epi32( r2, r3 ); \
      r1 = _mm_unpacklo_epi16( r0, r2 ); \
      r3 = _mm_unpackhi_epi16( r0, r2 ); \
      r0 = _mm_unpacklo_epi16( r1, r3 ); \
      r2 = _mm_unpackhi_epi16( r1, r3 ); \
      r0 = _mm_packus_epi16( r0, r2 ); \
      stbir__simdi_store( ptr, r0 );

  #define stbir__simdi_32shr( out, reg, imm ) out = _mm_srli_epi32( reg, imm )

  #if defined(_MSC_VER) && !defined(__clang__)
    // MSVC initializes __m128i/__m256i constants from 8-bit values, so each 32-bit value is split into 4 bytes
    #define STBIR__CONST_32_TO_8( v ) (char)(unsigned char)((v)&255),(char)(unsigned char)(((v)>>8)&255),(char)(unsigned char)(((v)>>16)&255),(char)(unsigned char)(((v)>>24)&255)
    #define STBIR__CONST_4_32i( v ) STBIR__CONST_32_TO_8( v ), STBIR__CONST_32_TO_8( v ), STBIR__CONST_32_TO_8( v ), STBIR__CONST_32_TO_8( v )
    #define STBIR__CONST_4d_32i( v0, v1, v2, v3 ) STBIR__CONST_32_TO_8( v0 ), STBIR__CONST_32_TO_8( v1 ), STBIR__CONST_32_TO_8( v2 ), STBIR__CONST_32_TO_8( v3 )
  #else
    // everything else initializes with long longs, so two 32-bit values are packed into each 64-bit element
    #define STBIR__CONST_4_32i( v ) (long long)((((stbir_uint64)(stbir_uint32)(v))<<32)|((stbir_uint64)(stbir_uint32)(v))),(long long)((((stbir_uint64)(stbir_uint32)(v))<<32)|((stbir_uint64)(stbir_uint32)(v)))
    #define STBIR__CONST_4d_32i( v0, v1, v2, v3 ) (long long)((((stbir_uint64)(stbir_uint32)(v1))<<32)|((stbir_uint64)(stbir_uint32)(v0))),(long long)((((stbir_uint64)(stbir_uint32)(v3))<<32)|((stbir_uint64)(stbir_uint32)(v2)))
  #endif

  #define STBIR__SIMDF_CONST(var, x) stbir__simdf var = { x, x, x, x }
  #define STBIR__SIMDI_CONST(var, x) stbir__simdi var = { STBIR__CONST_4_32i(x) }
  #define STBIR__CONSTF(var) (var)
  #define STBIR__CONSTI(var) (var)

  #if defined(STBIR_AVX) || defined(__SSE4_1__)
    #include <smmintrin.h>
    #define stbir__simdf_pack_to_8words(out,reg0,reg1) out = _mm_packus_epi32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg0,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())), _mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg1,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())))
  #else
    STBIR__SIMDI_CONST(stbir__s32_32768, 32768);
    STBIR__SIMDI_CONST(stbir__s16_32768, ((32768<<16)|32768));

    #define stbir__simdf_pack_to_8words(out,reg0,reg1) \
      { \
        stbir__simdi tmp0,tmp1; \
        tmp0 = _mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg0,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())); \
        tmp1 = _mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg1,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())); \
        tmp0 = _mm_sub_epi32( tmp0, stbir__s32_32768 ); \
        tmp1 = _mm_sub_epi32( tmp1, stbir__s32_32768 ); \
        out = _mm_packs_epi32( tmp0, tmp1 ); \
        out = _mm_sub_epi16( out, stbir__s16_32768 ); \
      }

  #endif
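
  // Illustrative sketch only (not part of stb_image_resize2): a scalar view of the bias trick
  // in the fallback above. Without SSE4.1 there is no unsigned 32-to-16 pack, so the already
  // clamped value is shifted into signed 16-bit range, packed with signed saturation (which
  // then never clips), and shifted back with 16-bit wraparound. The helper name
  // example_pack_u16 is hypothetical, and the block is compiled out.
  #if 0
  static unsigned short example_pack_u16( int v )  // assumes v is already clamped to 0..65535
  {
    short s = (short)( v - 32768 );                          // now -32768..32767, like _mm_sub_epi32 + _mm_packs_epi32
    return (unsigned short)( (unsigned short)s - 32768u );   // 16-bit wraparound undoes the bias, like _mm_sub_epi16
  }
  #endif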

  #define STBIR_SIMD

  // if we detect AVX, set the simd8 defines
  #ifdef STBIR_AVX
    #include <immintrin.h>
    #define STBIR_SIMD8
    #define stbir__simdf8 __m256
    #define stbir__simdi8 __m256i
    #define stbir__simdf8_load( out, ptr ) (out) = _mm256_loadu_ps( (float const *)(ptr) )
    #define stbir__simdi8_load( out, ptr ) (out) = _mm256_loadu_si256( (__m256i const *)(ptr) )
    #define stbir__simdf8_mult( out, a, b ) (out) = _mm256_mul_ps( (a), (b) )
    #define stbir__simdf8_store( ptr, out ) _mm256_storeu_ps( (float*)(ptr), out )
    #define stbir__simdi8_store( ptr, reg )  _mm256_storeu_si256( (__m256i*)(ptr), reg )
    #define stbir__simdf8_frep8( fval ) _mm256_set1_ps( fval )

    #define stbir__simdf8_min( out, reg0, reg1 ) (out) = _mm256_min_ps( reg0, reg1 )
    #define stbir__simdf8_max( out, reg0, reg1 ) (out) = _mm256_max_ps( reg0, reg1 )

    #define stbir__simdf8_add4halves( out, bot4, top8 ) (out) = _mm_add_ps( bot4, _mm256_extractf128_ps( top8, 1 ) )
    #define stbir__simdf8_mult_mem( out, reg, ptr ) (out) = _mm256_mul_ps( reg, _mm256_loadu_ps( (float const*)(ptr) ) )
    #define stbir__simdf8_add_mem( out, reg, ptr ) (out) = _mm256_add_ps( reg, _mm256_loadu_ps( (float const*)(ptr) ) )
    #define stbir__simdf8_add( out, a, b ) (out) = _mm256_add_ps( a, b )
    #define stbir__simdf8_load1b( out, ptr ) (out) = _mm256_broadcast_ss( ptr )
    #define stbir__simdf_load1rep4( out, ptr ) (out) = _mm_broadcast_ss( ptr )  // avx load instruction

    #define stbir__simdi8_convert_i32_to_float(out, ireg) (out) = _mm256_cvtepi32_ps( ireg )
    #define stbir__simdf8_convert_float_to_i32( i, f ) (i) = _mm256_cvttps_epi32(f)

    #define stbir__simdf8_bot4s( out, a, b ) (out) = _mm256_permute2f128_ps(a,b, (0<<0)+(2<<4) )
    #define stbir__simdf8_top4s( out, a, b ) (out) = _mm256_permute2f128_ps(a,b, (1<<0)+(3<<4) )

    #define stbir__simdf8_gettop4( reg ) _mm256_extractf128_ps(reg,1)

    #ifdef STBIR_AVX2

    #define stbir__simdi8_expand_u8_to_u32(out0,out1,ireg) \
    { \
      stbir__simdi8 a, zero  =_mm256_setzero_si256();\
      a = _mm256_permute4x64_epi64( _mm256_unpacklo_epi8( _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),(0<<0)+(2<<2)+(1<<4)+(3<<6)), zero ),(0<<0)+(2<<2)+(1<<4)+(3<<6)); \
      out0 = _mm256_unpacklo_epi16( a, zero ); \
      out1 = _mm256_unpackhi_epi16( a, zero ); \
    }

    #define stbir__simdf8_pack_to_16bytes(out,aa,bb) \
    { \
      stbir__simdi8 t; \
      stbir__simdf8 af,bf; \
      stbir__simdi8 a,b; \
      af = _mm256_min_ps( aa, STBIR_max_uint8_as_floatX ); \
      bf = _mm256_min_ps( bb, STBIR_max_uint8_as_floatX ); \
      af = _mm256_max_ps( af, _mm256_setzero_ps() ); \
      bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \
      a = _mm256_cvttps_epi32( af ); \
      b = _mm256_cvttps_epi32( bf ); \
      t = _mm256_permute4x64_epi64( _mm256_packs_epi32( a, b ), (0<<0)+(2<<2)+(1<<4)+(3<<6) ); \
      out = _mm256_castsi256_si128( _mm256_permute4x64_epi64( _mm256_packus_epi16( t, t ), (0<<0)+(2<<2)+(1<<4)+(3<<6) ) ); \
    }

    #define stbir__simdi8_expand_u16_to_u32(out,ireg) out = _mm256_unpacklo_epi16( _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),(0<<0)+(2<<2)+(1<<4)+(3<<6)), _mm256_setzero_si256() );

    #define stbir__simdf8_pack_to_16words(out,aa,bb) \
      { \
        stbir__simdf8 af,bf; \
        stbir__simdi8 a,b; \
        af = _mm256_min_ps( aa, STBIR_max_uint16_as_floatX ); \
        bf = _mm256_min_ps( bb, STBIR_max_uint16_as_floatX ); \
        af = _mm256_max_ps( af, _mm256_setzero_ps() ); \
        bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \
        a = _mm256_cvttps_epi32( af ); \
        b = _mm256_cvttps_epi32( bf ); \
        (out) = _mm256_permute4x64_epi64( _mm256_packus_epi32(a, b), (0<<0)+(2<<2)+(1<<4)+(3<<6) ); \
      }

    #else

    #define stbir__simdi8_expand_u8_to_u32(out0,out1,ireg) \
    { \
      stbir__simdi a,zero = _mm_setzero_si128(); \
      a = _mm_unpacklo_epi8( ireg, zero ); \
      out0 = _mm256_setr_m128i( _mm_unpacklo_epi16( a, zero ), _mm_unpackhi_epi16( a, zero ) ); \
      a = _mm_unpackhi_epi8( ireg, zero ); \
      out1 = _mm256_setr_m128i( _mm_unpacklo_epi16( a, zero ), _mm_unpackhi_epi16( a, zero ) ); \
    }

    #define stbir__simdf8_pack_to_16bytes(out,aa,bb) \
    { \
      stbir__simdi t; \
      stbir__simdf8 af,bf; \
      stbir__simdi8 a,b; \
      af = _mm256_min_ps( aa, STBIR_max_uint8_as_floatX ); \
      bf = _mm256_min_ps( bb, STBIR_max_uint8_as_floatX ); \
      af = _mm256_max_ps( af, _mm256_setzero_ps() ); \
      bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \
      a = _mm256_cvttps_epi32( af ); \
      b = _mm256_cvttps_epi32( bf ); \
      out = _mm_packs_epi32( _mm256_castsi256_si128(a), _mm256_extractf128_si256( a, 1 ) ); \
      out = _mm_packus_epi16( out, out ); \
      t = _mm_packs_epi32( _mm256_castsi256_si128(b), _mm256_extractf128_si256( b, 1 ) ); \
      t = _mm_packus_epi16( t, t ); \
      out = _mm_castps_si128( _mm_shuffle_ps( _mm_castsi128_ps(out), _mm_castsi128_ps(t), (0<<0)+(1<<2)+(0<<4)+(1<<6) ) ); \
    }

    #define stbir__simdi8_expand_u16_to_u32(out,ireg) \
    { \
      stbir__simdi a,b,zero = _mm_setzero_si128(); \
      a = _mm_unpacklo_epi16( ireg, zero ); \
      b = _mm_unpackhi_epi16( ireg, zero ); \
      out = _mm256_insertf128_si256( _mm256_castsi128_si256( a ), b, 1 ); \
    }

    #define stbir__simdf8_pack_to_16words(out,aa,bb) \
      { \
        stbir__simdi t0,t1; \
        stbir__simdf8 af,bf; \
        stbir__simdi8 a,b; \
        af = _mm256_min_ps( aa, STBIR_max_uint16_as_floatX ); \
        bf = _mm256_min_ps( bb, STBIR_max_uint16_as_floatX ); \
        af = _mm256_max_ps( af, _mm256_setzero_ps() ); \
        bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \
        a = _mm256_cvttps_epi32( af ); \
        b = _mm256_cvttps_epi32( bf ); \
        t0 = _mm_packus_epi32( _mm256_castsi256_si128(a), _mm256_extractf128_si256( a, 1 ) ); \
        t1 = _mm_packus_epi32( _mm256_castsi256_si128(b), _mm256_extractf128_si256( b, 1 ) ); \
        out = _mm256_setr_m128i( t0, t1 ); \
      }

    #endif

    static __m256i stbir_00001111 = { STBIR__CONST_4d_32i( 0, 0, 0, 0 ), STBIR__CONST_4d_32i( 1, 1, 1, 1 ) };
    #define stbir__simdf8_0123to00001111( out, in ) (out) = _mm256_permutevar_ps ( in, stbir_00001111 )

    static __m256i stbir_22223333 = { STBIR__CONST_4d_32i( 2, 2, 2, 2 ), STBIR__CONST_4d_32i( 3, 3, 3, 3 ) };
    #define stbir__simdf8_0123to22223333( out, in ) (out) = _mm256_permutevar_ps ( in, stbir_22223333 )

    #define stbir__simdf8_0123to2222( out, in ) (out) = stbir__simdf_swiz(_mm256_castps256_ps128(in), 2,2,2,2 )

    #define stbir__simdf8_load4b( out, ptr ) (out) = _mm256_broadcast_ps( (__m128 const *)(ptr) )

    static __m256i stbir_00112233 = { STBIR__CONST_4d_32i( 0, 0, 1, 1 ), STBIR__CONST_4d_32i( 2, 2, 3, 3 ) };
    #define stbir__simdf8_0123to00112233( out, in ) (out) = _mm256_permutevar_ps ( in, stbir_00112233 )
    #define stbir__simdf8_add4( out, a8, b ) (out) = _mm256_add_ps( a8,  _mm256_castps128_ps256( b ) )

    static __m256i stbir_load6 = { STBIR__CONST_4_32i( 0x80000000 ), STBIR__CONST_4d_32i(  0x80000000,  0x80000000, 0, 0 ) };
    #define stbir__simdf8_load6z( out, ptr ) (out) = _mm256_maskload_ps( ptr, stbir_load6 )

    #define stbir__simdf8_0123to00000000( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (0<<0)+(0<<2)+(0<<4)+(0<<6) )
    #define stbir__simdf8_0123to11111111( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (1<<0)+(1<<2)+(1<<4)+(1<<6) )
    #define stbir__simdf8_0123to22222222( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (2<<0)+(2<<2)+(2<<4)+(2<<6) )
    #define stbir__simdf8_0123to33333333( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (3<<0)+(3<<2)+(3<<4)+(3<<6) )
    #define stbir__simdf8_0123to21032103( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (2<<0)+(1<<2)+(0<<4)+(3<<6) )
    #define stbir__simdf8_0123to32103210( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (3<<0)+(2<<2)+(1<<4)+(0<<6) )
    #define stbir__simdf8_0123to12301230( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (1<<0)+(2<<2)+(3<<4)+(0<<6) )
    #define stbir__simdf8_0123to10321032( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (1<<0)+(0<<2)+(3<<4)+(2<<6) )
    #define stbir__simdf8_0123to30123012( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (3<<0)+(0<<2)+(1<<4)+(2<<6) )

    #define stbir__simdf8_0123to11331133( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (1<<0)+(1<<2)+(3<<4)+(3<<6) )
    #define stbir__simdf8_0123to00220022( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (0<<0)+(0<<2)+(2<<4)+(2<<6) )

    #define stbir__simdf8_aaa1( out, alp, ones ) (out) = _mm256_blend_ps( alp, ones, (1<<0)+(1<<1)+(1<<2)+(0<<3)+(1<<4)+(1<<5)+(1<<6)+(0<<7)); (out)=_mm256_shuffle_ps( out,out, (3<<0) + (3<<2) + (3<<4) + (0<<6) )
    #define stbir__simdf8_1aaa( out, alp, ones ) (out) = _mm256_blend_ps( alp, ones, (0<<0)+(1<<1)+(1<<2)+(1<<3)+(0<<4)+(1<<5)+(1<<6)+(1<<7)); (out)=_mm256_shuffle_ps( out,out, (1<<0) + (0<<2) + (0<<4) + (0<<6) )
    #define stbir__simdf8_a1a1( out, alp, ones) (out) = _mm256_blend_ps( alp, ones, (1<<0)+(0<<1)+(1<<2)+(0<<3)+(1<<4)+(0<<5)+(1<<6)+(0<<7)); (out)=_mm256_shuffle_ps( out,out, (1<<0) + (0<<2) + (3<<4) + (2<<6) )
    #define stbir__simdf8_1a1a( out, alp, ones) (out) = _mm256_blend_ps( alp, ones, (0<<0)+(1<<1)+(0<<2)+(1<<3)+(0<<4)+(1<<5)+(0<<6)+(1<<7)); (out)=_mm256_shuffle_ps( out,out, (1<<0) + (0<<2) + (3<<4) + (2<<6) )

    #define stbir__simdf8_zero( reg ) (reg) = _mm256_setzero_ps()

    #ifdef STBIR_USE_FMA           // not on by default to maintain bit identical simd to non-simd
    #define stbir__simdf8_madd( out, add, mul1, mul2 ) (out) = _mm256_fmadd_ps( mul1, mul2, add )
    #define stbir__simdf8_madd_mem( out, add, mul, ptr ) (out) = _mm256_fmadd_ps( mul, _mm256_loadu_ps( (float const*)(ptr) ), add )
    #define stbir__simdf8_madd_mem4( out, add, mul, ptr )(out) = _mm256_fmadd_ps( _mm256_setr_m128( mul, _mm_setzero_ps() ), _mm256_setr_m128( _mm_loadu_ps( (float const*)(ptr) ), _mm_setzero_ps() ), add )
    #else
    #define stbir__simdf8_madd( out, add, mul1, mul2 ) (out) = _mm256_add_ps( add, _mm256_mul_ps( mul1, mul2 ) )
    #define stbir__simdf8_madd_mem( out, add, mul, ptr ) (out) = _mm256_add_ps( add, _mm256_mul_ps( mul, _mm256_loadu_ps( (float const*)(ptr) ) ) )
    #define stbir__simdf8_madd_mem4( out, add, mul, ptr )  (out) = _mm256_add_ps( add, _mm256_setr_m128( _mm_mul_ps( mul, _mm_loadu_ps( (float const*)(ptr) ) ), _mm_setzero_ps() ) )
    #endif
    #define stbir__if_simdf8_cast_to_simdf4( val ) _mm256_castps256_ps128( val )

  #endif

  #ifdef STBIR_FLOORF
  #undef STBIR_FLOORF
  #endif
  #define STBIR_FLOORF stbir_simd_floorf
  static stbir__inline float stbir_simd_floorf(float x)  // martins floorf
  {
    #if defined(STBIR_AVX) || defined(__SSE4_1__) || defined(STBIR_SSE41)
    __m128 t = _mm_set_ss(x);
    return _mm_cvtss_f32( _mm_floor_ss(t, t) );
    #else
    __m128 f = _mm_set_ss(x);
    __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(f));
    __m128 r = _mm_add_ss(t, _mm_and_ps(_mm_cmplt_ss(f, t), _mm_set_ss(-1.0f)));
    return _mm_cvtss_f32(r);
    #endif
  }

  #ifdef STBIR_CEILF
  #undef STBIR_CEILF
  #endif
  #define STBIR_CEILF stbir_simd_ceilf
  static stbir__inline float stbir_simd_ceilf(float x)  // martins ceilf
  {
    #if defined(STBIR_AVX) || defined(__SSE4_1__) || defined(STBIR_SSE41)
    __m128 t = _mm_set_ss(x);
    return _mm_cvtss_f32( _mm_ceil_ss(t, t) );
    #else
    __m128 f = _mm_set_ss(x);
    __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(f));
    __m128 r = _mm_add_ss(t, _mm_and_ps(_mm_cmplt_ss(t, f), _mm_set_ss(1.0f)));
    return _mm_cvtss_f32(r);
    #endif
  }
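
  // Illustrative sketch only (not part of stb_image_resize2): a scalar version of the
  // truncate-then-adjust fallback used above when the SSE4.1 rounding instructions are not
  // available. The names example_floorf and example_ceilf are hypothetical. Like the SSE2
  // path, this assumes the input magnitude fits in a 32-bit int. The block is compiled out.
  #if 0
  static float example_floorf( float x )
  {
    float t = (float)(int)x;            // truncate toward zero, as _mm_cvttps_epi32 does
    return ( x < t ) ? t - 1.0f : t;    // negative non-integers were truncated upward, so step down by one
  }

  static float example_ceilf( float x )
  {
    float t = (float)(int)x;            // truncate toward zero
    return ( t < x ) ? t + 1.0f : t;    // positive non-integers were truncated downward, so step up by one
  }
  #endif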

#elif defined(STBIR_NEON)

  #include <arm_neon.h>

  #define stbir__simdf float32x4_t
  #define stbir__simdi uint32x4_t

  #define stbir_simdi_castf( reg ) vreinterpretq_u32_f32(reg)
  #define stbir_simdf_casti( reg ) vreinterpretq_f32_u32(reg)

  #define stbir__simdf_load( reg, ptr ) (reg) = vld1q_f32( (float const*)(ptr) )
  #define stbir__simdi_load( reg, ptr ) (reg) = vld1q_u32( (uint32_t const*)(ptr) )
  #define stbir__simdf_load1( out, ptr ) (out) = vld1q_dup_f32( (float const*)(ptr) ) // top values can be random (not denormal or nan for perf)
  #define stbir__simdi_load1( out, ptr ) (out) = vld1q_dup_u32( (uint32_t const*)(ptr) )
  #define stbir__simdf_load1z( out, ptr ) (out) = vld1q_lane_f32( (float const*)(ptr), vdupq_n_f32(0), 0 ) // top values must be zero
  #define stbir__simdf_frep4( fvar ) vdupq_n_f32( fvar )
  #define stbir__simdf_load1frep4( out, fvar ) (out) = vdupq_n_f32( fvar )
  #define stbir__simdf_load2( out, ptr ) (out) = vcombine_f32( vld1_f32( (float const*)(ptr) ), vcreate_f32(0) ) // top values can be random (not denormal or nan for perf)
  #define stbir__simdf_load2z( out, ptr ) (out) = vcombine_f32( vld1_f32( (float const*)(ptr) ), vcreate_f32(0) )  // top values must be zero
  #define stbir__simdf_load2hmerge( out, reg, ptr ) (out) = vcombine_f32( vget_low_f32(reg), vld1_f32( (float const*)(ptr) ) )

  #define stbir__simdf_zeroP() vdupq_n_f32(0)
  #define stbir__simdf_zero( reg ) (reg) = vdupq_n_f32(0)

  #define stbir__simdf_store( ptr, reg )  vst1q_f32( (float*)(ptr), reg )
  #define stbir__simdf_store1( ptr, reg ) vst1q_lane_f32( (float*)(ptr), reg, 0)
  #define stbir__simdf_store2( ptr, reg ) vst1_f32( (float*)(ptr), vget_low_f32(reg) )
  #define stbir__simdf_store2h( ptr, reg ) vst1_f32( (float*)(ptr), vget_high_f32(reg) )

  #define stbir__simdi_store( ptr, reg )  vst1q_u32( (uint32_t*)(ptr), reg )
  #define stbir__simdi_store1( ptr, reg ) vst1q_lane_u32( (uint32_t*)(ptr), reg, 0 )
  #define stbir__simdi_store2( ptr, reg ) vst1_u32( (uint32_t*)(ptr), vget_low_u32(reg) )

  #define stbir__prefetch( ptr )

  #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \
  { \
    uint16x8_t l = vmovl_u8( vget_low_u8 ( vreinterpretq_u8_u32(ireg) ) ); \
    uint16x8_t h = vmovl_u8( vget_high_u8( vreinterpretq_u8_u32(ireg) ) ); \
    out0 = vmovl_u16( vget_low_u16 ( l ) ); \
    out1 = vmovl_u16( vget_high_u16( l ) ); \
    out2 = vmovl_u16( vget_low_u16 ( h ) ); \
    out3 = vmovl_u16( vget_high_u16( h ) ); \
  }

  #define stbir__simdi_expand_u8_to_1u32(out,ireg) \
  { \
    uint16x8_t tmp = vmovl_u8( vget_low_u8( vreinterpretq_u8_u32(ireg) ) ); \
    out = vmovl_u16( vget_low_u16( tmp ) ); \
  }

  #define stbir__simdi_expand_u16_to_u32(out0,out1,ireg) \
  { \
    uint16x8_t tmp = vreinterpretq_u16_u32(ireg); \
    out0 = vmovl_u16( vget_low_u16 ( tmp ) ); \
    out1 = vmovl_u16( vget_high_u16( tmp ) ); \
  }

  #define stbir__simdf_convert_float_to_i32( i, f ) (i) = vreinterpretq_u32_s32( vcvtq_s32_f32(f) )
  #define stbir__simdf_convert_float_to_int( f ) vgetq_lane_s32(vcvtq_s32_f32(f), 0)
  #define stbir__simdi_to_int( i ) (int)vgetq_lane_u32(i, 0)
  #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)vgetq_lane_s32(vcvtq_s32_f32(vmaxq_f32(vminq_f32(f,STBIR__CONSTF(STBIR_max_uint8_as_float)),vdupq_n_f32(0))), 0))
  #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)vgetq_lane_s32(vcvtq_s32_f32(vmaxq_f32(vminq_f32(f,STBIR__CONSTF(STBIR_max_uint16_as_float)),vdupq_n_f32(0))), 0))
  #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = vcvtq_f32_s32( vreinterpretq_s32_u32(ireg) )
  #define stbir__simdf_add( out, reg0, reg1 ) (out) = vaddq_f32( reg0, reg1 )
  #define stbir__simdf_mult( out, reg0, reg1 ) (out) = vmulq_f32( reg0, reg1 )
  #define stbir__simdf_mult_mem( out, reg, ptr ) (out) = vmulq_f32( reg, vld1q_f32( (float const*)(ptr) ) )
  #define stbir__simdf_mult1_mem( out, reg, ptr ) (out) = vmulq_f32( reg, vld1q_dup_f32( (float const*)(ptr) ) )
  #define stbir__simdf_add_mem( out, reg, ptr ) (out) = vaddq_f32( reg, vld1q_f32( (float const*)(ptr) ) )
  #define stbir__simdf_add1_mem( out, reg, ptr ) (out) = vaddq_f32( reg, vld1q_dup_f32( (float const*)(ptr) ) )

  #ifdef STBIR_USE_FMA           // not on by default to maintain bit identical simd to non-simd (and also x64 no madd to arm madd)
  #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = vfmaq_f32( add, mul1, mul2 )
  #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = vfmaq_f32( add, mul1, mul2 )
  #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = vfmaq_f32( add, mul, vld1q_f32( (float const*)(ptr) ) )
  #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = vfmaq_f32( add, mul, vld1q_dup_f32( (float const*)(ptr) ) )
  #else
  #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = vaddq_f32( add, vmulq_f32( mul1, mul2 ) )
  #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = vaddq_f32( add, vmulq_f32( mul1, mul2 ) )
  #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = vaddq_f32( add, vmulq_f32( mul, vld1q_f32( (float const*)(ptr) ) ) )
  #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = vaddq_f32( add, vmulq_f32( mul, vld1q_dup_f32( (float const*)(ptr) ) ) )
  #endif

  #define stbir__simdf_add1( out, reg0, reg1 ) (out) = vaddq_f32( reg0, reg1 )
  #define stbir__simdf_mult1( out, reg0, reg1 ) (out) = vmulq_f32( reg0, reg1 )

  #define stbir__simdf_and( out, reg0, reg1 ) (out) = vreinterpretq_f32_u32( vandq_u32( vreinterpretq_u32_f32(reg0), vreinterpretq_u32_f32(reg1) ) )
  #define stbir__simdf_or( out, reg0, reg1 ) (out) = vreinterpretq_f32_u32( vorrq_u32( vreinterpretq_u32_f32(reg0), vreinterpretq_u32_f32(reg1) ) )

  #define stbir__simdf_min( out, reg0, reg1 ) (out) = vminq_f32( reg0, reg1 )
  #define stbir__simdf_max( out, reg0, reg1 ) (out) = vmaxq_f32( reg0, reg1 )
  #define stbir__simdf_min1( out, reg0, reg1 ) (out) = vminq_f32( reg0, reg1 )
  #define stbir__simdf_max1( out, reg0, reg1 ) (out) = vmaxq_f32( reg0, reg1 )

  #define stbir__simdf_0123ABCDto3ABx( out, reg0, reg1 ) (out) = vextq_f32( reg0, reg1, 3 )
  #define stbir__simdf_0123ABCDto23Ax( out, reg0, reg1 ) (out) = vextq_f32( reg0, reg1, 2 )

  #define stbir__simdf_a1a1( out, alp, ones ) (out) = vzipq_f32(vuzpq_f32(alp, alp).val[1], ones).val[0]
  #define stbir__simdf_1a1a( out, alp, ones ) (out) = vzipq_f32(ones, vuzpq_f32(alp, alp).val[0]).val[0]

  #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ )

    #define stbir__simdf_aaa1( out, alp, ones ) (out) = vcopyq_laneq_f32(vdupq_n_f32(vgetq_lane_f32(alp, 3)), 3, ones, 3)
    #define stbir__simdf_1aaa( out, alp, ones ) (out) = vcopyq_laneq_f32(vdupq_n_f32(vgetq_lane_f32(alp, 0)), 0, ones, 0)

    #if defined( _MSC_VER ) && !defined(__clang__)
      #define stbir_make16(a,b,c,d) vcombine_u8( \
        vcreate_u8( (4*a+0) | ((4*a+1)<<8) | ((4*a+2)<<16) | ((4*a+3)<<24) | \
          ((stbir_uint64)(4*b+0)<<32) | ((stbir_uint64)(4*b+1)<<40) | ((stbir_uint64)(4*b+2)<<48) | ((stbir_uint64)(4*b+3)<<56)), \
        vcreate_u8( (4*c+0) | ((4*c+1)<<8) | ((4*c+2)<<16) | ((4*c+3)<<24) | \
          ((stbir_uint64)(4*d+0)<<32) | ((stbir_uint64)(4*d+1)<<40) | ((stbir_uint64)(4*d+2)<<48) | ((stbir_uint64)(4*d+3)<<56) ) )

      static stbir__inline uint8x16x2_t stbir_make16x2(float32x4_t rega,float32x4_t regb)
      {
        uint8x16x2_t r = { vreinterpretq_u8_f32(rega), vreinterpretq_u8_f32(regb) };
        return r;
      }
    #else
      #define stbir_make16(a,b,c,d) (uint8x16_t){4*a+0,4*a+1,4*a+2,4*a+3,4*b+0,4*b+1,4*b+2,4*b+3,4*c+0,4*c+1,4*c+2,4*c+3,4*d+0,4*d+1,4*d+2,4*d+3}
      #define stbir_make16x2(a,b) (uint8x16x2_t){{vreinterpretq_u8_f32(a),vreinterpretq_u8_f32(b)}}
    #endif

    #define stbir__simdf_swiz( reg, one, two, three, four ) vreinterpretq_f32_u8( vqtbl1q_u8( vreinterpretq_u8_f32(reg), stbir_make16(one, two, three, four) ) )
    #define stbir__simdf_swiz2( rega, regb, one, two, three, four ) vreinterpretq_f32_u8( vqtbl2q_u8( stbir_make16x2(rega,regb), stbir_make16(one, two, three, four) ) )

    #define stbir__simdi_16madd( out, reg0, reg1 ) \
    { \
      int16x8_t r0 = vreinterpretq_s16_u32(reg0); \
      int16x8_t r1 = vreinterpretq_s16_u32(reg1); \
      int32x4_t tmp0 = vmull_s16( vget_low_s16(r0), vget_low_s16(r1) ); \
      int32x4_t tmp1 = vmull_s16( vget_high_s16(r0), vget_high_s16(r1) ); \
      (out) = vreinterpretq_u32_s32( vpaddq_s32(tmp0, tmp1) ); \
    }

  #else

    #define stbir__simdf_aaa1( out, alp, ones ) (out) = vsetq_lane_f32(1.0f, vdupq_n_f32(vgetq_lane_f32(alp, 3)), 3)
    #define stbir__simdf_1aaa( out, alp, ones ) (out) = vsetq_lane_f32(1.0f, vdupq_n_f32(vgetq_lane_f32(alp, 0)), 0)

    #if defined( _MSC_VER ) && !defined(__clang__)
      static stbir__inline uint8x8x2_t stbir_make8x2(float32x4_t reg)
      {
        uint8x8x2_t r = { { vget_low_u8(vreinterpretq_u8_f32(reg)), vget_high_u8(vreinterpretq_u8_f32(reg)) } };
        return r;
      }
      #define stbir_make8(a,b) vcreate_u8( \
        (4*a+0) | ((4*a+1)<<8) | ((4*a+2)<<16) | ((4*a+3)<<24) | \
        ((stbir_uint64)(4*b+0)<<32) | ((stbir_uint64)(4*b+1)<<40) | ((stbir_uint64)(4*b+2)<<48) | ((stbir_uint64)(4*b+3)<<56) )
    #else
      #define stbir_make8x2(reg) (uint8x8x2_t){ { vget_low_u8(vreinterpretq_u8_f32(reg)), vget_high_u8(vreinterpretq_u8_f32(reg)) } }
      #define stbir_make8(a,b) (uint8x8_t){4*a+0,4*a+1,4*a+2,4*a+3,4*b+0,4*b+1,4*b+2,4*b+3}
    #endif

    #define stbir__simdf_swiz( reg, one, two, three, four ) vreinterpretq_f32_u8( vcombine_u8( \
        vtbl2_u8( stbir_make8x2( reg ), stbir_make8( one, two ) ), \
        vtbl2_u8( stbir_make8x2( reg ), stbir_make8( three, four ) ) ) )

    #define stbir__simdi_16madd( out, reg0, reg1 ) \
    { \
      int16x8_t r0 = vreinterpretq_s16_u32(reg0); \
      int16x8_t r1 = vreinterpretq_s16_u32(reg1); \
      int32x4_t tmp0 = vmull_s16( vget_low_s16(r0), vget_low_s16(r1) ); \
      int32x4_t tmp1 = vmull_s16( vget_high_s16(r0), vget_high_s16(r1) ); \
      int32x2_t out0 = vpadd_s32( vget_low_s32(tmp0), vget_high_s32(tmp0) ); \
      int32x2_t out1 = vpadd_s32( vget_low_s32(tmp1), vget_high_s32(tmp1) ); \
      (out) = vreinterpretq_u32_s32( vcombine_s32(out0, out1) ); \
    }

  #endif

  #define stbir__simdi_and( out, reg0, reg1 ) (out) = vandq_u32( reg0, reg1 )
  #define stbir__simdi_or( out, reg0, reg1 ) (out) = vorrq_u32( reg0, reg1 )

  #define stbir__simdf_pack_to_8bytes(out,aa,bb) \
  { \
    float32x4_t af = vmaxq_f32( vminq_f32(aa,STBIR__CONSTF(STBIR_max_uint8_as_float) ), vdupq_n_f32(0) ); \
    float32x4_t bf = vmaxq_f32( vminq_f32(bb,STBIR__CONSTF(STBIR_max_uint8_as_float) ), vdupq_n_f32(0) ); \
    int16x4_t ai = vqmovn_s32( vcvtq_s32_f32( af ) ); \
    int16x4_t bi = vqmovn_s32( vcvtq_s32_f32( bf ) ); \
    uint8x8_t out8 = vqmovun_s16( vcombine_s16(ai, bi) ); \
    out = vreinterpretq_u32_u8( vcombine_u8(out8, out8) ); \
  }

  #define stbir__simdf_pack_to_8words(out,aa,bb) \
  { \
    float32x4_t af = vmaxq_f32( vminq_f32(aa,STBIR__CONSTF(STBIR_max_uint16_as_float) ), vdupq_n_f32(0) ); \
    float32x4_t bf = vmaxq_f32( vminq_f32(bb,STBIR__CONSTF(STBIR_max_uint16_as_float) ), vdupq_n_f32(0) ); \
    int32x4_t ai = vcvtq_s32_f32( af ); \
    int32x4_t bi = vcvtq_s32_f32( bf ); \
    out = vreinterpretq_u32_u16( vcombine_u16(vqmovun_s32(ai), vqmovun_s32(bi)) ); \
  }

  #define stbir__interleave_pack_and_store_16_u8( ptr, r0, r1, r2, r3 ) \
  { \
    int16x4x2_t tmp0 = vzip_s16( vqmovn_s32(vreinterpretq_s32_u32(r0)), vqmovn_s32(vreinterpretq_s32_u32(r2)) ); \
    int16x4x2_t tmp1 = vzip_s16( vqmovn_s32(vreinterpretq_s32_u32(r1)), vqmovn_s32(vreinterpretq_s32_u32(r3)) ); \
    uint8x8x2_t out = \
    { { \
      vqmovun_s16( vcombine_s16(tmp0.val[0], tmp0.val[1]) ), \
      vqmovun_s16( vcombine_s16(tmp1.val[0], tmp1.val[1]) ), \
    } }; \
    vst2_u8(ptr, out); \
  }

  #define stbir__simdf_load4_transposed( o0, o1, o2, o3, ptr ) \
  { \
    float32x4x4_t tmp = vld4q_f32(ptr); \
    o0 = tmp.val[0]; \
    o1 = tmp.val[1]; \
    o2 = tmp.val[2]; \
    o3 = tmp.val[3]; \
  }

  #define stbir__simdi_32shr( out, reg, imm ) out = vshrq_n_u32( reg, imm )

  #if defined( _MSC_VER ) && !defined(__clang__)
    #define STBIR__SIMDF_CONST(var, x) __declspec(align(8)) float var[] = { x, x, x, x }
    #define STBIR__SIMDI_CONST(var, x) __declspec(align(8)) uint32_t var[] = { x, x, x, x }
    #define STBIR__CONSTF(var) (*(const float32x4_t*)var)
    #define STBIR__CONSTI(var) (*(const uint32x4_t*)var)
  #else
    #define STBIR__SIMDF_CONST(var, x) stbir__simdf var = { x, x, x, x }
    #define STBIR__SIMDI_CONST(var, x) stbir__simdi var = { x, x, x, x }
    #define STBIR__CONSTF(var) (var)
    #define STBIR__CONSTI(var) (var)
  #endif

  #ifdef STBIR_FLOORF
  #undef STBIR_FLOORF
  #endif
  #define STBIR_FLOORF stbir_simd_floorf
  static stbir__inline float stbir_simd_floorf(float x)
  {
    #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ )
    return vget_lane_f32( vrndm_f32( vdup_n_f32(x) ), 0);
    #else
    // fallback without vrndm_f32: truncate toward zero, then add -1.0f where truncation rounded up
    float32x2_t f = vdup_n_f32(x);
    float32x2_t t = vcvt_f32_s32(vcvt_s32_f32(f));
    uint32x2_t a = vclt_f32(f, t);
    uint32x2_t b = vreinterpret_u32_f32(vdup_n_f32(-1.0f));
    float32x2_t r = vadd_f32(t, vreinterpret_f32_u32(vand_u32(a, b)));
    return vget_lane_f32(r, 0);
    #endif
  }

  #ifdef STBIR_CEILF
  #undef STBIR_CEILF
  #endif
  #define STBIR_CEILF stbir_simd_ceilf
  static stbir__inline float stbir_simd_ceilf(float x)
  {
    #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ )
    return vget_lane_f32( vrndp_f32( vdup_n_f32(x) ), 0);
    #else
    // fallback without vrndp_f32: truncate toward zero, then add 1.0f where truncation rounded down
    float32x2_t f = vdup_n_f32(x);
    float32x2_t t = vcvt_f32_s32(vcvt_s32_f32(f));
    uint32x2_t a = vclt_f32(t, f);
    uint32x2_t b = vreinterpret_u32_f32(vdup_n_f32(1.0f));
    float32x2_t r = vadd_f32(t, vreinterpret_f32_u32(vand_u32(a, b)));
    return vget_lane_f32(r, 0);
    #endif
  }

  #define STBIR_SIMD

#elif defined(STBIR_WASM)

  #include <wasm_simd128.h>

  #define stbir__simdf v128_t
  #define stbir__simdi v128_t

  #define stbir_simdi_castf( reg ) (reg)
  #define stbir_simdf_casti( reg ) (reg)

  #define stbir__simdf_load( reg, ptr )             (reg) = wasm_v128_load( (void const*)(ptr) )
  #define stbir__simdi_load( reg, ptr )             (reg) = wasm_v128_load( (void const*)(ptr) )
  #define stbir__simdf_load1( out, ptr )            (out) = wasm_v128_load32_splat( (void const*)(ptr) ) // top values can be random (not denormal or nan for perf)
  #define stbir__simdi_load1( out, ptr )            (out) = wasm_v128_load32_splat( (void const*)(ptr) )
  #define stbir__simdf_load1z( out, ptr )           (out) = wasm_v128_load32_zero( (void const*)(ptr) ) // top values must be zero
  #define stbir__simdf_frep4( fvar )                wasm_f32x4_splat( fvar )
  #define stbir__simdf_load1frep4( out, fvar )      (out) = wasm_f32x4_splat( fvar )
  #define stbir__simdf_load2( out, ptr )            (out) = wasm_v128_load64_splat( (void const*)(ptr) ) // top values can be random (not denormal or nan for perf)
  #define stbir__simdf_load2z( out, ptr )           (out) = wasm_v128_load64_zero( (void const*)(ptr) ) // top values must be zero
  #define stbir__simdf_load2hmerge( out, reg, ptr ) (out) = wasm_v128_load64_lane( (void const*)(ptr), reg, 1 )

  #define stbir__simdf_zeroP() wasm_f32x4_const_splat(0)
  #define stbir__simdf_zero( reg ) (reg) = wasm_f32x4_const_splat(0)

  #define stbir__simdf_store( ptr, reg )   wasm_v128_store( (void*)(ptr), reg )
  #define stbir__simdf_store1( ptr, reg )  wasm_v128_store32_lane( (void*)(ptr), reg, 0 )
  #define stbir__simdf_store2( ptr, reg )  wasm_v128_store64_lane( (void*)(ptr), reg, 0 )
  #define stbir__simdf_store2h( ptr, reg ) wasm_v128_store64_lane( (void*)(ptr), reg, 1 )

  #define stbir__simdi_store( ptr, reg )  wasm_v128_store( (void*)(ptr), reg )
  #define stbir__simdi_store1( ptr, reg ) wasm_v128_store32_lane( (void*)(ptr), reg, 0 )
  #define stbir__simdi_store2( ptr, reg ) wasm_v128_store64_lane( (void*)(ptr), reg, 0 )

  #define stbir__prefetch( ptr )

  #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \
  { \
    v128_t l = wasm_u16x8_extend_low_u8x16 ( ireg ); \
    v128_t h = wasm_u16x8_extend_high_u8x16( ireg ); \
    out0 = wasm_u32x4_extend_low_u16x8 ( l ); \
    out1 = wasm_u32x4_extend_high_u16x8( l ); \
    out2 = wasm_u32x4_extend_low_u16x8 ( h ); \
    out3 = wasm_u32x4_extend_high_u16x8( h ); \
  }

  #define stbir__simdi_expand_u8_to_1u32(out,ireg) \
  { \
    v128_t tmp = wasm_u16x8_extend_low_u8x16(ireg); \
    out = wasm_u32x4_extend_low_u16x8(tmp); \
  }

  #define stbir__simdi_expand_u16_to_u32(out0,out1,ireg) \
  { \
    out0 = wasm_u32x4_extend_low_u16x8 ( ireg ); \
    out1 = wasm_u32x4_extend_high_u16x8( ireg ); \
  }

  #define stbir__simdf_convert_float_to_i32( i, f )    (i) = wasm_i32x4_trunc_sat_f32x4(f)
  #define stbir__simdf_convert_float_to_int( f )       wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(f), 0)
  #define stbir__simdi_to_int( i )                     wasm_i32x4_extract_lane(i, 0)
  #define stbir__simdf_convert_float_to_uint8( f )     ((unsigned char)wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_max(wasm_f32x4_min(f,STBIR_max_uint8_as_float),wasm_f32x4_const_splat(0))), 0))
  #define stbir__simdf_convert_float_to_short( f )     ((unsigned short)wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_max(wasm_f32x4_min(f,STBIR_max_uint16_as_float),wasm_f32x4_const_splat(0))), 0))
  #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = wasm_f32x4_convert_i32x4(ireg)
  #define stbir__simdf_add( out, reg0, reg1 )          (out) = wasm_f32x4_add( reg0, reg1 )
  #define stbir__simdf_mult( out, reg0, reg1 )         (out) = wasm_f32x4_mul( reg0, reg1 )
  #define stbir__simdf_mult_mem( out, reg, ptr )       (out) = wasm_f32x4_mul( reg, wasm_v128_load( (void const*)(ptr) ) )
  #define stbir__simdf_mult1_mem( out, reg, ptr )      (out) = wasm_f32x4_mul( reg, wasm_v128_load32_splat( (void const*)(ptr) ) )
  #define stbir__simdf_add_mem( out, reg, ptr )        (out) = wasm_f32x4_add( reg, wasm_v128_load( (void const*)(ptr) ) )
  #define stbir__simdf_add1_mem( out, reg, ptr )       (out) = wasm_f32x4_add( reg, wasm_v128_load32_splat( (void const*)(ptr) ) )

  #define stbir__simdf_madd( out, add, mul1, mul2 )    (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul1, mul2 ) )
  #define stbir__simdf_madd1( out, add, mul1, mul2 )   (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul1, mul2 ) )
  #define stbir__simdf_madd_mem( out, add, mul, ptr )  (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul, wasm_v128_load( (void const*)(ptr) ) ) )
  #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul, wasm_v128_load32_splat( (void const*)(ptr) ) ) )

  #define stbir__simdf_add1( out, reg0, reg1 )  (out) = wasm_f32x4_add( reg0, reg1 )
  #define stbir__simdf_mult1( out, reg0, reg1 ) (out) = wasm_f32x4_mul( reg0, reg1 )

  #define stbir__simdf_and( out, reg0, reg1 ) (out) = wasm_v128_and( reg0, reg1 )
  #define stbir__simdf_or( out, reg0, reg1 )  (out) = wasm_v128_or( reg0, reg1 )

  #define stbir__simdf_min( out, reg0, reg1 ) (out) = wasm_f32x4_min( reg0, reg1 )
  #define stbir__simdf_max( out, reg0, reg1 ) (out) = wasm_f32x4_max( reg0, reg1 )
  #define stbir__simdf_min1( out, reg0, reg1 ) (out) = wasm_f32x4_min( reg0, reg1 )
  #define stbir__simdf_max1( out, reg0, reg1 ) (out) = wasm_f32x4_max( reg0, reg1 )

  #define stbir__simdf_0123ABCDto3ABx( out, reg0, reg1 ) (out) = wasm_i32x4_shuffle( reg0, reg1, 3, 4, 5, -1 )
  #define stbir__simdf_0123ABCDto23Ax( out, reg0, reg1 ) (out) = wasm_i32x4_shuffle( reg0, reg1, 2, 3, 4, -1 )

  #define stbir__simdf_aaa1(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 3, 3, 3, 4)
  #define stbir__simdf_1aaa(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 4, 0, 0, 0)
  #define stbir__simdf_a1a1(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 1, 4, 3, 4)
  #define stbir__simdf_1a1a(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 4, 0, 4, 2)

  #define stbir__simdf_swiz( reg, one, two, three, four ) wasm_i32x4_shuffle(reg, reg, one, two, three, four)

  #define stbir__simdi_and( out, reg0, reg1 )    (out) = wasm_v128_and( reg0, reg1 )
  #define stbir__simdi_or( out, reg0, reg1 )     (out) = wasm_v128_or( reg0, reg1 )
  #define stbir__simdi_16madd( out, reg0, reg1 ) (out) = wasm_i32x4_dot_i16x8( reg0, reg1 )

  #define stbir__simdf_pack_to_8bytes(out,aa,bb) \
  { \
    v128_t af = wasm_f32x4_max( wasm_f32x4_min(aa, STBIR_max_uint8_as_float), wasm_f32x4_const_splat(0) ); \
    v128_t bf = wasm_f32x4_max( wasm_f32x4_min(bb, STBIR_max_uint8_as_float), wasm_f32x4_const_splat(0) ); \
    v128_t ai = wasm_i32x4_trunc_sat_f32x4( af ); \
    v128_t bi = wasm_i32x4_trunc_sat_f32x4( bf ); \
    v128_t out16 = wasm_i16x8_narrow_i32x4( ai, bi ); \
    out = wasm_u8x16_narrow_i16x8( out16, out16 ); \
  }

  #define stbir__simdf_pack_to_8words(out,aa,bb) \
  { \
    v128_t af = wasm_f32x4_max( wasm_f32x4_min(aa, STBIR_max_uint16_as_float), wasm_f32x4_const_splat(0)); \
    v128_t bf = wasm_f32x4_max( wasm_f32x4_min(bb, STBIR_max_uint16_as_float), wasm_f32x4_const_splat(0)); \
    v128_t ai = wasm_i32x4_trunc_sat_f32x4( af ); \
    v128_t bi = wasm_i32x4_trunc_sat_f32x4( bf ); \
    out = wasm_u16x8_narrow_i32x4( ai, bi ); \
  }

  #define stbir__interleave_pack_and_store_16_u8( ptr, r0, r1, r2, r3 ) \
  { \
    v128_t tmp0 = wasm_i16x8_narrow_i32x4(r0, r1); \
    v128_t tmp1 = wasm_i16x8_narrow_i32x4(r2, r3); \
    v128_t tmp = wasm_u8x16_narrow_i16x8(tmp0, tmp1); \
    tmp = wasm_i8x16_shuffle(tmp, tmp, 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15); \
    wasm_v128_store( (void*)(ptr), tmp); \
  }

  #define stbir__simdf_load4_transposed( o0, o1, o2, o3, ptr ) \
  { \
    v128_t t0 = wasm_v128_load( ptr    ); \
    v128_t t1 = wasm_v128_load( ptr+4  ); \
    v128_t t2 = wasm_v128_load( ptr+8  ); \
    v128_t t3 = wasm_v128_load( ptr+12 ); \
    v128_t s0 = wasm_i32x4_shuffle(t0, t1, 0, 4, 2, 6); \
    v128_t s1 = wasm_i32x4_shuffle(t0, t1, 1, 5, 3, 7); \
    v128_t s2 = wasm_i32x4_shuffle(t2, t3, 0, 4, 2, 6); \
    v128_t s3 = wasm_i32x4_shuffle(t2, t3, 1, 5, 3, 7); \
    o0 = wasm_i32x4_shuffle(s0, s2, 0, 1, 4, 5); \
    o1 = wasm_i32x4_shuffle(s1, s3, 0, 1, 4, 5); \
    o2 = wasm_i32x4_shuffle(s0, s2, 2, 3, 6, 7); \
    o3 = wasm_i32x4_shuffle(s1, s3, 2, 3, 6, 7); \
  }

  #define stbir__simdi_32shr( out, reg, imm ) out = wasm_u32x4_shr( reg, imm )

  typedef float stbir__f32x4 __attribute__((__vector_size__(16), __aligned__(16)));
  #define STBIR__SIMDF_CONST(var, x) stbir__simdf var = (v128_t)(stbir__f32x4){ x, x, x, x }
  #define STBIR__SIMDI_CONST(var, x) stbir__simdi var = { x, x, x, x }
  #define STBIR__CONSTF(var) (var)
  #define STBIR__CONSTI(var) (var)

  #ifdef STBIR_FLOORF
  #undef STBIR_FLOORF
  #endif
  #define STBIR_FLOORF stbir_simd_floorf
  static stbir__inline float stbir_simd_floorf(float x)
  {
    return wasm_f32x4_extract_lane( wasm_f32x4_floor( wasm_f32x4_splat(x) ), 0);
  }

  #ifdef STBIR_CEILF
  #undef STBIR_CEILF
  #endif
  #define STBIR_CEILF stbir_simd_ceilf
  static stbir__inline float stbir_simd_ceilf(float x)
  {
    return wasm_f32x4_extract_lane( wasm_f32x4_ceil( wasm_f32x4_splat(x) ), 0);
  }

  #define STBIR_SIMD

#endif  // SSE2/NEON/WASM

#endif // NO SIMD

#ifdef STBIR_SIMD8
  #define stbir__simdfX stbir__simdf8
  #define stbir__simdiX stbir__simdi8
  #define stbir__simdfX_load stbir__simdf8_load
  #define stbir__simdiX_load stbir__simdi8_load
  #define stbir__simdfX_mult stbir__simdf8_mult
  #define stbir__simdfX_add_mem stbir__simdf8_add_mem
  #define stbir__simdfX_madd_mem stbir__simdf8_madd_mem
  #define stbir__simdfX_store stbir__simdf8_store
  #define stbir__simdiX_store stbir__simdi8_store
  #define stbir__simdf_frepX  stbir__simdf8_frep8
  #define stbir__simdfX_madd stbir__simdf8_madd
  #define stbir__simdfX_min stbir__simdf8_min
  #define stbir__simdfX_max stbir__simdf8_max
  #define stbir__simdfX_aaa1 stbir__simdf8_aaa1
  #define stbir__simdfX_1aaa stbir__simdf8_1aaa
  #define stbir__simdfX_a1a1 stbir__simdf8_a1a1
  #define stbir__simdfX_1a1a stbir__simdf8_1a1a
  #define stbir__simdfX_convert_float_to_i32 stbir__simdf8_convert_float_to_i32
  #define stbir__simdfX_pack_to_words stbir__simdf8_pack_to_16words
  #define stbir__simdfX_zero stbir__simdf8_zero
  #define STBIR_onesX STBIR_ones8
  #define STBIR_max_uint8_as_floatX STBIR_max_uint8_as_float8
  #define STBIR_max_uint16_as_floatX STBIR_max_uint16_as_float8
  #define STBIR_simd_point5X STBIR_simd_point58
  #define stbir__simdfX_float_count 8
  #define stbir__simdfX_0123to1230 stbir__simdf8_0123to12301230
  #define stbir__simdfX_0123to2103 stbir__simdf8_0123to21032103
  static const stbir__simdf8 STBIR_max_uint16_as_float_inverted8 = { stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted };
  static const stbir__simdf8 STBIR_max_uint8_as_float_inverted8 = { stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted };
  static const stbir__simdf8 STBIR_ones8 = { 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0 };
  static const stbir__simdf8 STBIR_simd_point58 = { 0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5 };
  static const stbir__simdf8 STBIR_max_uint8_as_float8 = { stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float, stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float };
  static const stbir__simdf8 STBIR_max_uint16_as_float8 = { stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float, stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float };
#else
  #define stbir__simdfX stbir__simdf
  #define stbir__simdiX stbir__simdi
  #define stbir__simdfX_load stbir__simdf_load
  #define stbir__simdiX_load stbir__simdi_load
  #define stbir__simdfX_mult stbir__simdf_mult
  #define stbir__simdfX_add_mem stbir__simdf_add_mem
  #define stbir__simdfX_madd_mem stbir__simdf_madd_mem
  #define stbir__simdfX_store stbir__simdf_store
  #define stbir__simdiX_store stbir__simdi_store
  #define stbir__simdf_frepX  stbir__simdf_frep4
  #define stbir__simdfX_madd stbir__simdf_madd
  #define stbir__simdfX_min stbir__simdf_min
  #define stbir__simdfX_max stbir__simdf_max
  #define stbir__simdfX_aaa1 stbir__simdf_aaa1
  #define stbir__simdfX_1aaa stbir__simdf_1aaa
  #define stbir__simdfX_a1a1 stbir__simdf_a1a1
  #define stbir__simdfX_1a1a stbir__simdf_1a1a
  #define stbir__simdfX_convert_float_to_i32 stbir__simdf_convert_float_to_i32
  #define stbir__simdfX_pack_to_words stbir__simdf_pack_to_8words
  #define stbir__simdfX_zero stbir__simdf_zero
  #define STBIR_onesX STBIR__CONSTF(STBIR_ones)
  #define STBIR_simd_point5X STBIR__CONSTF(STBIR_simd_point5)
  #define STBIR_max_uint8_as_floatX STBIR__CONSTF(STBIR_max_uint8_as_float)
  #define STBIR_max_uint16_as_floatX STBIR__CONSTF(STBIR_max_uint16_as_float)
  #define stbir__simdfX_float_count 4
  #define stbir__if_simdf8_cast_to_simdf4( val ) ( val )
  #define stbir__simdfX_0123to1230 stbir__simdf_0123to1230
  #define stbir__simdfX_0123to2103 stbir__simdf_0123to2103
#endif


#if defined(STBIR_NEON) && !defined(_M_ARM) && !defined(__arm__)

  #if defined( _MSC_VER ) && !defined(__clang__)
  typedef __int16 stbir__FP16;
  #else
  typedef float16_t stbir__FP16;
  #endif

#else // no NEON, or 32-bit ARM for MSVC

  typedef union stbir__FP16
  {
    unsigned short u;
  } stbir__FP16;

#endif

#if (!defined(STBIR_NEON) && !defined(STBIR_FP16C)) || (defined(STBIR_NEON) && defined(_M_ARM)) || (defined(STBIR_NEON) && defined(__arm__))

  // Fabian's half float routines, see: https://gist.github.com/rygorous/2156668

  static stbir__inline float stbir__half_to_float( stbir__FP16 h )
  {
    static const stbir__FP32 magic = { (254 - 15) << 23 };
    static const stbir__FP32 was_infnan = { (127 + 16) << 23 };
    stbir__FP32 o;

    o.u = (h.u & 0x7fff) << 13;     // exponent/mantissa bits
    o.f *= magic.f;                 // exponent adjust
    if (o.f >= was_infnan.f)        // make sure Inf/NaN survive
      o.u |= 255 << 23;
    o.u |= (h.u & 0x8000) << 16;    // sign bit
    return o.f;
  }

  static stbir__inline stbir__FP16 stbir__float_to_half(float val)
  {
    stbir__FP32 f32infty = { 255 << 23 };
    stbir__FP32 f16max   = { (127 + 16) << 23 };
    stbir__FP32 denorm_magic = { ((127 - 15) + (23 - 10) + 1) << 23 };
    unsigned int sign_mask = 0x80000000u;
    stbir__FP16 o = { 0 };
    stbir__FP32 f;
    unsigned int sign;

    f.f = val;
    sign = f.u & sign_mask;
    f.u ^= sign;

    if (f.u >= f16max.u) // result is Inf or NaN (all exponent bits set)
      o.u = (f.u > f32infty.u) ? 0x7e00 : 0x7c00; // NaN->qNaN and Inf->Inf
    else // (De)normalized number or zero
    {
      if (f.u < (113 << 23)) // resulting FP16 is subnormal or zero
      {
        // use a magic value to align our 10 mantissa bits at the bottom of
        // the float. as long as FP addition is round-to-nearest-even this
        // just works.
        f.f += denorm_magic.f;
        // and one integer subtract of the bias later, we have our final float!
        o.u = (unsigned short) ( f.u - denorm_magic.u );
      }
      else
      {
        unsigned int mant_odd = (f.u >> 13) & 1; // resulting mantissa is odd
        // update exponent, rounding bias part 1
        f.u = f.u + ((15u - 127) << 23) + 0xfff;
        // rounding bias part 2
        f.u += mant_odd;
        // take the bits!
        o.u = (unsigned short) ( f.u >> 13 );
      }
    }

    o.u |= sign >> 16;
    return o;
  }

#endif


#if defined(STBIR_FP16C)

  #include <immintrin.h>

  static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
  {
    _mm256_storeu_ps( (float*)output, _mm256_cvtph_ps( _mm_loadu_si128( (__m128i const* )input ) ) );
  }

  static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
  {
    _mm_storeu_si128( (__m128i*)output, _mm256_cvtps_ph( _mm256_loadu_ps( input ), 0 ) );
  }

  static stbir__inline float stbir__half_to_float( stbir__FP16 h )
  {
    return _mm_cvtss_f32( _mm_cvtph_ps( _mm_cvtsi32_si128( (int)h.u ) ) );
  }

  static stbir__inline stbir__FP16 stbir__float_to_half( float f )
  {
    stbir__FP16 h;
    h.u = (unsigned short) _mm_cvtsi128_si32( _mm_cvtps_ph( _mm_set_ss( f ), 0 ) );
    return h;
  }

#elif defined(STBIR_SSE2)

  // Fabian's half float routines, see: https://gist.github.com/rygorous/2156668
  stbir__inline static void stbir__half_to_float_SIMD(float * output, void const * input)
  {
    static const STBIR__SIMDI_CONST(mask_nosign,      0x7fff);
    static const STBIR__SIMDI_CONST(smallest_normal,  0x0400);
    static const STBIR__SIMDI_CONST(infinity,         0x7c00);
    static const STBIR__SIMDI_CONST(expadjust_normal, (127 - 15) << 23);
    static const STBIR__SIMDI_CONST(magic_denorm,     113 << 23);

    __m128i i = _mm_loadu_si128 ( (__m128i const*)(input) );
    __m128i h = _mm_unpacklo_epi16 ( i, _mm_setzero_si128() );
    __m128i mnosign     = STBIR__CONSTI(mask_nosign);
    __m128i eadjust     = STBIR__CONSTI(expadjust_normal);
    __m128i smallest    = STBIR__CONSTI(smallest_normal);
    __m128i infty       = STBIR__CONSTI(infinity);
    __m128i expmant     = _mm_and_si128(mnosign, h);
    __m128i justsign    = _mm_xor_si128(h, expmant);
    __m128i b_notinfnan = _mm_cmpgt_epi32(infty, expmant);
    __m128i b_isdenorm  = _mm_cmpgt_epi32(smallest, expmant);
    __m128i shifted     = _mm_slli_epi32(expmant, 13);
    __m128i adj_infnan  = _mm_andnot_si128(b_notinfnan, eadjust);
    __m128i adjusted    = _mm_add_epi32(eadjust, shifted);
    __m128i den1        = _mm_add_epi32(shifted, STBIR__CONSTI(magic_denorm));
    __m128i adjusted2   = _mm_add_epi32(adjusted, adj_infnan);
    __m128  den2        = _mm_sub_ps(_mm_castsi128_ps(den1), *(const __m128 *)&magic_denorm);
    __m128  adjusted3   = _mm_and_ps(den2, _mm_castsi128_ps(b_isdenorm));
    __m128  adjusted4   = _mm_andnot_ps(_mm_castsi128_ps(b_isdenorm), _mm_castsi128_ps(adjusted2));
    __m128  adjusted5   = _mm_or_ps(adjusted3, adjusted4);
    __m128i sign        = _mm_slli_epi32(justsign, 16);
    __m128  final       = _mm_or_ps(adjusted5, _mm_castsi128_ps(sign));
    stbir__simdf_store( output + 0,  final );

    h = _mm_unpackhi_epi16 ( i, _mm_setzero_si128() );
    expmant     = _mm_and_si128(mnosign, h);
    justsign    = _mm_xor_si128(h, expmant);
    b_notinfnan = _mm_cmpgt_epi32(infty, expmant);
    b_isdenorm  = _mm_cmpgt_epi32(smallest, expmant);
    shifted     = _mm_slli_epi32(expmant, 13);
    adj_infnan  = _mm_andnot_si128(b_notinfnan, eadjust);
    adjusted    = _mm_add_epi32(eadjust, shifted);
    den1        = _mm_add_epi32(shifted, STBIR__CONSTI(magic_denorm));
    adjusted2   = _mm_add_epi32(adjusted, adj_infnan);
    den2        = _mm_sub_ps(_mm_castsi128_ps(den1), *(const __m128 *)&magic_denorm);
    adjusted3   = _mm_and_ps(den2, _mm_castsi128_ps(b_isdenorm));
    adjusted4   = _mm_andnot_ps(_mm_castsi128_ps(b_isdenorm), _mm_castsi128_ps(adjusted2));
    adjusted5   = _mm_or_ps(adjusted3, adjusted4);
    sign        = _mm_slli_epi32(justsign, 16);
    final       = _mm_or_ps(adjusted5, _mm_castsi128_ps(sign));
    stbir__simdf_store( output + 4,  final );

    // ~38 SSE2 ops for 8 values
  }

  // Fabian's round-to-nearest-even float to half
  // ~48 SSE2 ops for 8 outputs
  stbir__inline static void stbir__float_to_half_SIMD(void * output, float const * input)
  {
    static const STBIR__SIMDI_CONST(mask_sign,      0x80000000u);
    static const STBIR__SIMDI_CONST(c_f16max,       (127 + 16) << 23); // all FP32 values >=this round to +inf
    static const STBIR__SIMDI_CONST(c_nanbit,        0x200);
    static const STBIR__SIMDI_CONST(c_infty_as_fp16, 0x7c00);
    static const STBIR__SIMDI_CONST(c_min_normal,    (127 - 14) << 23); // smallest FP32 that yields a normalized FP16
    static const STBIR__SIMDI_CONST(c_subnorm_magic, ((127 - 15) + (23 - 10) + 1) << 23);
    static const STBIR__SIMDI_CONST(c_normal_bias,    0xfff - ((127 - 15) << 23)); // adjust exponent and add mantissa rounding

    __m128  f           =  _mm_loadu_ps(input);
    __m128  msign       = _mm_castsi128_ps(STBIR__CONSTI(mask_sign));
    __m128  justsign    = _mm_and_ps(msign, f);
    __m128  absf        = _mm_xor_ps(f, justsign);
    __m128i absf_int    = _mm_castps_si128(absf); // the cast is "free" (extra bypass latency, but no throughput hit)
    __m128i f16max      = STBIR__CONSTI(c_f16max);
    __m128  b_isnan     = _mm_cmpunord_ps(absf, absf); // is this a NaN?
    __m128i b_isregular = _mm_cmpgt_epi32(f16max, absf_int); // (sub)normalized or special?
    __m128i nanbit      = _mm_and_si128(_mm_castps_si128(b_isnan), STBIR__CONSTI(c_nanbit));
    __m128i inf_or_nan  = _mm_or_si128(nanbit, STBIR__CONSTI(c_infty_as_fp16)); // output for specials

    __m128i min_normal  = STBIR__CONSTI(c_min_normal);
    __m128i b_issub     = _mm_cmpgt_epi32(min_normal, absf_int);

    // "result is subnormal" path
    __m128  subnorm1    = _mm_add_ps(absf, _mm_castsi128_ps(STBIR__CONSTI(c_subnorm_magic))); // magic value to round output mantissa
    __m128i subnorm2    = _mm_sub_epi32(_mm_castps_si128(subnorm1), STBIR__CONSTI(c_subnorm_magic)); // subtract out bias

    // "result is normal" path
    __m128i mantoddbit  = _mm_slli_epi32(absf_int, 31 - 13); // shift bit 13 (mantissa LSB) to sign
    __m128i mantodd     = _mm_srai_epi32(mantoddbit, 31); // -1 if FP16 mantissa odd, else 0

    __m128i round1      = _mm_add_epi32(absf_int, STBIR__CONSTI(c_normal_bias));
    __m128i round2      = _mm_sub_epi32(round1, mantodd); // if mantissa LSB odd, bias towards rounding up (RTNE)
    __m128i normal      = _mm_srli_epi32(round2, 13); // rounded result

    // combine the two non-specials
    __m128i nonspecial  = _mm_or_si128(_mm_and_si128(subnorm2, b_issub), _mm_andnot_si128(b_issub, normal));

    // merge in specials as well
    __m128i joined      = _mm_or_si128(_mm_and_si128(nonspecial, b_isregular), _mm_andnot_si128(b_isregular, inf_or_nan));

    __m128i sign_shift  = _mm_srai_epi32(_mm_castps_si128(justsign), 16);
    __m128i final2, final= _mm_or_si128(joined, sign_shift);

    f           =  _mm_loadu_ps(input+4);
    justsign    = _mm_and_ps(msign, f);
    absf        = _mm_xor_ps(f, justsign);
    absf_int    = _mm_castps_si128(absf); // the cast is "free" (extra bypass latency, but no throughput hit)
    b_isnan     = _mm_cmpunord_ps(absf, absf); // is this a NaN?
    b_isregular = _mm_cmpgt_epi32(f16max, absf_int); // (sub)normalized or special?
    nanbit      = _mm_and_si128(_mm_castps_si128(b_isnan), STBIR__CONSTI(c_nanbit));
    inf_or_nan  = _mm_or_si128(nanbit, STBIR__CONSTI(c_infty_as_fp16)); // output for specials

    b_issub     = _mm_cmpgt_epi32(min_normal, absf_int);

    // "result is subnormal" path
    subnorm1    = _mm_add_ps(absf, _mm_castsi128_ps(STBIR__CONSTI(c_subnorm_magic))); // magic value to round output mantissa
    subnorm2    = _mm_sub_epi32(_mm_castps_si128(subnorm1), STBIR__CONSTI(c_subnorm_magic)); // subtract out bias

    // "result is normal" path
    mantoddbit  = _mm_slli_epi32(absf_int, 31 - 13); // shift bit 13 (mantissa LSB) to sign
    mantodd     = _mm_srai_epi32(mantoddbit, 31); // -1 if FP16 mantissa odd, else 0

    round1      = _mm_add_epi32(absf_int, STBIR__CONSTI(c_normal_bias));
    round2      = _mm_sub_epi32(round1, mantodd); // if mantissa LSB odd, bias towards rounding up (RTNE)
    normal      = _mm_srli_epi32(round2, 13); // rounded result

    // combine the two non-specials
    nonspecial  = _mm_or_si128(_mm_and_si128(subnorm2, b_issub), _mm_andnot_si128(b_issub, normal));

    // merge in specials as well
    joined      = _mm_or_si128(_mm_and_si128(nonspecial, b_isregular), _mm_andnot_si128(b_isregular, inf_or_nan));

    sign_shift  = _mm_srai_epi32(_mm_castps_si128(justsign), 16);
    final2      = _mm_or_si128(joined, sign_shift);
    final       = _mm_packs_epi32(final, final2);
    stbir__simdi_store( output,final );
  }

2432#elif defined(STBIR_NEON) && defined(_MSC_VER) && defined(_M_ARM64) && !defined(__clang__) // 64-bit ARM on MSVC (not clang)
2433
2434  static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
2435  {
2436    float16x4_t in0 = vld1_f16(input + 0);
2437    float16x4_t in1 = vld1_f16(input + 4);
2438    vst1q_f32(output + 0, vcvt_f32_f16(in0));
2439    vst1q_f32(output + 4, vcvt_f32_f16(in1));
2440  }
2441
2442  static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
2443  {
2444    float16x4_t out0 = vcvt_f16_f32(vld1q_f32(input + 0));
2445    float16x4_t out1 = vcvt_f16_f32(vld1q_f32(input + 4));
2446    vst1_f16(output+0, out0);
2447    vst1_f16(output+4, out1);
2448  }
2449
2450  static stbir__inline float stbir__half_to_float( stbir__FP16 h )
2451  {
2452    return vgetq_lane_f32(vcvt_f32_f16(vld1_dup_f16(&h)), 0);
2453  }
2454
2455  static stbir__inline stbir__FP16 stbir__float_to_half( float f )
2456  {
2457    return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0).n16_u16[0];
2458  }
2459
2460#elif defined(STBIR_NEON) && ( defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) ) // 64-bit ARM
2461
2462  static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
2463  {
2464    float16x8_t in = vld1q_f16(input);
2465    vst1q_f32(output + 0, vcvt_f32_f16(vget_low_f16(in)));
2466    vst1q_f32(output + 4, vcvt_f32_f16(vget_high_f16(in)));
2467  }
2468
2469  static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
2470  {
2471    float16x4_t out0 = vcvt_f16_f32(vld1q_f32(input + 0));
2472    float16x4_t out1 = vcvt_f16_f32(vld1q_f32(input + 4));
2473    vst1q_f16(output, vcombine_f16(out0, out1));
2474  }
2475
2476  static stbir__inline float stbir__half_to_float( stbir__FP16 h )
2477  {
2478    return vgetq_lane_f32(vcvt_f32_f16(vdup_n_f16(h)), 0);
2479  }
2480
2481  static stbir__inline stbir__FP16 stbir__float_to_half( float f )
2482  {
2483    return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0);
2484  }
2485
2486#elif defined(STBIR_WASM) || (defined(STBIR_NEON) && (defined(_MSC_VER) || defined(_M_ARM) || defined(__arm__))) // WASM or 32-bit ARM on MSVC/clang
2487
2488  static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
2489  {
2490    for (int i=0; i<8; i++)
2491    {
2492      output[i] = stbir__half_to_float(input[i]);
2493    }
2494  }
2495  static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
2496  {
2497    for (int i=0; i<8; i++)
2498    {
2499      output[i] = stbir__float_to_half(input[i]);
2500    }
2501  }
2502
2503#endif
2504
2505
2506#ifdef STBIR_SIMD
2507
2508#define stbir__simdf_0123to3333( out, reg ) (out) = stbir__simdf_swiz( reg, 3,3,3,3 )
2509#define stbir__simdf_0123to2222( out, reg ) (out) = stbir__simdf_swiz( reg, 2,2,2,2 )
2510#define stbir__simdf_0123to1111( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,1,1 )
2511#define stbir__simdf_0123to0000( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,0,0 )
2512#define stbir__simdf_0123to0003( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,0,3 )
2513#define stbir__simdf_0123to0001( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,0,1 )
2514#define stbir__simdf_0123to1122( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,2,2 )
2515#define stbir__simdf_0123to2333( out, reg ) (out) = stbir__simdf_swiz( reg, 2,3,3,3 )
2516#define stbir__simdf_0123to0023( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,2,3 )
2517#define stbir__simdf_0123to1230( out, reg ) (out) = stbir__simdf_swiz( reg, 1,2,3,0 )
2518#define stbir__simdf_0123to2103( out, reg ) (out) = stbir__simdf_swiz( reg, 2,1,0,3 )
2519#define stbir__simdf_0123to3210( out, reg ) (out) = stbir__simdf_swiz( reg, 3,2,1,0 )
2520#define stbir__simdf_0123to2301( out, reg ) (out) = stbir__simdf_swiz( reg, 2,3,0,1 )
2521#define stbir__simdf_0123to3012( out, reg ) (out) = stbir__simdf_swiz( reg, 3,0,1,2 )
2522#define stbir__simdf_0123to0011( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,1,1 )
2523#define stbir__simdf_0123to1100( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,0,0 )
2524#define stbir__simdf_0123to2233( out, reg ) (out) = stbir__simdf_swiz( reg, 2,2,3,3 )
2525#define stbir__simdf_0123to1133( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,3,3 )
2526#define stbir__simdf_0123to0022( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,2,2 )
2527#define stbir__simdf_0123to1032( out, reg ) (out) = stbir__simdf_swiz( reg, 1,0,3,2 )
2528
2529typedef union stbir__simdi_u32
2530{
2531  stbir_uint32 m128i_u32[4];
2532  int m128i_i32[4];
2533  stbir__simdi m128i_i128;
2534} stbir__simdi_u32;
2535
2536static const int STBIR_mask[9] = { 0,0,0,-1,-1,-1,0,0,0 };
2537
2538static const STBIR__SIMDF_CONST(STBIR_max_uint8_as_float,           stbir__max_uint8_as_float);
2539static const STBIR__SIMDF_CONST(STBIR_max_uint16_as_float,          stbir__max_uint16_as_float);
2540static const STBIR__SIMDF_CONST(STBIR_max_uint8_as_float_inverted,  stbir__max_uint8_as_float_inverted);
2541static const STBIR__SIMDF_CONST(STBIR_max_uint16_as_float_inverted, stbir__max_uint16_as_float_inverted);
2542
2543static const STBIR__SIMDF_CONST(STBIR_simd_point5,   0.5f);
2544static const STBIR__SIMDF_CONST(STBIR_ones,          1.0f);
2545static const STBIR__SIMDI_CONST(STBIR_almost_zero,   (127 - 13) << 23);
2546static const STBIR__SIMDI_CONST(STBIR_almost_one,    0x3f7fffff);
2547static const STBIR__SIMDI_CONST(STBIR_mastissa_mask, 0xff);
2548static const STBIR__SIMDI_CONST(STBIR_topscale,      0x02000000);
2549
2550//   Basically, in simd mode, we unroll the proper amount ourselves, and we don't want
2551//   the non-simd remnant loops to be unrolled, because they only run a few times.
2552//   Adding this switch saves about 5K on clang which is Captain Unroll the 3rd.
2553#define STBIR_SIMD_STREAMOUT_PTR( star )  STBIR_STREAMOUT_PTR( star )
2554#define STBIR_SIMD_NO_UNROLL(ptr) STBIR_NO_UNROLL(ptr)
2555#define STBIR_SIMD_NO_UNROLL_LOOP_START STBIR_NO_UNROLL_LOOP_START
2556#define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START_INF_FOR
2557
2558#ifdef STBIR_MEMCPY
2559#undef STBIR_MEMCPY
2560#endif
2561#define STBIR_MEMCPY stbir_simd_memcpy
2562
2563// override normal use of memcpy with much simpler copy (faster and smaller with our sized copies)
2564static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes )
2565{
2566  char STBIR_SIMD_STREAMOUT_PTR (*) d = (char*) dest;
2567  char STBIR_SIMD_STREAMOUT_PTR( * ) d_end = ((char*) dest) + bytes;
2568  ptrdiff_t ofs_to_src = (char*)src - (char*)dest;
2569
2570  // check overlaps
2571  STBIR_ASSERT( ( ( d >= ( (char*)src) + bytes ) ) || ( ( d + bytes ) <= (char*)src ) );
2572
2573  if ( bytes < (16*stbir__simdfX_float_count) )
2574  {
2575    if ( bytes < 16 )
2576    {
2577      if ( bytes )
2578      {
2579        STBIR_SIMD_NO_UNROLL_LOOP_START
2580        do
2581        {
2582          STBIR_SIMD_NO_UNROLL(d);
2583          d[ 0 ] = d[ ofs_to_src ];
2584          ++d;
2585        } while ( d < d_end );
2586      }
2587    }
2588    else
2589    {
2590      stbir__simdf x;
2591      // do one unaligned to get us aligned for the stream out below
2592      stbir__simdf_load( x, ( d + ofs_to_src ) );
2593      stbir__simdf_store( d, x );
2594      d = (char*)( ( ( (size_t)d ) + 16 ) & ~15 );
2595
2596      STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
2597      for(;;)
2598      {
2599        STBIR_SIMD_NO_UNROLL(d);
2600
2601        if ( d > ( d_end - 16 ) )
2602        {
2603          if ( d == d_end )
2604            return;
2605          d = d_end - 16;
2606        }
2607
2608        stbir__simdf_load( x, ( d + ofs_to_src ) );
2609        stbir__simdf_store( d, x );
2610        d += 16;
2611      }
2612    }
2613  }
2614  else
2615  {
2616    stbir__simdfX x0,x1,x2,x3;
2617
2618    // do one unaligned to get us aligned for the stream out below
2619    stbir__simdfX_load( x0, ( d + ofs_to_src ) +  0*stbir__simdfX_float_count );
2620    stbir__simdfX_load( x1, ( d + ofs_to_src ) +  4*stbir__simdfX_float_count );
2621    stbir__simdfX_load( x2, ( d + ofs_to_src ) +  8*stbir__simdfX_float_count );
2622    stbir__simdfX_load( x3, ( d + ofs_to_src ) + 12*stbir__simdfX_float_count );
2623    stbir__simdfX_store( d +  0*stbir__simdfX_float_count, x0 );
2624    stbir__simdfX_store( d +  4*stbir__simdfX_float_count, x1 );
2625    stbir__simdfX_store( d +  8*stbir__simdfX_float_count, x2 );
2626    stbir__simdfX_store( d + 12*stbir__simdfX_float_count, x3 );
2627    d = (char*)( ( ( (size_t)d ) + (16*stbir__simdfX_float_count) ) & ~((16*stbir__simdfX_float_count)-1) );
2628
2629    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
2630    for(;;)
2631    {
2632      STBIR_SIMD_NO_UNROLL(d);
2633
2634      if ( d > ( d_end - (16*stbir__simdfX_float_count) ) )
2635      {
2636        if ( d == d_end )
2637          return;
2638        d = d_end - (16*stbir__simdfX_float_count);
2639      }
2640
2641      stbir__simdfX_load( x0, ( d + ofs_to_src ) +  0*stbir__simdfX_float_count );
2642      stbir__simdfX_load( x1, ( d + ofs_to_src ) +  4*stbir__simdfX_float_count );
2643      stbir__simdfX_load( x2, ( d + ofs_to_src ) +  8*stbir__simdfX_float_count );
2644      stbir__simdfX_load( x3, ( d + ofs_to_src ) + 12*stbir__simdfX_float_count );
2645      stbir__simdfX_store( d +  0*stbir__simdfX_float_count, x0 );
2646      stbir__simdfX_store( d +  4*stbir__simdfX_float_count, x1 );
2647      stbir__simdfX_store( d +  8*stbir__simdfX_float_count, x2 );
2648      stbir__simdfX_store( d + 12*stbir__simdfX_float_count, x3 );
2649      d += (16*stbir__simdfX_float_count);
2650    }
2651  }
2652}
2653
2654// memcpy that is specifically intentionally overlapping (src is smaller than dest, so can be
2655//   a normal forward copy, bytes is divisible by 4 and bytes is greater than or equal to
2656//   the diff between dest and src)
2657static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes )
2658{
2659  char STBIR_SIMD_STREAMOUT_PTR (*) sd = (char*) src;
2660  char STBIR_SIMD_STREAMOUT_PTR( * ) s_end = ((char*) src) + bytes;
2661  ptrdiff_t ofs_to_dest = (char*)dest - (char*)src;
2662
2663  if ( ofs_to_dest >= 16 ) // is the overlap more than 16 away?
2664  {
2665    char STBIR_SIMD_STREAMOUT_PTR( * ) s_end16 = ((char*) src) + (bytes&~15);
2666    STBIR_SIMD_NO_UNROLL_LOOP_START
2667    do
2668    {
2669      stbir__simdf x;
2670      STBIR_SIMD_NO_UNROLL(sd);
2671      stbir__simdf_load( x, sd );
2672      stbir__simdf_store(  ( sd + ofs_to_dest ), x );
2673      sd += 16;
2674    } while ( sd < s_end16 );
2675
2676    if ( sd == s_end )
2677      return;
2678  }
2679
2680  do
2681  {
2682    STBIR_SIMD_NO_UNROLL(sd);
2683    *(int*)( sd + ofs_to_dest ) = *(int*) sd;
2684    sd += 4;
2685  } while ( sd < s_end );
2686}
2687
2688#else // no STBIR_SIMD
2689
2690// when in scalar mode, we let unrolling happen, so this macro just does the __restrict
2691#define STBIR_SIMD_STREAMOUT_PTR( star ) STBIR_STREAMOUT_PTR( star )
2692#define STBIR_SIMD_NO_UNROLL(ptr)
2693#define STBIR_SIMD_NO_UNROLL_LOOP_START
2694#define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
2695
2696#endif // STBIR_SIMD
2697
2698
2699#ifdef STBIR_PROFILE
2700
2701#ifndef STBIR_PROFILE_FUNC
2702
2703#if defined(_x86_64) || defined( __x86_64__ ) || defined( _M_X64 ) || defined(__x86_64) || defined(__SSE2__) || defined(STBIR_SSE) || defined( _M_IX86_FP ) || defined(__i386) || defined( __i386__ ) || defined( _M_IX86 ) || defined( _X86_ )
2704
2705#ifdef _MSC_VER
2706
2707  STBIRDEF stbir_uint64 __rdtsc();
2708  #define STBIR_PROFILE_FUNC() __rdtsc()
2709
2710#else // non msvc
2711
2712  static stbir__inline stbir_uint64 STBIR_PROFILE_FUNC()
2713  {
2714    stbir_uint32 lo, hi;
2715    asm volatile ("rdtsc" : "=a" (lo), "=d" (hi) );
2716    return ( ( (stbir_uint64) hi ) << 32 ) | ( (stbir_uint64) lo );
2717  }
2718
2719#endif  // msvc
2720
2721#elif defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || defined(__ARM_NEON__)
2722
2723#if defined( _MSC_VER ) && !defined(__clang__)
2724
2725  #define STBIR_PROFILE_FUNC() _ReadStatusReg(ARM64_CNTVCT)
2726
2727#else
2728
2729  static stbir__inline stbir_uint64 STBIR_PROFILE_FUNC()
2730  {
2731    stbir_uint64 tsc;
2732    asm volatile("mrs %0, cntvct_el0" : "=r" (tsc));
2733    return tsc;
2734  }
2735
2736#endif
2737
2738#else // x64, arm
2739
2740#error Unknown platform for profiling.
2741
2742#endif  // x64, arm
2743
2744#endif // STBIR_PROFILE_FUNC
2745
2746#define STBIR_ONLY_PROFILE_GET_SPLIT_INFO ,stbir__per_split_info * split_info
2747#define STBIR_ONLY_PROFILE_SET_SPLIT_INFO ,split_info
2748
2749#define STBIR_ONLY_PROFILE_BUILD_GET_INFO ,stbir__info * profile_info
2750#define STBIR_ONLY_PROFILE_BUILD_SET_INFO ,profile_info
2751
2752// super light-weight micro profiler
2753#define STBIR_PROFILE_START_ll( info, wh ) { stbir_uint64 wh##thiszonetime = STBIR_PROFILE_FUNC(); stbir_uint64 * wh##save_parent_excluded_ptr = info->current_zone_excluded_ptr; stbir_uint64 wh##current_zone_excluded = 0; info->current_zone_excluded_ptr = &wh##current_zone_excluded;
2754#define STBIR_PROFILE_END_ll( info, wh ) wh##thiszonetime = STBIR_PROFILE_FUNC() - wh##thiszonetime; info->profile.named.wh += wh##thiszonetime - wh##current_zone_excluded; *wh##save_parent_excluded_ptr += wh##thiszonetime; info->current_zone_excluded_ptr = wh##save_parent_excluded_ptr; }
2755#define STBIR_PROFILE_FIRST_START_ll( info, wh ) { int i; info->current_zone_excluded_ptr = &info->profile.named.total; for(i=0;i<STBIR__ARRAY_SIZE(info->profile.array);i++) info->profile.array[i]=0; } STBIR_PROFILE_START_ll( info, wh );
2756#define STBIR_PROFILE_CLEAR_EXTRAS_ll( info, num ) { int extra; for(extra=1;extra<(num);extra++) { int i; for(i=0;i<STBIR__ARRAY_SIZE((info)->profile.array);i++) (info)[extra].profile.array[i]=0; } }
2757
2758// for thread data
2759#define STBIR_PROFILE_START( wh ) STBIR_PROFILE_START_ll( split_info, wh )
2760#define STBIR_PROFILE_END( wh ) STBIR_PROFILE_END_ll( split_info, wh )
2761#define STBIR_PROFILE_FIRST_START( wh ) STBIR_PROFILE_FIRST_START_ll( split_info, wh )
2762#define STBIR_PROFILE_CLEAR_EXTRAS() STBIR_PROFILE_CLEAR_EXTRAS_ll( split_info, split_count )
2763
2764// for build data
2765#define STBIR_PROFILE_BUILD_START( wh ) STBIR_PROFILE_START_ll( profile_info, wh )
2766#define STBIR_PROFILE_BUILD_END( wh ) STBIR_PROFILE_END_ll( profile_info, wh )
2767#define STBIR_PROFILE_BUILD_FIRST_START( wh ) STBIR_PROFILE_FIRST_START_ll( profile_info, wh )
2768#define STBIR_PROFILE_BUILD_CLEAR( info ) { int i; for(i=0;i<STBIR__ARRAY_SIZE(info->profile.array);i++) info->profile.array[i]=0; }
2769
2770#else  // no profile
2771
2772#define STBIR_ONLY_PROFILE_GET_SPLIT_INFO
2773#define STBIR_ONLY_PROFILE_SET_SPLIT_INFO
2774
2775#define STBIR_ONLY_PROFILE_BUILD_GET_INFO
2776#define STBIR_ONLY_PROFILE_BUILD_SET_INFO
2777
2778#define STBIR_PROFILE_START( wh )
2779#define STBIR_PROFILE_END( wh )
2780#define STBIR_PROFILE_FIRST_START( wh )
2781#define STBIR_PROFILE_CLEAR_EXTRAS( )
2782
2783#define STBIR_PROFILE_BUILD_START( wh )
2784#define STBIR_PROFILE_BUILD_END( wh )
2785#define STBIR_PROFILE_BUILD_FIRST_START( wh )
2786#define STBIR_PROFILE_BUILD_CLEAR( info )
2787
2788#endif  // stbir_profile
2789
2790#ifndef STBIR_CEILF
2791#include <math.h>
2792#if defined(_MSC_VER) && _MSC_VER <= 1200 // support VC6 for Sean (only test the version when _MSC_VER is defined)
2793#define STBIR_CEILF(x) ((float)ceil((float)(x)))
2794#define STBIR_FLOORF(x) ((float)floor((float)(x)))
2795#else
2796#define STBIR_CEILF(x) ceilf(x)
2797#define STBIR_FLOORF(x) floorf(x)
2798#endif
2799#endif
2800
2801#ifndef STBIR_MEMCPY
2802// For memcpy
2803#include <string.h>
2804#define STBIR_MEMCPY( dest, src, len ) memcpy( dest, src, len )
2805#endif
2806
2807#ifndef STBIR_SIMD
2808
2809// memcpy that is specifically intentionally overlapping (src is smaller than dest, so can be
2810//   a normal forward copy, bytes is divisible by 4 and bytes is greater than or equal to
2811//   the diff between dest and src)
2812static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes )
2813{
2814  char STBIR_SIMD_STREAMOUT_PTR (*) sd = (char*) src;
2815  char STBIR_SIMD_STREAMOUT_PTR( * ) s_end = ((char*) src) + bytes;
2816  ptrdiff_t ofs_to_dest = (char*)dest - (char*)src;
2817
2818  if ( ofs_to_dest >= 8 ) // is the overlap more than 8 away?
2819  {
2820    char STBIR_SIMD_STREAMOUT_PTR( * ) s_end8 = ((char*) src) + (bytes&~7);
2821    STBIR_NO_UNROLL_LOOP_START
2822    do
2823    {
2824      STBIR_NO_UNROLL(sd);
2825      *(stbir_uint64*)( sd + ofs_to_dest ) = *(stbir_uint64*) sd;
2826      sd += 8;
2827    } while ( sd < s_end8 );
2828
2829    if ( sd == s_end )
2830      return;
2831  }
2832
2833  STBIR_NO_UNROLL_LOOP_START
2834  do
2835  {
2836    STBIR_NO_UNROLL(sd);
2837    *(int*)( sd + ofs_to_dest ) = *(int*) sd;
2838    sd += 4;
2839  } while ( sd < s_end );
2840}
2841
2842#endif
2843
2844static float stbir__filter_trapezoid(float x, float scale, void * user_data)
2845{
2846  float halfscale = scale / 2;
2847  float t = 0.5f + halfscale;
2848  STBIR_ASSERT(scale <= 1);
2849  STBIR__UNUSED(user_data);
2850
2851  if ( x < 0.0f ) x = -x;
2852
2853  if (x >= t)
2854    return 0.0f;
2855  else
2856  {
2857    float r = 0.5f - halfscale;
2858    if (x <= r)
2859      return 1.0f;
2860    else
2861      return (t - x) / scale;
2862  }
2863}
2864
2865static float stbir__support_trapezoid(float scale, void * user_data)
2866{
2867  STBIR__UNUSED(user_data);
2868  return 0.5f + scale / 2.0f;
2869}
2870
2871static float stbir__filter_triangle(float x, float s, void * user_data)
2872{
2873  STBIR__UNUSED(s);
2874  STBIR__UNUSED(user_data);
2875
2876  if ( x < 0.0f ) x = -x;
2877
2878  if (x <= 1.0f)
2879    return 1.0f - x;
2880  else
2881    return 0.0f;
2882}
2883
2884static float stbir__filter_point(float x, float s, void * user_data)
2885{
2886  STBIR__UNUSED(x);
2887  STBIR__UNUSED(s);
2888  STBIR__UNUSED(user_data);
2889
2890  return 1.0f;
2891}
2892
2893static float stbir__filter_cubic(float x, float s, void * user_data)
2894{
2895  STBIR__UNUSED(s);
2896  STBIR__UNUSED(user_data);
2897
2898  if ( x < 0.0f ) x = -x;
2899
2900  if (x < 1.0f)
2901    return (4.0f + x*x*(3.0f*x - 6.0f))/6.0f;
2902  else if (x < 2.0f)
2903    return (8.0f + x*(-12.0f + x*(6.0f - x)))/6.0f;
2904
2905  return (0.0f);
2906}
2907
2908static float stbir__filter_catmullrom(float x, float s, void * user_data)
2909{
2910  STBIR__UNUSED(s);
2911  STBIR__UNUSED(user_data);
2912
2913  if ( x < 0.0f ) x = -x;
2914
2915  if (x < 1.0f)
2916    return 1.0f - x*x*(2.5f - 1.5f*x);
2917  else if (x < 2.0f)
2918    return 2.0f - x*(4.0f + x*(0.5f*x - 2.5f));
2919
2920  return (0.0f);
2921}
2922
2923static float stbir__filter_mitchell(float x, float s, void * user_data)
2924{
2925  STBIR__UNUSED(s);
2926  STBIR__UNUSED(user_data);
2927
2928  if ( x < 0.0f ) x = -x;
2929
2930  if (x < 1.0f)
2931    return (16.0f + x*x*(21.0f * x - 36.0f))/18.0f;
2932  else if (x < 2.0f)
2933    return (32.0f + x*(-60.0f + x*(36.0f - 7.0f*x)))/18.0f;
2934
2935  return (0.0f);
2936}
2937
2938static float stbir__support_zeropoint5(float s, void * user_data)
2939{
2940  STBIR__UNUSED(s);
2941  STBIR__UNUSED(user_data);
2942  return 0.5f;
2943}
2944
2945static float stbir__support_one(float s, void * user_data)
2946{
2947  STBIR__UNUSED(s);
2948  STBIR__UNUSED(user_data);
2949  return 1;
2950}
2951
2952static float stbir__support_two(float s, void * user_data)
2953{
2954  STBIR__UNUSED(s);
2955  STBIR__UNUSED(user_data);
2956  return 2;
2957}
2958
2959// This is the maximum number of input samples that can affect an output sample
2960// with the given filter from the output pixel's perspective
2961static int stbir__get_filter_pixel_width(stbir__support_callback * support, float scale, void * user_data)
2962{
2963  STBIR_ASSERT(support != 0);
2964
2965  if ( scale >= ( 1.0f-stbir__small_float ) ) // upscale
2966    return (int)STBIR_CEILF(support(1.0f/scale,user_data) * 2.0f);
2967  else
2968    return (int)STBIR_CEILF(support(scale,user_data) * 2.0f / scale);
2969}
2970
2971// this is how many coefficients per run of the filter (which is different
2972//   from the filter_pixel_width depending on if we are scattering or gathering)
2973static int stbir__get_coefficient_width(stbir__sampler * samp, int is_gather, void * user_data)
2974{
2975  float scale = samp->scale_info.scale;
2976  stbir__support_callback * support = samp->filter_support;
2977
2978  switch( is_gather )
2979  {
2980    case 1:
2981      return (int)STBIR_CEILF(support(1.0f / scale, user_data) * 2.0f);
2982    case 2:
2983      return (int)STBIR_CEILF(support(scale, user_data) * 2.0f / scale);
2984    case 0:
2985      return (int)STBIR_CEILF(support(scale, user_data) * 2.0f);
2986    default:
2987      STBIR_ASSERT( (is_gather >= 0 ) && (is_gather <= 2 ) );
2988      return 0;
2989  }
2990}
2991
2992static int stbir__get_contributors(stbir__sampler * samp, int is_gather)
2993{
2994  if (is_gather)
2995      return samp->scale_info.output_sub_size;
2996  else
2997      return (samp->scale_info.input_full_size + samp->filter_pixel_margin * 2);
2998}
2999
3000static int stbir__edge_zero_full( int n, int max )
3001{
3002  STBIR__UNUSED(n);
3003  STBIR__UNUSED(max);
3004  return 0; // NOTREACHED
3005}
3006
3007static int stbir__edge_clamp_full( int n, int max )
3008{
3009  if (n < 0)
3010    return 0;
3011
3012  if (n >= max)
3013    return max - 1;
3014
3015  return n; // NOTREACHED
3016}
3017
3018static int stbir__edge_reflect_full( int n, int max )
3019{
3020  if (n < 0)
3021  {
3022    if (n > -max)
3023      return -n;
3024    else
3025      return max - 1;
3026  }
3027
3028  if (n >= max)
3029  {
3030    int max2 = max * 2;
3031    if (n >= max2)
3032      return 0;
3033    else
3034      return max2 - n - 1;
3035  }
3036
3037  return n; // NOTREACHED
3038}
3039
3040static int stbir__edge_wrap_full( int n, int max )
3041{
3042  if (n >= 0)
3043    return (n % max);
3044  else
3045  {
3046    int m = (-n) % max;
3047
3048    if (m != 0)
3049      m = max - m;
3050
3051    return (m);
3052  }
3053}
3054
3055typedef int stbir__edge_wrap_func( int n, int max );
3056static stbir__edge_wrap_func * stbir__edge_wrap_slow[] =
3057{
3058  stbir__edge_clamp_full,    // STBIR_EDGE_CLAMP
3059  stbir__edge_reflect_full,  // STBIR_EDGE_REFLECT
3060  stbir__edge_wrap_full,     // STBIR_EDGE_WRAP
3061  stbir__edge_zero_full,     // STBIR_EDGE_ZERO
3062};
3063
3064stbir__inline static int stbir__edge_wrap(stbir_edge edge, int n, int max)
3065{
3066  // avoid per-pixel switch
3067  if (n >= 0 && n < max)
3068      return n;
3069  return stbir__edge_wrap_slow[edge]( n, max );
3070}
3071
3072#define STBIR__MERGE_RUNS_PIXEL_THRESHOLD 16
3073
3074// get information on the extents of a sampler
3075static void stbir__get_extents( stbir__sampler * samp, stbir__extents * scanline_extents )
3076{
3077  int j, stop;
3078  int left_margin, right_margin;
3079  int min_n = 0x7fffffff, max_n = -0x7fffffff;
3080  int min_left = 0x7fffffff, max_left = -0x7fffffff;
3081  int min_right = 0x7fffffff, max_right = -0x7fffffff;
3082  stbir_edge edge = samp->edge;
3083  stbir__contributors* contributors = samp->contributors;
3084  int output_sub_size = samp->scale_info.output_sub_size;
3085  int input_full_size = samp->scale_info.input_full_size;
3086  int filter_pixel_margin = samp->filter_pixel_margin;
3087
3088  STBIR_ASSERT( samp->is_gather );
3089
3090  stop = output_sub_size;
3091  for (j = 0; j < stop; j++ )
3092  {
3093    STBIR_ASSERT( contributors[j].n1 >= contributors[j].n0 );
3094    if ( contributors[j].n0 < min_n )
3095    {
3096      min_n = contributors[j].n0;
3097      stop = j + filter_pixel_margin;  // if we find a new min, only scan another filter width
3098      if ( stop > output_sub_size ) stop = output_sub_size;
3099    }
3100  }
3101
3102  stop = 0;
3103  for (j = output_sub_size - 1; j >= stop; j-- )
3104  {
3105    STBIR_ASSERT( contributors[j].n1 >= contributors[j].n0 );
3106    if ( contributors[j].n1 > max_n )
3107    {
3108      max_n = contributors[j].n1;
3109      stop = j - filter_pixel_margin;  // if we find a new max, only scan another filter width
3110      if (stop<0) stop = 0;
3111    }
3112  }
3113
3114  STBIR_ASSERT( scanline_extents->conservative.n0 <= min_n );
3115  STBIR_ASSERT( scanline_extents->conservative.n1 >= max_n );
3116
3117  // now calculate how much into the margins we really read
3118  left_margin = 0;
3119  if ( min_n < 0 )
3120  {
3121    left_margin = -min_n;
3122    min_n = 0;
3123  }
3124
3125  right_margin = 0;
3126  if ( max_n >= input_full_size )
3127  {
3128    right_margin = max_n - input_full_size + 1;
3129    max_n = input_full_size - 1;
3130  }
3131
3132  // edge_sizes holds the margin pixel extents (how many pixels we hang over each edge)
3133  scanline_extents->edge_sizes[0] = left_margin;
3134  scanline_extents->edge_sizes[1] = right_margin;
3135
3136  // spans[0] is the range of pixels read from the input
3137  scanline_extents->spans[0].n0 = min_n;
3138  scanline_extents->spans[0].n1 = max_n;
3139  scanline_extents->spans[0].pixel_offset_for_input = min_n;
3140
3141  // default to no other input range
3142  scanline_extents->spans[1].n0 = 0;
3143  scanline_extents->spans[1].n1 = -1;
3144  scanline_extents->spans[1].pixel_offset_for_input = 0;
3145
3146  // don't have to do edge calc for zero clamp
3147  if ( edge == STBIR_EDGE_ZERO )
3148    return;
3149
3150  // convert margin pixels to the pixels within the input (min and max)
3151  for( j = -left_margin ; j < 0 ; j++ )
3152  {
3153      int p = stbir__edge_wrap( edge, j, input_full_size );
3154      if ( p < min_left )
3155        min_left = p;
3156      if ( p > max_left )
3157        max_left = p;
3158  }
3159
3160  for( j = input_full_size ; j < (input_full_size + right_margin) ; j++ )
3161  {
3162      int p = stbir__edge_wrap( edge, j, input_full_size );
3163      if ( p < min_right )
3164        min_right = p;
3165      if ( p > max_right )
3166        max_right = p;
3167  }
3168
3169  // merge the left margin pixel region if it connects within STBIR__MERGE_RUNS_PIXEL_THRESHOLD pixels of main pixel region
3170  if ( min_left != 0x7fffffff )
3171  {
3172    if ( ( ( min_left <= min_n ) && ( ( max_left  + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= min_n ) ) ||
3173         ( ( min_n <= min_left ) && ( ( max_n  + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= max_left ) ) )
3174    {
3175      scanline_extents->spans[0].n0 = min_n = stbir__min( min_n, min_left );
3176      scanline_extents->spans[0].n1 = max_n = stbir__max( max_n, max_left );
3177      scanline_extents->spans[0].pixel_offset_for_input = min_n;
3178      left_margin = 0;
3179    }
3180  }
3181
3182  // merge the right margin pixel region if it connects within STBIR__MERGE_RUNS_PIXEL_THRESHOLD pixels of main pixel region
3183  if ( min_right != 0x7fffffff )
3184  {
3185    if ( ( ( min_right <= min_n ) && ( ( max_right  + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= min_n ) ) ||
3186         ( ( min_n <= min_right ) && ( ( max_n  + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= max_right ) ) )
3187    {
3188      scanline_extents->spans[0].n0 = min_n = stbir__min( min_n, min_right );
3189      scanline_extents->spans[0].n1 = max_n = stbir__max( max_n, max_right );
3190      scanline_extents->spans[0].pixel_offset_for_input = min_n;
3191      right_margin = 0;
3192    }
3193  }
3194
3195  STBIR_ASSERT( scanline_extents->conservative.n0 <= min_n );
3196  STBIR_ASSERT( scanline_extents->conservative.n1 >= max_n );
3197
3198  // you get two ranges when you have the WRAP edge mode and you are doing just a piece of the resize
3199  //   so you need to get a second run of pixels from the opposite side of the scanline (which you
3200  //   wouldn't need except for WRAP)
3201
3202
3203  // if we can't merge the min_left range, add it as a second range
3204  if ( ( left_margin ) && ( min_left != 0x7fffffff ) )
3205  {
3206    stbir__span * newspan = scanline_extents->spans + 1;
3207    STBIR_ASSERT( right_margin == 0 );
3208    if ( min_left < scanline_extents->spans[0].n0 )
3209    {
3210      scanline_extents->spans[1].pixel_offset_for_input = scanline_extents->spans[0].n0;
3211      scanline_extents->spans[1].n0 = scanline_extents->spans[0].n0;
3212      scanline_extents->spans[1].n1 = scanline_extents->spans[0].n1;
3213      --newspan;
3214    }
3215    newspan->pixel_offset_for_input = min_left;
3216    newspan->n0 = -left_margin;
3217    newspan->n1 = ( max_left - min_left ) - left_margin;
3218    scanline_extents->edge_sizes[0] = 0;  // don't need to copy the left margin, since we are directly decoding into the margin
3219    return;
3220  }
3221
3222  // if we can't merge the min_right range, add it as a second range
3223  if ( ( right_margin ) && ( min_right != 0x7fffffff ) )
3224  {
3225    stbir__span * newspan = scanline_extents->spans + 1;
3226    if ( min_right < scanline_extents->spans[0].n0 )
3227    {
3228      scanline_extents->spans[1].pixel_offset_for_input = scanline_extents->spans[0].n0;
3229      scanline_extents->spans[1].n0 = scanline_extents->spans[0].n0;
3230      scanline_extents->spans[1].n1 = scanline_extents->spans[0].n1;
3231      --newspan;
3232    }
3233    newspan->pixel_offset_for_input = min_right;
3234    newspan->n0 = scanline_extents->spans[1].n1 + 1;
3235    newspan->n1 = scanline_extents->spans[1].n1 + 1 + ( max_right - min_right );
3236    scanline_extents->edge_sizes[1] = 0;  // don't need to copy the right margin, since we are directly decoding into the margin
3237    return;
3238  }
3239}
3240
3241static void stbir__calculate_in_pixel_range( int * first_pixel, int * last_pixel, float out_pixel_center, float out_filter_radius, float inv_scale, float out_shift, int input_size, stbir_edge edge )
3242{
3243  int first, last;
3244  float out_pixel_influence_lowerbound = out_pixel_center - out_filter_radius;
3245  float out_pixel_influence_upperbound = out_pixel_center + out_filter_radius;
3246
3247  float in_pixel_influence_lowerbound = (out_pixel_influence_lowerbound + out_shift) * inv_scale;
3248  float in_pixel_influence_upperbound = (out_pixel_influence_upperbound + out_shift) * inv_scale;
3249
3250  first = (int)(STBIR_FLOORF(in_pixel_influence_lowerbound + 0.5f));
3251  last = (int)(STBIR_FLOORF(in_pixel_influence_upperbound - 0.5f));
3252  if ( last < first ) last = first; // point sample mode can span a value *right* at 0.5, and cause these to cross
3253
3254  if ( edge == STBIR_EDGE_WRAP )
3255  {
3256    if ( first < -input_size )
3257      first = -input_size;
3258    if ( last >= (input_size*2))
3259      last = (input_size*2) - 1;
3260  }
3261
3262  *first_pixel = first;
3263  *last_pixel = last;
3264}
3265
3266static void stbir__calculate_coefficients_for_gather_upsample( float out_filter_radius, stbir__kernel_callback * kernel, stbir__scale_info * scale_info, int num_contributors, stbir__contributors* contributors, float* coefficient_group, int coefficient_width, stbir_edge edge, void * user_data )
3267{
3268  int n, end;
3269  float inv_scale = scale_info->inv_scale;
3270  float out_shift = scale_info->pixel_shift;
3271  int input_size  = scale_info->input_full_size;
3272  int numerator = scale_info->scale_numerator;
3273  int polyphase = ( ( scale_info->scale_is_rational ) && ( numerator < num_contributors ) );
3274
3275  // Looping through out pixels
3276  end = num_contributors; if ( polyphase ) end = numerator;
3277  for (n = 0; n < end; n++)
3278  {
3279    int i;
3280    int last_non_zero;
3281    float out_pixel_center = (float)n + 0.5f;
3282    float in_center_of_out = (out_pixel_center + out_shift) * inv_scale;
3283
3284    int in_first_pixel, in_last_pixel;
3285
3286    stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, out_pixel_center, out_filter_radius, inv_scale, out_shift, input_size, edge );
3287
3288    // make sure we never generate a range larger than our precalculated coeff width
3289    //   this only happens in point sample mode, but it's a good safe thing to do anyway
3290    if ( ( in_last_pixel - in_first_pixel + 1 ) > coefficient_width )
3291      in_last_pixel = in_first_pixel + coefficient_width - 1;
3292
3293    last_non_zero = -1;
3294    for (i = 0; i <= in_last_pixel - in_first_pixel; i++)
3295    {
3296      float in_pixel_center = (float)(i + in_first_pixel) + 0.5f;
3297      float coeff = kernel(in_center_of_out - in_pixel_center, inv_scale, user_data);
3298
3299      // kill denormals
3300      if ( ( ( coeff < stbir__small_float ) && ( coeff > -stbir__small_float ) ) )
3301      {
3302        if ( i == 0 )  // if we're at the front, just eat zero contributors
3303        {
3304          STBIR_ASSERT ( ( in_last_pixel - in_first_pixel ) != 0 ); // there should be at least one contrib
3305          ++in_first_pixel;
3306          i--;
3307          continue;
3308        }
3309        coeff = 0;  // make sure is fully zero (should keep denormals away)
3310      }
3311      else
3312        last_non_zero = i;
3313
3314      coefficient_group[i] = coeff;
3315    }
3316
3317    in_last_pixel = last_non_zero+in_first_pixel; // kills trailing zeros
3318    contributors->n0 = in_first_pixel;
3319    contributors->n1 = in_last_pixel;
3320
3321    STBIR_ASSERT(contributors->n1 >= contributors->n0);
3322
3323    ++contributors;
3324    coefficient_group += coefficient_width;
3325  }
3326}
3327
3328static void stbir__insert_coeff( stbir__contributors * contribs, float * coeffs, int new_pixel, float new_coeff, int max_width )
3329{
3330  if ( new_pixel <= contribs->n1 )  // before the end
3331  {
3332    if ( new_pixel < contribs->n0 ) // before the front?
3333    {
3334      if ( ( contribs->n1 - new_pixel + 1 ) <= max_width )
3335      { 
3336        int j, o = contribs->n0 - new_pixel;
3337        for ( j = contribs->n1 - contribs->n0 ; j >= 0 ; j-- ) // shift the existing coeffs up by o, back to front
3338          coeffs[ j + o ] = coeffs[ j ];
3339        for ( j = 1 ; j < o ; j++ ) // clear in-between coeffs if there are any
3340          coeffs[ j ] = 0;
3341        coeffs[ 0 ] = new_coeff;
3342        contribs->n0 = new_pixel;
3343      }
3344    }
3345    else
3346    {
3347      coeffs[ new_pixel - contribs->n0 ] += new_coeff;
3348    }
3349  }
3350  else
3351  {
3352    if ( ( new_pixel - contribs->n0 + 1 ) <= max_width )
3353    {
3354      int j, e = new_pixel - contribs->n0;
3355      for( j = ( contribs->n1 - contribs->n0 ) + 1 ; j < e ; j++ ) // clear in-between coeffs if there are any
3356        coeffs[j] = 0;
3357
3358      coeffs[ e ] = new_coeff;
3359      contribs->n1 = new_pixel;
3360    }
3361  }
3362}
3363
3364static void stbir__calculate_out_pixel_range( int * first_pixel, int * last_pixel, float in_pixel_center, float in_pixels_radius, float scale, float out_shift, int out_size )
3365{
3366  float in_pixel_influence_lowerbound = in_pixel_center - in_pixels_radius;
3367  float in_pixel_influence_upperbound = in_pixel_center + in_pixels_radius;
3368  float out_pixel_influence_lowerbound = in_pixel_influence_lowerbound * scale - out_shift;
3369  float out_pixel_influence_upperbound = in_pixel_influence_upperbound * scale - out_shift;
3370  int out_first_pixel = (int)(STBIR_FLOORF(out_pixel_influence_lowerbound + 0.5f));
3371  int out_last_pixel = (int)(STBIR_FLOORF(out_pixel_influence_upperbound - 0.5f));
3372
3373  if ( out_first_pixel < 0 )
3374    out_first_pixel = 0;
3375  if ( out_last_pixel >= out_size )
3376    out_last_pixel = out_size - 1;
3377  *first_pixel = out_first_pixel;
3378  *last_pixel = out_last_pixel;
3379}
3380
3381static void stbir__calculate_coefficients_for_gather_downsample( int start, int end, float in_pixels_radius, stbir__kernel_callback * kernel, stbir__scale_info * scale_info, int coefficient_width, int num_contributors, stbir__contributors * contributors, float * coefficient_group, void * user_data )
3382{
3383  int in_pixel;
3384  int i;
3385  int first_out_inited = -1;
3386  float scale = scale_info->scale;
3387  float out_shift = scale_info->pixel_shift;
3388  int out_size = scale_info->output_sub_size;
3389  int numerator = scale_info->scale_numerator;
3390  int polyphase = ( ( scale_info->scale_is_rational ) && ( numerator < out_size ) );
3391
3392  STBIR__UNUSED(num_contributors);
3393
3394  // Loop through the input pixels
3395  for (in_pixel = start; in_pixel < end; in_pixel++)
3396  {
3397    float in_pixel_center = (float)in_pixel + 0.5f;
3398    float out_center_of_in = in_pixel_center * scale - out_shift;
3399    int out_first_pixel, out_last_pixel;
3400
3401    stbir__calculate_out_pixel_range( &out_first_pixel, &out_last_pixel, in_pixel_center, in_pixels_radius, scale, out_shift, out_size );
3402
3403    if ( out_first_pixel > out_last_pixel )
3404      continue;
3405
3406    // clamp or exit if we are using polyphase filtering, and the limit is up
3407    if ( polyphase )
3408    {
3409      // when polyphase, you only have to do coeffs up to the numerator count
3410      if ( out_first_pixel == numerator )
3411        break;
3412
3413      // don't do any extra work, clamp last pixel at numerator too
3414      if ( out_last_pixel >= numerator )
3415        out_last_pixel = numerator - 1;
3416    }
3417
3418    for (i = 0; i <= out_last_pixel - out_first_pixel; i++)
3419    {
3420      float out_pixel_center = (float)(i + out_first_pixel) + 0.5f;
3421      float x = out_pixel_center - out_center_of_in;
3422      float coeff = kernel(x, scale, user_data) * scale;
3423
3424      // kill the coeff if it's too small (avoid denormals)
3425      if ( ( ( coeff < stbir__small_float ) && ( coeff > -stbir__small_float ) ) )
3426        coeff = 0.0f;
3427
3428      {
3429        int out = i + out_first_pixel;
3430        float * coeffs = coefficient_group + out * coefficient_width;
3431        stbir__contributors * contribs = contributors + out;
3432
3433        // is this the first time this output pixel has been seen?  Init it.
3434        if ( out > first_out_inited )
3435        {
3436          STBIR_ASSERT( out == ( first_out_inited + 1 ) ); // ensure we have only advanced one at a time
3437          first_out_inited = out;
3438          contribs->n0 = in_pixel;
3439          contribs->n1 = in_pixel;
3440          coeffs[0]  = coeff;
3441        }
3442        else
3443        {
3444          // insert on end (always in order)
3445          if ( coeffs[0] == 0.0f )  // if the first coefficient is zero, then zap it for these coeffs
3446          {
3447            STBIR_ASSERT( ( in_pixel - contribs->n0 ) == 1 ); // ensure that when we zap, we're at the 2nd pos
3448            contribs->n0 = in_pixel;
3449          }
3450          contribs->n1 = in_pixel;
3451          STBIR_ASSERT( ( in_pixel - contribs->n0 ) < coefficient_width );
3452          coeffs[in_pixel - contribs->n0]  = coeff;
3453        }
3454      }
3455    }
3456  }
3457}
3458
3459#ifdef STBIR_RENORMALIZE_IN_FLOAT
3460#define STBIR_RENORM_TYPE float
3461#else
3462#define STBIR_RENORM_TYPE double
3463#endif
3464
3465static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter_extent_info* filter_info, stbir__scale_info * scale_info, int num_contributors, stbir__contributors* contributors, float * coefficient_group, int coefficient_width )
3466{
3467  int input_size = scale_info->input_full_size;
3468  int input_last_n1 = input_size - 1;
3469  int n, end;
3470  int lowest = 0x7fffffff;
3471  int highest = -0x7fffffff;
3472  int widest = -1;
3473  int numerator = scale_info->scale_numerator;
3474  int denominator = scale_info->scale_denominator;
3475  int polyphase = ( ( scale_info->scale_is_rational ) && ( numerator < num_contributors ) );
3476  float * coeffs;
3477  stbir__contributors * contribs;
3478
3479  // weight all the coeffs for each sample
3480  coeffs = coefficient_group;
3481  contribs = contributors;
3482  end = num_contributors; if ( polyphase ) end = numerator;
3483  for (n = 0; n < end; n++)
3484  {
3485    int i;
3486    STBIR_RENORM_TYPE filter_scale, total_filter = 0;
3487    int e;
3488
3489    // add all contribs
3490    e = contribs->n1 - contribs->n0;
3491    for( i = 0 ; i <= e ; i++ )
3492    {
3493      total_filter += (STBIR_RENORM_TYPE) coeffs[i];
3494      STBIR_ASSERT( ( coeffs[i] >= -2.0f ) && ( coeffs[i] <= 2.0f )  ); // check for wonky weights
3495    }
3496
3497    // rescale
3498    if ( ( total_filter < stbir__small_float ) && ( total_filter > -stbir__small_float ) )
3499    {
3500      // all coeffs are extremely small, just zero it
3501      contribs->n1 = contribs->n0;
3502      coeffs[0] = 0.0f;
3503    }
3504    else
3505    {
3506      // if the total isn't 1.0, rescale everything
3507      if ( ( total_filter < (1.0f-stbir__small_float) ) || ( total_filter > (1.0f+stbir__small_float) ) )
3508      {
3509        filter_scale = ((STBIR_RENORM_TYPE)1.0) / total_filter;
3510
3511        // scale them all
3512        for (i = 0; i <= e; i++)
3513          coeffs[i] = (float) ( coeffs[i] * filter_scale );
3514      }
3515    }
3516    ++contribs;
3517    coeffs += coefficient_width;
3518  }
3519
3520  // if we have a rational for the scale, we can exploit the polyphaseness to not calculate
3521  //   most of the coefficients, so we copy them here
3522  if ( polyphase )
3523  {
3524    stbir__contributors * prev_contribs = contributors;
3525    stbir__contributors * cur_contribs = contributors + numerator;
3526
3527    for( n = numerator ; n < num_contributors ; n++ )
3528    {
3529      cur_contribs->n0 = prev_contribs->n0 + denominator;
3530      cur_contribs->n1 = prev_contribs->n1 + denominator;
3531      ++cur_contribs;
3532      ++prev_contribs;
3533    }
3534    stbir_overlapping_memcpy( coefficient_group + numerator * coefficient_width, coefficient_group, ( num_contributors - numerator ) * coefficient_width * sizeof( coeffs[ 0 ] ) );
3535  }
3536
3537  coeffs = coefficient_group;
3538  contribs = contributors;
3539
3540  for (n = 0; n < num_contributors; n++)
3541  {
3542    int i;
3543
3544    // in zero edge mode, just remove out of bounds contribs completely (since their weights are accounted for now)
3545    if ( edge == STBIR_EDGE_ZERO )
3546    {
3547      // shrink the right side if necessary
3548      if ( contribs->n1 > input_last_n1 )
3549        contribs->n1 = input_last_n1;
3550
3551      // shrink the left side
3552      if ( contribs->n0 < 0 )
3553      {
3554        int j, left, skips = 0;
3555
3556        skips = -contribs->n0;
3557        contribs->n0 = 0;
3558
3559        // now move down the weights
3560        left = contribs->n1 - contribs->n0 + 1;
3561        if ( left > 0 )
3562        {
3563          for( j = 0 ; j < left ; j++ )
3564            coeffs[ j ] = coeffs[ j + skips ];
3565        }
3566      }
3567    }
3568    else if ( ( edge == STBIR_EDGE_CLAMP ) || ( edge == STBIR_EDGE_REFLECT ) )
3569    {
3570      // for clamp and reflect, calculate the true inbounds position (based on edge type) and just add that to the existing weight
3571
3572      // right hand side first
3573      if ( contribs->n1 > input_last_n1 )
3574      {
3575        int start = contribs->n0;
3576        int endi = contribs->n1;
3577        contribs->n1 = input_last_n1;
3578        for( i = input_size; i <= endi; i++ )
3579          stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( i, input_size ), coeffs[i-start], coefficient_width );
3580      }
3581
3582      // now check left hand edge
3583      if ( contribs->n0 < 0 )
3584      {
3585        int save_n0;
3586        float save_n0_coeff;
3587        float * c = coeffs - ( contribs->n0 + 1 );
3588
3589        // reinsert the coeffs with it reflected or clamped (insert accumulates, if the coeffs exist)
3590        for( i = -1 ; i > contribs->n0 ; i-- )
3591          stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( i, input_size ), *c--, coefficient_width );
3592        save_n0 = contribs->n0;
3593        save_n0_coeff = c[0]; // save it, since we didn't do the final one (i==n0), because there might be too many coeffs to hold (before we resize)!
3594
3595        // now slide all the coeffs down (since we have accumulated them in the positive contribs) and reset the first contrib
3596        contribs->n0 = 0;
3597        for(i = 0 ; i <= contribs->n1 ; i++ )
3598          coeffs[i] = coeffs[i-save_n0];
3599
3600        // now that we have shrunk down the contribs, we insert the first one safely
3601        stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( save_n0, input_size ), save_n0_coeff, coefficient_width );
3602      }
3603    }
3604
3605    if ( contribs->n0 <= contribs->n1 )
3606    {
3607      int diff = contribs->n1 - contribs->n0 + 1;
3608      while ( diff && ( coeffs[ diff-1 ] == 0.0f ) )
3609        --diff;
3610
3611      contribs->n1 = contribs->n0 + diff - 1;
3612
3613      if ( contribs->n0 <= contribs->n1 )
3614      {
3615        if ( contribs->n0 < lowest )
3616          lowest = contribs->n0;
3617        if ( contribs->n1 > highest )
3618          highest = contribs->n1;
3619        if ( diff > widest )
3620          widest = diff;
3621      }
3622
3623      // re-zero out unused coefficients (if any)
3624      for( i = diff ; i < coefficient_width ; i++ )
3625        coeffs[i] = 0.0f;
3626    }
3627
3628    ++contribs;
3629    coeffs += coefficient_width;
3630  }
3631  filter_info->lowest = lowest;
3632  filter_info->highest = highest;
3633  filter_info->widest = widest;
3634}
3635
3636#undef STBIR_RENORM_TYPE 
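
// Illustrative sketch, not part of the library: the renormalization step above in
//   isolation. The weights of one contributor window are summed in double precision and
//   rescaled so they add up to 1. The name sketch_renormalize is hypothetical, and the
//   tiny threshold is an arbitrary stand-in for stbir__small_float, whose value is not
//   shown in this part of the file.
static void sketch_renormalize( float * coeffs, int count )
{
  const double tiny = 1e-30; // assumed threshold, for this sketch only
  double total = 0.0;
  int i;

  for ( i = 0 ; i < count ; i++ )
    total += (double) coeffs[ i ];

  // a near-zero sum cannot be rescaled sensibly (the real code collapses the window instead)
  if ( ( total < tiny ) && ( total > -tiny ) )
    return;

  for ( i = 0 ; i < count ; i++ )
    coeffs[ i ] = (float) ( coeffs[ i ] * ( 1.0 / total ) );
}
// e.g. weights { 0.25, 0.5, 0.5 } sum to 1.25, so each is scaled by 0.8 and the window
//   becomes roughly { 0.2, 0.4, 0.4 }.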
3637
3638static int stbir__pack_coefficients( int num_contributors, stbir__contributors* contributors, float * coefficents, int coefficient_width, int widest, int row0, int row1 ) 
3639{
3640  #define STBIR_MOVE_1( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint32*)(dest))[0] = ((stbir_uint32*)(src))[0]; }
3641  #define STBIR_MOVE_2( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint64*)(dest))[0] = ((stbir_uint64*)(src))[0]; }
3642  #ifdef STBIR_SIMD
3643  #define STBIR_MOVE_4( dest, src ) { stbir__simdf t; STBIR_NO_UNROLL(dest); stbir__simdf_load( t, src ); stbir__simdf_store( dest, t ); }
3644  #else
3645  #define STBIR_MOVE_4( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint64*)(dest))[0] = ((stbir_uint64*)(src))[0]; ((stbir_uint64*)(dest))[1] = ((stbir_uint64*)(src))[1]; }
3646  #endif
3647
3648  int row_end = row1 + 1;
3649  STBIR__UNUSED( row0 ); // only used in an assert
3650
3651  if ( coefficient_width != widest )
3652  {
3653    float * pc = coefficents;
3654    float * coeffs = coefficents;
3655    float * pc_end = coefficents + num_contributors * widest;
3656    switch( widest )
3657    {
3658      case 1:
3659        STBIR_NO_UNROLL_LOOP_START
3660        do {
3661          STBIR_MOVE_1( pc, coeffs );
3662          ++pc;
3663          coeffs += coefficient_width;
3664        } while ( pc < pc_end );
3665        break;
3666      case 2:
3667        STBIR_NO_UNROLL_LOOP_START
3668        do {
3669          STBIR_MOVE_2( pc, coeffs );
3670          pc += 2;
3671          coeffs += coefficient_width;
3672        } while ( pc < pc_end );
3673        break;
3674      case 3:
3675        STBIR_NO_UNROLL_LOOP_START
3676        do {
3677          STBIR_MOVE_2( pc, coeffs );
3678          STBIR_MOVE_1( pc+2, coeffs+2 );
3679          pc += 3;
3680          coeffs += coefficient_width;
3681        } while ( pc < pc_end );
3682        break;
3683      case 4:
3684        STBIR_NO_UNROLL_LOOP_START
3685        do {
3686          STBIR_MOVE_4( pc, coeffs );
3687          pc += 4;
3688          coeffs += coefficient_width;
3689        } while ( pc < pc_end );
3690        break;
3691      case 5:
3692        STBIR_NO_UNROLL_LOOP_START
3693        do {
3694          STBIR_MOVE_4( pc, coeffs );
3695          STBIR_MOVE_1( pc+4, coeffs+4 );
3696          pc += 5;
3697          coeffs += coefficient_width;
3698        } while ( pc < pc_end );
3699        break;
3700      case 6:
3701        STBIR_NO_UNROLL_LOOP_START
3702        do {
3703          STBIR_MOVE_4( pc, coeffs );
3704          STBIR_MOVE_2( pc+4, coeffs+4 );
3705          pc += 6;
3706          coeffs += coefficient_width;
3707        } while ( pc < pc_end );
3708        break;
3709      case 7:
3710        STBIR_NO_UNROLL_LOOP_START
3711        do {
3712          STBIR_MOVE_4( pc, coeffs );
3713          STBIR_MOVE_2( pc+4, coeffs+4 );
3714          STBIR_MOVE_1( pc+6, coeffs+6 );
3715          pc += 7;
3716          coeffs += coefficient_width;
3717        } while ( pc < pc_end );
3718        break;
3719      case 8:
3720        STBIR_NO_UNROLL_LOOP_START
3721        do {
3722          STBIR_MOVE_4( pc, coeffs );
3723          STBIR_MOVE_4( pc+4, coeffs+4 );
3724          pc += 8;
3725          coeffs += coefficient_width;
3726        } while ( pc < pc_end );
3727        break;
3728      case 9:
3729        STBIR_NO_UNROLL_LOOP_START
3730        do {
3731          STBIR_MOVE_4( pc, coeffs );
3732          STBIR_MOVE_4( pc+4, coeffs+4 );
3733          STBIR_MOVE_1( pc+8, coeffs+8 );
3734          pc += 9;
3735          coeffs += coefficient_width;
3736        } while ( pc < pc_end );
3737        break;
3738      case 10:
3739        STBIR_NO_UNROLL_LOOP_START
3740        do {
3741          STBIR_MOVE_4( pc, coeffs );
3742          STBIR_MOVE_4( pc+4, coeffs+4 );
3743          STBIR_MOVE_2( pc+8, coeffs+8 );
3744          pc += 10;
3745          coeffs += coefficient_width;
3746        } while ( pc < pc_end );
3747        break;
3748      case 11:
3749        STBIR_NO_UNROLL_LOOP_START
3750        do {
3751          STBIR_MOVE_4( pc, coeffs );
3752          STBIR_MOVE_4( pc+4, coeffs+4 );
3753          STBIR_MOVE_2( pc+8, coeffs+8 );
3754          STBIR_MOVE_1( pc+10, coeffs+10 );
3755          pc += 11;
3756          coeffs += coefficient_width;
3757        } while ( pc < pc_end );
3758        break;
3759      case 12:
3760        STBIR_NO_UNROLL_LOOP_START
3761        do {
3762          STBIR_MOVE_4( pc, coeffs );
3763          STBIR_MOVE_4( pc+4, coeffs+4 );
3764          STBIR_MOVE_4( pc+8, coeffs+8 );
3765          pc += 12;
3766          coeffs += coefficient_width;
3767        } while ( pc < pc_end );
3768        break;
3769      default:
3770        STBIR_NO_UNROLL_LOOP_START
3771        do {
3772          float * copy_end = pc + widest - 4;
3773          float * c = coeffs;
3774          do {
3775            STBIR_NO_UNROLL( pc );
3776            STBIR_MOVE_4( pc, c );
3777            pc += 4;
3778            c += 4;
3779          } while ( pc <= copy_end );
3780          copy_end += 4;
3781          STBIR_NO_UNROLL_LOOP_START
3782          while ( pc < copy_end )
3783          {
3784            STBIR_MOVE_1( pc, c );
3785            ++pc; ++c;
3786          }
3787          coeffs += coefficient_width;
3788        } while ( pc < pc_end );
3789        break;
3790    }
3791  }
3792
3793  // some horizontal routines read one float off the end (which is then masked off), so put in a sentinel so we don't read an snan or denormal
3794  coefficents[ widest * num_contributors ] = 8888.0f;
3795
3796  // the minimum we might read for unrolled filter widths is 12. So, we need to
3797  //   make sure we never read outside the decode buffer, by possibly moving
3798  //   the sample area back into the scanline, and putting zero weights first.
3799  // we start on the right edge and check until we're well past the possible
3800  //   clip area (2*widest).
3801  {
3802    stbir__contributors * contribs = contributors + num_contributors - 1;
3803    float * coeffs = coefficents + widest * ( num_contributors - 1 );
3804
3805    // go until no chance of clipping (this is usually less than 8 loops)
3806    while ( ( contribs >= contributors ) && ( ( contribs->n0 + widest*2 ) >= row_end ) )
3807    {
3808      // might we clip??
3809      if ( ( contribs->n0 + widest ) > row_end )
3810      {
3811        int stop_range = widest;
3812
3813        // if range is larger than 12, it will be handled by generic loops that can terminate on the exact length
3814        //   of this contrib n1, instead of a fixed widest amount - so calculate this
3815        if ( widest > 12 )
3816        {
3817          int mod;
3818
3819          // how far will be read in the n_coeff loop (which depends on the widest count mod4);
3820          mod = widest & 3;
3821          stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod;
3822
3823          // the n_coeff loops do a minimum amount of coeffs, so factor that in!
3824          if ( stop_range < ( 8 + mod ) ) stop_range = 8 + mod;
3825        }
3826
3827        // now see if we still clip with the refined range
3828        if ( ( contribs->n0 + stop_range ) > row_end )
3829        {
3830          int new_n0 = row_end - stop_range;
3831          int num = contribs->n1 - contribs->n0 + 1;
3832          int backup = contribs->n0 - new_n0;
3833          float * from_co = coeffs + num - 1;
3834          float * to_co = from_co + backup;
3835
3836          STBIR_ASSERT( ( new_n0 >= row0 ) && ( new_n0 < contribs->n0 ) );
3837
3838          // move the coeffs over
3839          while( num )
3840          {
3841            *to_co-- = *from_co--;
3842            --num;
3843          }
3844          // zero new positions
3845          while ( to_co >= coeffs )
3846            *to_co-- = 0;
3847          // set new start point
3848          contribs->n0 = new_n0;
3849          if ( widest > 12 )
3850          {
3851            int mod;
3852
3853            // how far will be read in the n_coeff loop (which depends on the widest count mod4);
3854            mod = widest & 3;
3855            stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod;
3856
3857            // the n_coeff loops do a minimum amount of coeffs, so factor that in!
3858            if ( stop_range < ( 8 + mod ) ) stop_range = 8 + mod;
3859          }
3860        }
3861      }
3862      --contribs;
3863      coeffs -= widest;
3864    }
3865  }
3866
3867  return widest;
3868  #undef STBIR_MOVE_1
3869  #undef STBIR_MOVE_2
3870  #undef STBIR_MOVE_4
3871}
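
// Illustrative sketch, not part of the library: the repacking idea from the switch above
//   without the unrolled moves. Rows stored with a stride of width floats are tightened,
//   in place, to a stride of widest floats (widest <= width). The name sketch_pack_rows
//   is hypothetical, and the sentinel and right-edge adjustment are left out.
static void sketch_pack_rows( float * coeffs, int num_rows, int width, int widest )
{
  float * dst = coeffs;
  float const * src = coeffs;
  int r, i;

  for ( r = 0 ; r < num_rows ; r++ )
  {
    // dst never runs ahead of src, so a forward copy cannot overwrite unread weights
    for ( i = 0 ; i < widest ; i++ )
      dst[ i ] = src[ i ];
    dst += widest;
    src += width;
  }
}
// e.g. two rows of width 3 holding { 1, 2, 0 } and { 3, 4, 0 } pack down, with a widest
//   of 2, to the four leading floats { 1, 2, 3, 4 }.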
3872
3873static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * other_axis_for_pivot, void * user_data STBIR_ONLY_PROFILE_BUILD_GET_INFO )
3874{
3875  int n;
3876  float scale = samp->scale_info.scale;
3877  stbir__kernel_callback * kernel = samp->filter_kernel;
3878  stbir__support_callback * support = samp->filter_support;
3879  float inv_scale = samp->scale_info.inv_scale;
3880  int input_full_size = samp->scale_info.input_full_size;
3881  int gather_num_contributors = samp->num_contributors;
3882  stbir__contributors* gather_contributors = samp->contributors;
3883  float * gather_coeffs = samp->coefficients;
3884  int gather_coefficient_width = samp->coefficient_width;
3885
3886  switch ( samp->is_gather )
3887  {
3888    case 1: // gather upsample
3889    {
3890      float out_pixels_radius = support(inv_scale,user_data) * scale;
3891
3892      stbir__calculate_coefficients_for_gather_upsample( out_pixels_radius, kernel, &samp->scale_info, gather_num_contributors, gather_contributors, gather_coeffs, gather_coefficient_width, samp->edge, user_data );
3893
3894      STBIR_PROFILE_BUILD_START( cleanup );
3895      stbir__cleanup_gathered_coefficients( samp->edge, &samp->extent_info, &samp->scale_info, gather_num_contributors, gather_contributors, gather_coeffs, gather_coefficient_width );
3896      STBIR_PROFILE_BUILD_END( cleanup );
3897    }
3898    break;
3899
3900    case 0: // scatter downsample (only on vertical)
3901    case 2: // gather downsample
3902    {
3903      float in_pixels_radius = support(scale,user_data) * inv_scale;
3904      int filter_pixel_margin = samp->filter_pixel_margin;
3905      int input_end = input_full_size + filter_pixel_margin;
3906
3907      // if this is a scatter, we do a downsample gather to get the coeffs, and then pivot after
3908      if ( !samp->is_gather )
3909      {
3910        // check if we are using the same gather downsample on the horizontal as this vertical,
3911        //   if so, then we don't have to generate them, we can just pivot from the horizontal.
3912        if ( other_axis_for_pivot )
3913        {
3914          gather_contributors = other_axis_for_pivot->contributors;
3915          gather_coeffs = other_axis_for_pivot->coefficients;
3916          gather_coefficient_width = other_axis_for_pivot->coefficient_width;
3917          gather_num_contributors = other_axis_for_pivot->num_contributors;
3918          samp->extent_info.lowest = other_axis_for_pivot->extent_info.lowest;
3919          samp->extent_info.highest = other_axis_for_pivot->extent_info.highest;
3920          samp->extent_info.widest = other_axis_for_pivot->extent_info.widest;
3921          goto jump_right_to_pivot;
3922        }
3923
3924        gather_contributors = samp->gather_prescatter_contributors;
3925        gather_coeffs = samp->gather_prescatter_coefficients;
3926        gather_coefficient_width = samp->gather_prescatter_coefficient_width;
3927        gather_num_contributors = samp->gather_prescatter_num_contributors;
3928      }
3929
3930      stbir__calculate_coefficients_for_gather_downsample( -filter_pixel_margin, input_end, in_pixels_radius, kernel, &samp->scale_info, gather_coefficient_width, gather_num_contributors, gather_contributors, gather_coeffs, user_data );
3931
3932      STBIR_PROFILE_BUILD_START( cleanup );
3933      stbir__cleanup_gathered_coefficients( samp->edge, &samp->extent_info, &samp->scale_info, gather_num_contributors, gather_contributors, gather_coeffs, gather_coefficient_width );
3934      STBIR_PROFILE_BUILD_END( cleanup );
3935
3936      if ( !samp->is_gather )
3937      {
3938        // if this is a scatter (vertical only), then we need to pivot the coeffs
3939        stbir__contributors * scatter_contributors;
3940        int highest_set;
3941
3942        jump_right_to_pivot:
3943
3944        STBIR_PROFILE_BUILD_START( pivot );
3945
3946        highest_set = (-filter_pixel_margin) - 1;
3947        for (n = 0; n < gather_num_contributors; n++)
3948        {
3949          int k;
3950          int gn0 = gather_contributors->n0, gn1 = gather_contributors->n1;
3951          int scatter_coefficient_width = samp->coefficient_width;
3952          float * scatter_coeffs = samp->coefficients + ( gn0 + filter_pixel_margin ) * scatter_coefficient_width;
3953          float * g_coeffs = gather_coeffs;
3954          scatter_contributors = samp->contributors + ( gn0 + filter_pixel_margin );
3955
3956          for (k = gn0 ; k <= gn1 ; k++ )
3957          {
3958            float gc = *g_coeffs++;
3959            
3960            // skip zero and denormals - must skip zeros to avoid adding coeffs beyond scatter_coefficient_width
3961            //   (which happens when pivoting from horizontal, which might have dummy zeros)
3962            if ( ( ( gc >= stbir__small_float ) || ( gc <= -stbir__small_float ) ) )
3963            {
3964              if ( ( k > highest_set ) || ( scatter_contributors->n0 > scatter_contributors->n1 ) )
3965              {
3966                {
3967                  // if we are skipping over several contributors, we need to clear the skipped ones
3968                  stbir__contributors * clear_contributors = samp->contributors + ( highest_set + filter_pixel_margin + 1);
3969                  while ( clear_contributors < scatter_contributors )
3970                  {
3971                    clear_contributors->n0 = 0;
3972                    clear_contributors->n1 = -1;
3973                    ++clear_contributors;
3974                  }
3975                }
3976                scatter_contributors->n0 = n;
3977                scatter_contributors->n1 = n;
3978                scatter_coeffs[0]  = gc;
3979                highest_set = k;
3980              }
3981              else
3982              {
3983                stbir__insert_coeff( scatter_contributors, scatter_coeffs, n, gc, scatter_coefficient_width );
3984              }
3985              STBIR_ASSERT( ( scatter_contributors->n1 - scatter_contributors->n0 + 1 ) <= scatter_coefficient_width );
3986            }
3987            ++scatter_contributors;
3988            scatter_coeffs += scatter_coefficient_width;
3989          }
3990
3991          ++gather_contributors;
3992          gather_coeffs += gather_coefficient_width;
3993        }
3994
3995        // now clear any unset contribs
3996        {
3997          stbir__contributors * clear_contributors = samp->contributors + ( highest_set + filter_pixel_margin + 1);
3998          stbir__contributors * end_contributors = samp->contributors + samp->num_contributors;
3999          while ( clear_contributors < end_contributors )
4000          {
4001            clear_contributors->n0 = 0;
4002            clear_contributors->n1 = -1;
4003            ++clear_contributors;
4004          }
4005        }
4006
4007        STBIR_PROFILE_BUILD_END( pivot );
4008      }
4009    }
4010    break;
4011  }
4012}
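
// Illustrative sketch, not part of the library: the gather-to-scatter pivot reduced to its
//   core. Gather rows say, per output pixel, which inputs feed it. Scatter rows say, per
//   input pixel, which outputs it feeds. The names sketch_range and sketch_pivot_to_scatter
//   are hypothetical. It assumes every gather range lies inside [0,num_in) (no filter
//   margin), and a weight that would not fit in scatter_width is simply dropped.
typedef struct { int n0, n1; } sketch_range;

static void sketch_pivot_to_scatter( sketch_range const * gather, float const * gather_coeffs, int gather_width, int num_out, sketch_range * scatter, float * scatter_coeffs, int scatter_width, int num_in )
{
  int n, k, j;

  for ( k = 0 ; k < num_in ; k++ ) // start with every input row empty (n0 > n1)
  {
    scatter[ k ].n0 = 0;
    scatter[ k ].n1 = -1;
  }

  for ( n = 0 ; n < num_out ; n++ )
  {
    for ( k = gather[ n ].n0 ; k <= gather[ n ].n1 ; k++ )
    {
      float gc = gather_coeffs[ n * gather_width + ( k - gather[ n ].n0 ) ];
      float * row = scatter_coeffs + k * scatter_width;

      if ( gc == 0.0f )
        continue; // zero weights would only widen the scatter window

      if ( scatter[ k ].n0 > scatter[ k ].n1 ) // first output to touch this input
      {
        scatter[ k ].n0 = n;
        scatter[ k ].n1 = n;
        row[ 0 ] = gc;
      }
      else // outputs are visited in increasing order, so this is always an append
      {
        int e = n - scatter[ k ].n0;
        if ( e >= scatter_width )
          continue;
        for ( j = ( scatter[ k ].n1 - scatter[ k ].n0 ) + 1 ; j < e ; j++ )
          row[ j ] = 0.0f; // clear any gap between the old end and the new entry
        row[ e ] = gc;
        scatter[ k ].n1 = n;
      }
    }
  }
}
// e.g. with output 0 gathering inputs 0..2 and output 1 gathering inputs 2..4, input 2 is
//   shared, so its scatter row ends up covering outputs 0..1 with one weight from each.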
4013
4014
4015//========================================================================================================
4016// scanline decoders and encoders
4017
4018#define stbir__coder_min_num 1
4019#define STB_IMAGE_RESIZE_DO_CODERS
4020#include STBIR__HEADER_FILENAME
4021
4022#define stbir__decode_suffix BGRA
4023#define stbir__decode_swizzle
4024#define stbir__decode_order0  2
4025#define stbir__decode_order1  1
4026#define stbir__decode_order2  0
4027#define stbir__decode_order3  3
4028#define stbir__encode_order0  2
4029#define stbir__encode_order1  1
4030#define stbir__encode_order2  0
4031#define stbir__encode_order3  3
4032#define stbir__coder_min_num 4
4033#define STB_IMAGE_RESIZE_DO_CODERS
4034#include STBIR__HEADER_FILENAME
4035
4036#define stbir__decode_suffix ARGB
4037#define stbir__decode_swizzle
4038#define stbir__decode_order0  1
4039#define stbir__decode_order1  2
4040#define stbir__decode_order2  3
4041#define stbir__decode_order3  0
4042#define stbir__encode_order0  3
4043#define stbir__encode_order1  0
4044#define stbir__encode_order2  1
4045#define stbir__encode_order3  2
4046#define stbir__coder_min_num 4
4047#define STB_IMAGE_RESIZE_DO_CODERS
4048#include STBIR__HEADER_FILENAME
4049
4050#define stbir__decode_suffix ABGR
4051#define stbir__decode_swizzle
4052#define stbir__decode_order0  3
4053#define stbir__decode_order1  2
4054#define stbir__decode_order2  1
4055#define stbir__decode_order3  0
4056#define stbir__encode_order0  3
4057#define stbir__encode_order1  2
4058#define stbir__encode_order2  1
4059#define stbir__encode_order3  0
4060#define stbir__coder_min_num 4
4061#define STB_IMAGE_RESIZE_DO_CODERS
4062#include STBIR__HEADER_FILENAME
4063
4064#define stbir__decode_suffix AR
4065#define stbir__decode_swizzle
4066#define stbir__decode_order0  1
4067#define stbir__decode_order1  0
4068#define stbir__decode_order2  3
4069#define stbir__decode_order3  2
4070#define stbir__encode_order0  1
4071#define stbir__encode_order1  0
4072#define stbir__encode_order2  3
4073#define stbir__encode_order3  2
4074#define stbir__coder_min_num 2
4075#define STB_IMAGE_RESIZE_DO_CODERS
4076#include STBIR__HEADER_FILENAME
4077
4078
4079// fancy alpha means we expand to keep both premultiplied and non-premultiplied color channels
4080static void stbir__fancy_alpha_weight_4ch( float * out_buffer, int width_times_channels )
4081{
4082  float STBIR_STREAMOUT_PTR(*) out = out_buffer;
4083  float const * end_decode = out_buffer + ( width_times_channels / 4 ) * 7;  // decode buffer aligned to end of out_buffer
4084  float STBIR_STREAMOUT_PTR(*) decode = (float*)end_decode - width_times_channels;
4085
4086  // fancy alpha is stored internally as R G B A Rpm Gpm Bpm
4087
4088  #ifdef STBIR_SIMD
4089
4090  #ifdef STBIR_SIMD8
4091  decode += 16;
4092  STBIR_NO_UNROLL_LOOP_START
4093  while ( decode <= end_decode )
4094  {
4095    stbir__simdf8 d0,d1,a0,a1,p0,p1;
4096    STBIR_NO_UNROLL(decode);
4097    stbir__simdf8_load( d0, decode-16 );
4098    stbir__simdf8_load( d1, decode-16+8 );
4099    stbir__simdf8_0123to33333333( a0, d0 );
4100    stbir__simdf8_0123to33333333( a1, d1 );
4101    stbir__simdf8_mult( p0, a0, d0 );
4102    stbir__simdf8_mult( p1, a1, d1 );
4103    stbir__simdf8_bot4s( a0, d0, p0 );
4104    stbir__simdf8_bot4s( a1, d1, p1 );
4105    stbir__simdf8_top4s( d0, d0, p0 );
4106    stbir__simdf8_top4s( d1, d1, p1 );
4107    stbir__simdf8_store ( out, a0 );
4108    stbir__simdf8_store ( out+7, d0 );
4109    stbir__simdf8_store ( out+14, a1 );
4110    stbir__simdf8_store ( out+21, d1 );
4111    decode += 16;
4112    out += 28;
4113  }
4114  decode -= 16;
4115  #else
4116  decode += 8;
4117  STBIR_NO_UNROLL_LOOP_START
4118  while ( decode <= end_decode )
4119  {
4120    stbir__simdf d0,a0,d1,a1,p0,p1;
4121    STBIR_NO_UNROLL(decode);
4122    stbir__simdf_load( d0, decode-8 );
4123    stbir__simdf_load( d1, decode-8+4 );
4124    stbir__simdf_0123to3333( a0, d0 );
4125    stbir__simdf_0123to3333( a1, d1 );
4126    stbir__simdf_mult( p0, a0, d0 );
4127    stbir__simdf_mult( p1, a1, d1 );
4128    stbir__simdf_store ( out, d0 );
4129    stbir__simdf_store ( out+4, p0 );
4130    stbir__simdf_store ( out+7, d1 );
4131    stbir__simdf_store ( out+7+4, p1 );
4132    decode += 8;
4133    out += 14;
4134  }
4135  decode -= 8;
4136  #endif
4137
4138  // might be one last odd pixel
4139  #ifdef STBIR_SIMD8
4140  STBIR_NO_UNROLL_LOOP_START
4141  while ( decode < end_decode )
4142  #else
4143  if ( decode < end_decode )
4144  #endif
4145  {
4146    stbir__simdf d,a,p;
4147    STBIR_NO_UNROLL(decode);
4148    stbir__simdf_load( d, decode );
4149    stbir__simdf_0123to3333( a, d );
4150    stbir__simdf_mult( p, a, d );
4151    stbir__simdf_store ( out, d );
4152    stbir__simdf_store ( out+4, p );
4153    decode += 4;
4154    out += 7;
4155  }
4156
4157  #else
4158
4159  while( decode < end_decode )
4160  {
4161    float r = decode[0], g = decode[1], b = decode[2], alpha = decode[3];
4162    out[0] = r;
4163    out[1] = g;
4164    out[2] = b;
4165    out[3] = alpha;
4166    out[4] = r * alpha;
4167    out[5] = g * alpha;
4168    out[6] = b * alpha;
4169    out += 7;
4170    decode += 4;
4171  }
4172
4173  #endif
4174}
4175
static void stbir__fancy_alpha_weight_2ch( float * out_buffer, int width_times_channels )
{
  float STBIR_STREAMOUT_PTR(*) out = out_buffer;
  float const * end_decode = out_buffer + ( width_times_channels / 2 ) * 3;
  float STBIR_STREAMOUT_PTR(*) decode = (float*)end_decode - width_times_channels;

  //  for fancy alpha, turns into: [X A Xpm][X A Xpm],etc

  #ifdef STBIR_SIMD

  decode += 8;
  if ( decode <= end_decode )
  {
    STBIR_NO_UNROLL_LOOP_START
    do {
      #ifdef STBIR_SIMD8
      stbir__simdf8 d0,a0,p0;
      STBIR_NO_UNROLL(decode);
      stbir__simdf8_load( d0, decode-8 );
      stbir__simdf8_0123to11331133( p0, d0 );
      stbir__simdf8_0123to00220022( a0, d0 );
      stbir__simdf8_mult( p0, p0, a0 );

      stbir__simdf_store2( out, stbir__if_simdf8_cast_to_simdf4( d0 ) );
      stbir__simdf_store( out+2, stbir__if_simdf8_cast_to_simdf4( p0 ) );
      stbir__simdf_store2h( out+3, stbir__if_simdf8_cast_to_simdf4( d0 ) );

      stbir__simdf_store2( out+6, stbir__simdf8_gettop4( d0 ) );
      stbir__simdf_store( out+8, stbir__simdf8_gettop4( p0 ) );
      stbir__simdf_store2h( out+9, stbir__simdf8_gettop4( d0 ) );
      #else
      stbir__simdf d0,a0,d1,a1,p0,p1;
      STBIR_NO_UNROLL(decode);
      stbir__simdf_load( d0, decode-8 );
      stbir__simdf_load( d1, decode-8+4 );
      stbir__simdf_0123to1133( p0, d0 );
      stbir__simdf_0123to1133( p1, d1 );
      stbir__simdf_0123to0022( a0, d0 );
      stbir__simdf_0123to0022( a1, d1 );
      stbir__simdf_mult( p0, p0, a0 );
      stbir__simdf_mult( p1, p1, a1 );

      stbir__simdf_store2( out, d0 );
      stbir__simdf_store( out+2, p0 );
      stbir__simdf_store2h( out+3, d0 );

      stbir__simdf_store2( out+6, d1 );
      stbir__simdf_store( out+8, p1 );
      stbir__simdf_store2h( out+9, d1 );
      #endif
      decode += 8;
      out += 12;
    } while ( decode <= end_decode );
  }
  decode -= 8;
  #endif

  STBIR_SIMD_NO_UNROLL_LOOP_START
  while( decode < end_decode )
  {
    float x = decode[0], y = decode[1];
    STBIR_SIMD_NO_UNROLL(decode);
    out[0] = x;
    out[1] = y;
    out[2] = x * y;
    out += 3;
    decode += 2;
  }
}

static void stbir__fancy_alpha_unweight_4ch( float * encode_buffer, int width_times_channels )
{
  float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
  float STBIR_SIMD_STREAMOUT_PTR(*) input = encode_buffer;
  float const * end_output = encode_buffer + width_times_channels;

  // fancy RGBA is stored internally as R G B A Rpm Gpm Bpm

  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float alpha = input[3];
#ifdef STBIR_SIMD
    stbir__simdf i,ia;
    STBIR_SIMD_NO_UNROLL(encode);
    if ( alpha < stbir__small_float )
    {
      stbir__simdf_load( i, input );
      stbir__simdf_store( encode, i );
    }
    else
    {
      stbir__simdf_load1frep4( ia, 1.0f / alpha );
      stbir__simdf_load( i, input+4 );
      stbir__simdf_mult( i, i, ia );
      stbir__simdf_store( encode, i );
      encode[3] = alpha;
    }
#else
    if ( alpha < stbir__small_float )
    {
      encode[0] = input[0];
      encode[1] = input[1];
      encode[2] = input[2];
    }
    else
    {
      float ialpha = 1.0f / alpha;
      encode[0] = input[4] * ialpha;
      encode[1] = input[5] * ialpha;
      encode[2] = input[6] * ialpha;
    }
    encode[3] = alpha;
#endif

    input += 7;
    encode += 4;
  } while ( encode < end_output );
}

//  format: [X A Xpm][X A Xpm] etc
static void stbir__fancy_alpha_unweight_2ch( float * encode_buffer, int width_times_channels )
{
  float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
  float STBIR_SIMD_STREAMOUT_PTR(*) input = encode_buffer;
  float const * end_output = encode_buffer + width_times_channels;

  do {
    float alpha = input[1];
    encode[0] = input[0];
    if ( alpha >= stbir__small_float )
      encode[0] = input[2] / alpha;
    encode[1] = alpha;

    input += 3;
    encode += 2;
  } while ( encode < end_output );
}

static void stbir__simple_alpha_weight_4ch( float * decode_buffer, int width_times_channels )
{
  float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
  float const * end_decode = decode_buffer + width_times_channels;

  #ifdef STBIR_SIMD
  {
    decode += 2 * stbir__simdfX_float_count;
    STBIR_NO_UNROLL_LOOP_START
    while ( decode <= end_decode )
    {
      stbir__simdfX d0,a0,d1,a1;
      STBIR_NO_UNROLL(decode);
      stbir__simdfX_load( d0, decode-2*stbir__simdfX_float_count );
      stbir__simdfX_load( d1, decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count );
      stbir__simdfX_aaa1( a0, d0, STBIR_onesX );
      stbir__simdfX_aaa1( a1, d1, STBIR_onesX );
      stbir__simdfX_mult( d0, d0, a0 );
      stbir__simdfX_mult( d1, d1, a1 );
      stbir__simdfX_store ( decode-2*stbir__simdfX_float_count, d0 );
      stbir__simdfX_store ( decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count, d1 );
      decode += 2 * stbir__simdfX_float_count;
    }
    decode -= 2 * stbir__simdfX_float_count;

    // last few remnant pixels
    #ifdef STBIR_SIMD8
    STBIR_NO_UNROLL_LOOP_START
    while ( decode < end_decode )
    #else
    if ( decode < end_decode )
    #endif
    {
      stbir__simdf d,a;
      stbir__simdf_load( d, decode );
      stbir__simdf_aaa1( a, d, STBIR__CONSTF(STBIR_ones) );
      stbir__simdf_mult( d, d, a );
      stbir__simdf_store ( decode, d );
      decode += 4;
    }
  }

  #else

  while( decode < end_decode )
  {
    float alpha = decode[3];
    decode[0] *= alpha;
    decode[1] *= alpha;
    decode[2] *= alpha;
    decode += 4;
  }

  #endif
}

static void stbir__simple_alpha_weight_2ch( float * decode_buffer, int width_times_channels )
{
  float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
  float const * end_decode = decode_buffer + width_times_channels;

  #ifdef STBIR_SIMD
  decode += 2 * stbir__simdfX_float_count;
  STBIR_NO_UNROLL_LOOP_START
  while ( decode <= end_decode )
  {
    stbir__simdfX d0,a0,d1,a1;
    STBIR_NO_UNROLL(decode);
    stbir__simdfX_load( d0, decode-2*stbir__simdfX_float_count );
    stbir__simdfX_load( d1, decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count );
    stbir__simdfX_a1a1( a0, d0, STBIR_onesX );
    stbir__simdfX_a1a1( a1, d1, STBIR_onesX );
    stbir__simdfX_mult( d0, d0, a0 );
    stbir__simdfX_mult( d1, d1, a1 );
    stbir__simdfX_store ( decode-2*stbir__simdfX_float_count, d0 );
    stbir__simdfX_store ( decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count, d1 );
    decode += 2 * stbir__simdfX_float_count;
  }
  decode -= 2 * stbir__simdfX_float_count;
  #endif

  STBIR_SIMD_NO_UNROLL_LOOP_START
  while( decode < end_decode )
  {
    float alpha = decode[1];
    STBIR_SIMD_NO_UNROLL(decode);
    decode[0] *= alpha;
    decode += 2;
  }
}

static void stbir__simple_alpha_unweight_4ch( float * encode_buffer, int width_times_channels )
{
  float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
  float const * end_output = encode_buffer + width_times_channels;

  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float alpha = encode[3];

#ifdef STBIR_SIMD
    stbir__simdf i,ia;
    STBIR_SIMD_NO_UNROLL(encode);
    if ( alpha >= stbir__small_float )
    {
      stbir__simdf_load1frep4( ia, 1.0f / alpha );
      stbir__simdf_load( i, encode );
      stbir__simdf_mult( i, i, ia );
      stbir__simdf_store( encode, i );
      encode[3] = alpha;
    }
#else
    if ( alpha >= stbir__small_float )
    {
      float ialpha = 1.0f / alpha;
      encode[0] *= ialpha;
      encode[1] *= ialpha;
      encode[2] *= ialpha;
    }
#endif
    encode += 4;
  } while ( encode < end_output );
}

static void stbir__simple_alpha_unweight_2ch( float * encode_buffer, int width_times_channels )
{
  float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
  float const * end_output = encode_buffer + width_times_channels;

  do {
    float alpha = encode[1];
    if ( alpha >= stbir__small_float )
      encode[0] /= alpha;
    encode += 2;
  } while ( encode < end_output );
}


// only used in RGB->BGR or BGR->RGB
static void stbir__simple_flip_3ch( float * decode_buffer, int width_times_channels )
{
  float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
  float const * end_decode = decode_buffer + width_times_channels;

#ifdef STBIR_SIMD
    #ifdef stbir__simdf_swiz2 // do we have two argument swizzles?
      end_decode -= 12;
      STBIR_NO_UNROLL_LOOP_START
      while( decode <= end_decode )
      {
        // on arm64 8 instructions, no overlapping stores
        stbir__simdf a,b,c,na,nb;
        STBIR_SIMD_NO_UNROLL(decode);
        stbir__simdf_load( a, decode );
        stbir__simdf_load( b, decode+4 );
        stbir__simdf_load( c, decode+8 );

        na = stbir__simdf_swiz2( a, b, 2, 1, 0, 5 );
        b  = stbir__simdf_swiz2( a, b, 4, 3, 6, 7 );
        nb = stbir__simdf_swiz2( b, c, 0, 1, 4, 3 );
        c  = stbir__simdf_swiz2( b, c, 2, 7, 6, 5 );

        stbir__simdf_store( decode, na );
        stbir__simdf_store( decode+4, nb );
        stbir__simdf_store( decode+8, c );
        decode += 12;
      }
      end_decode += 12;
    #else
      end_decode -= 24;
      STBIR_NO_UNROLL_LOOP_START
      while( decode <= end_decode )
      {
        // 26 instructions on x64
        stbir__simdf a,b,c,d,e,f,g;
        float i21, i23;
        STBIR_SIMD_NO_UNROLL(decode);
        stbir__simdf_load( a, decode );
        stbir__simdf_load( b, decode+3 );
        stbir__simdf_load( c, decode+6 );
        stbir__simdf_load( d, decode+9 );
        stbir__simdf_load( e, decode+12 );
        stbir__simdf_load( f, decode+15 );
        stbir__simdf_load( g, decode+18 );

        a = stbir__simdf_swiz( a, 2, 1, 0, 3 );
        b = stbir__simdf_swiz( b, 2, 1, 0, 3 );
        c = stbir__simdf_swiz( c, 2, 1, 0, 3 );
        d = stbir__simdf_swiz( d, 2, 1, 0, 3 );
        e = stbir__simdf_swiz( e, 2, 1, 0, 3 );
        f = stbir__simdf_swiz( f, 2, 1, 0, 3 );
        g = stbir__simdf_swiz( g, 2, 1, 0, 3 );

        // stores overlap, so they need to be done in order
        stbir__simdf_store( decode,    a );
        i21 = decode[21];
        stbir__simdf_store( decode+3,  b );
        i23 = decode[23];
        stbir__simdf_store( decode+6,  c );
        stbir__simdf_store( decode+9,  d );
        stbir__simdf_store( decode+12, e );
        stbir__simdf_store( decode+15, f );
        stbir__simdf_store( decode+18, g );
        decode[21] = i23;
        decode[23] = i21;
        decode += 24;
      }
      end_decode += 24;
    #endif
#else
  end_decode -= 12;
  STBIR_NO_UNROLL_LOOP_START
  while( decode <= end_decode )
  {
    // 16 instructions
    float t0,t1,t2,t3;
    STBIR_NO_UNROLL(decode);
    t0 = decode[0]; t1 = decode[3]; t2 = decode[6]; t3 = decode[9];
    decode[0] = decode[2]; decode[3] = decode[5]; decode[6] = decode[8]; decode[9] = decode[11];
    decode[2] = t0; decode[5] = t1; decode[8] = t2; decode[11] = t3;
    decode += 12;
  }
  end_decode += 12;
#endif

  STBIR_NO_UNROLL_LOOP_START
  while( decode < end_decode )
  {
    float t = decode[0];
    STBIR_NO_UNROLL(decode);
    decode[0] = decode[2];
    decode[2] = t;
    decode += 3;
  }
}



static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float * output_buffer STBIR_ONLY_PROFILE_GET_SPLIT_INFO )
{
  int channels = stbir_info->channels;
  int effective_channels = stbir_info->effective_channels;
  int input_sample_in_bytes = stbir__type_size[stbir_info->input_type] * channels;
  stbir_edge edge_horizontal = stbir_info->horizontal.edge;
  stbir_edge edge_vertical = stbir_info->vertical.edge;
  int row = stbir__edge_wrap(edge_vertical, n, stbir_info->vertical.scale_info.input_full_size);
  const void* input_plane_data = ( (char *) stbir_info->input_data ) + (size_t)row * (size_t) stbir_info->input_stride_bytes;
  stbir__span const * spans = stbir_info->scanline_extents.spans;
  float* full_decode_buffer = output_buffer - stbir_info->scanline_extents.conservative.n0 * effective_channels;

  // if we are on edge_zero, and we get in here with an out-of-bounds n, then the filter calculation has failed
  STBIR_ASSERT( !(edge_vertical == STBIR_EDGE_ZERO && (n < 0 || n >= stbir_info->vertical.scale_info.input_full_size)) );

  do
  {
    float * decode_buffer;
    void const * input_data;
    float * end_decode;
    int width_times_channels;
    int width;

    if ( spans->n1 < spans->n0 )
      break;

    width = spans->n1 + 1 - spans->n0;
    decode_buffer = full_decode_buffer + spans->n0 * effective_channels;
    end_decode = full_decode_buffer + ( spans->n1 + 1 ) * effective_channels;
    width_times_channels = width * channels;

    // read directly out of input plane by default
    input_data = ( (char*)input_plane_data ) + spans->pixel_offset_for_input * input_sample_in_bytes;

    // if we have an input callback, call it to get the input data
    if ( stbir_info->in_pixels_cb )
    {
      // call the callback with a temp buffer (that they can choose to use or not).  the temp is just right-aligned memory in the decode_buffer itself
      input_data = stbir_info->in_pixels_cb( ( (char*) end_decode ) - ( width * input_sample_in_bytes ), input_plane_data, width, spans->pixel_offset_for_input, row, stbir_info->user_data );
    }

    STBIR_PROFILE_START( decode );
    // convert the pixels into the float decode_buffer (we index from end_decode, so that when channels<effective_channels, we are right justified in the buffer)
    stbir_info->decode_pixels( (float*)end_decode - width_times_channels, width_times_channels, input_data );
    STBIR_PROFILE_END( decode );

    if (stbir_info->alpha_weight)
    {
      STBIR_PROFILE_START( alpha );
      stbir_info->alpha_weight( decode_buffer, width_times_channels );
      STBIR_PROFILE_END( alpha );
    }

    ++spans;
  } while ( spans <= ( &stbir_info->scanline_extents.spans[1] ) );

  // handle the edge_wrap filter (all other types are handled back out at the calculate_filter stage)
  // basically the idea here is that if we have the whole scanline in memory, we don't redecode the
  //   wrapped edge pixels, and instead just memcpy them from the scanline into the edge positions
  if ( ( edge_horizontal == STBIR_EDGE_WRAP ) && ( stbir_info->scanline_extents.edge_sizes[0] | stbir_info->scanline_extents.edge_sizes[1] ) )
  {
    // this code only runs if we're in edge_wrap, and we're doing the entire scanline
    int e, start_x[2];
    int input_full_size = stbir_info->horizontal.scale_info.input_full_size;

    start_x[0] = -stbir_info->scanline_extents.edge_sizes[0];  // left edge start x
    start_x[1] =  input_full_size;                             // right edge

    for( e = 0; e < 2 ; e++ )
    {
      // do each margin
      int margin = stbir_info->scanline_extents.edge_sizes[e];
      if ( margin )
      {
        int x = start_x[e];
        float * marg = full_decode_buffer + x * effective_channels;
        float const * src = full_decode_buffer + stbir__edge_wrap(edge_horizontal, x, input_full_size) * effective_channels;
        STBIR_MEMCPY( marg, src, margin * effective_channels * sizeof(float) );
      }
    }
  }
}


//=================
// Do 1 channel horizontal routines

#ifdef STBIR_SIMD

#define stbir__1_coeff_only()          \
    stbir__simdf tot,c;                \
    STBIR_SIMD_NO_UNROLL(decode);      \
    stbir__simdf_load1( c, hc );       \
    stbir__simdf_mult1_mem( tot, c, decode );

#define stbir__2_coeff_only()          \
    stbir__simdf tot,c,d;              \
    STBIR_SIMD_NO_UNROLL(decode);      \
    stbir__simdf_load2z( c, hc );      \
    stbir__simdf_load2( d, decode );   \
    stbir__simdf_mult( tot, c, d );    \
    stbir__simdf_0123to1230( c, tot ); \
    stbir__simdf_add1( tot, tot, c );

#define stbir__3_coeff_only()                  \
    stbir__simdf tot,c,t;                      \
    STBIR_SIMD_NO_UNROLL(decode);              \
    stbir__simdf_load( c, hc );                \
    stbir__simdf_mult_mem( tot, c, decode );   \
    stbir__simdf_0123to1230( c, tot );         \
    stbir__simdf_0123to2301( t, tot );         \
    stbir__simdf_add1( tot, tot, c );          \
    stbir__simdf_add1( tot, tot, t );

#define stbir__store_output_tiny()                \
    stbir__simdf_store1( output, tot );           \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 1;

#define stbir__4_coeff_start()                 \
    stbir__simdf tot,c;                        \
    STBIR_SIMD_NO_UNROLL(decode);              \
    stbir__simdf_load( c, hc );                \
    stbir__simdf_mult_mem( tot, c, decode );   \

#define stbir__4_coeff_continue_from_4( ofs )  \
    STBIR_SIMD_NO_UNROLL(decode);              \
    stbir__simdf_load( c, hc + (ofs) );        \
    stbir__simdf_madd_mem( tot, tot, c, decode+(ofs) );

#define stbir__1_coeff_remnant( ofs )          \
    { stbir__simdf d;                          \
    stbir__simdf_load1z( c, hc + (ofs) );      \
    stbir__simdf_load1( d, decode + (ofs) );   \
    stbir__simdf_madd( tot, tot, d, c ); }

#define stbir__2_coeff_remnant( ofs )          \
    { stbir__simdf d;                          \
    stbir__simdf_load2z( c, hc+(ofs) );        \
    stbir__simdf_load2( d, decode+(ofs) );     \
    stbir__simdf_madd( tot, tot, d, c ); }

#define stbir__3_coeff_setup()                 \
    stbir__simdf mask;                         \
    stbir__simdf_load( mask, STBIR_mask + 3 );

#define stbir__3_coeff_remnant( ofs )                  \
    stbir__simdf_load( c, hc+(ofs) );                  \
    stbir__simdf_and( c, c, mask );                    \
    stbir__simdf_madd_mem( tot, tot, c, decode+(ofs) );

#define stbir__store_output()                     \
    stbir__simdf_0123to2301( c, tot );            \
    stbir__simdf_add( tot, tot, c );              \
    stbir__simdf_0123to1230( c, tot );            \
    stbir__simdf_add1( tot, tot, c );             \
    stbir__simdf_store1( output, tot );           \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 1;

#else

#define stbir__1_coeff_only()  \
    float tot;                 \
    tot = decode[0]*hc[0];

#define stbir__2_coeff_only()  \
    float tot;                 \
    tot = decode[0] * hc[0];   \
    tot += decode[1] * hc[1];

#define stbir__3_coeff_only()  \
    float tot;                 \
    tot = decode[0] * hc[0];   \
    tot += decode[1] * hc[1];  \
    tot += decode[2] * hc[2];

#define stbir__store_output_tiny()                \
    output[0] = tot;                              \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 1;

#define stbir__4_coeff_start()  \
    float tot0,tot1,tot2,tot3;  \
    tot0 = decode[0] * hc[0];   \
    tot1 = decode[1] * hc[1];   \
    tot2 = decode[2] * hc[2];   \
    tot3 = decode[3] * hc[3];

#define stbir__4_coeff_continue_from_4( ofs )  \
    tot0 += decode[0+(ofs)] * hc[0+(ofs)];     \
    tot1 += decode[1+(ofs)] * hc[1+(ofs)];     \
    tot2 += decode[2+(ofs)] * hc[2+(ofs)];     \
    tot3 += decode[3+(ofs)] * hc[3+(ofs)];

#define stbir__1_coeff_remnant( ofs )        \
    tot0 += decode[0+(ofs)] * hc[0+(ofs)];

#define stbir__2_coeff_remnant( ofs )        \
    tot0 += decode[0+(ofs)] * hc[0+(ofs)];   \
    tot1 += decode[1+(ofs)] * hc[1+(ofs)];   \

#define stbir__3_coeff_remnant( ofs )        \
    tot0 += decode[0+(ofs)] * hc[0+(ofs)];   \
    tot1 += decode[1+(ofs)] * hc[1+(ofs)];   \
    tot2 += decode[2+(ofs)] * hc[2+(ofs)];

#define stbir__store_output()                     \
    output[0] = (tot0+tot2)+(tot1+tot3);          \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 1;

#endif

#define STBIR__horizontal_channels 1
#define STB_IMAGE_RESIZE_DO_HORIZONTALS
#include STBIR__HEADER_FILENAME


//=================
// Do 2 channel horizontal routines

#ifdef STBIR_SIMD

#define stbir__1_coeff_only()         \
    stbir__simdf tot,c,d;             \
    STBIR_SIMD_NO_UNROLL(decode);     \
    stbir__simdf_load1z( c, hc );     \
    stbir__simdf_0123to0011( c, c );  \
    stbir__simdf_load2( d, decode );  \
    stbir__simdf_mult( tot, d, c );

#define stbir__2_coeff_only()         \
    stbir__simdf tot,c;               \
    STBIR_SIMD_NO_UNROLL(decode);     \
    stbir__simdf_load2( c, hc );      \
    stbir__simdf_0123to0011( c, c );  \
    stbir__simdf_mult_mem( tot, c, decode );

#define stbir__3_coeff_only()                \
    stbir__simdf tot,c,cs,d;                 \
    STBIR_SIMD_NO_UNROLL(decode);            \
    stbir__simdf_load( cs, hc );             \
    stbir__simdf_0123to0011( c, cs );        \
    stbir__simdf_mult_mem( tot, c, decode ); \
    stbir__simdf_0123to2222( c, cs );        \
    stbir__simdf_load2z( d, decode+4 );      \
    stbir__simdf_madd( tot, tot, d, c );

#define stbir__store_output_tiny()                \
    stbir__simdf_0123to2301( c, tot );            \
    stbir__simdf_add( tot, tot, c );              \
    stbir__simdf_store2( output, tot );           \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 2;

#ifdef STBIR_SIMD8

#define stbir__4_coeff_start()                    \
    stbir__simdf8 tot0,c,cs;                      \
    STBIR_SIMD_NO_UNROLL(decode);                 \
    stbir__simdf8_load4b( cs, hc );               \
    stbir__simdf8_0123to00112233( c, cs );        \
    stbir__simdf8_mult_mem( tot0, c, decode );

#define stbir__4_coeff_continue_from_4( ofs )        \
    STBIR_SIMD_NO_UNROLL(decode);                    \
    stbir__simdf8_load4b( cs, hc + (ofs) );          \
    stbir__simdf8_0123to00112233( c, cs );           \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*2 );

#define stbir__1_coeff_remnant( ofs )                \
    { stbir__simdf t,d;                              \
    stbir__simdf_load1z( t, hc + (ofs) );            \
    stbir__simdf_load2( d, decode + (ofs) * 2 );     \
    stbir__simdf_0123to0011( t, t );                 \
    stbir__simdf_mult( t, t, d );                    \
    stbir__simdf8_add4( tot0, tot0, t ); }

#define stbir__2_coeff_remnant( ofs )                \
    { stbir__simdf t;                                \
    stbir__simdf_load2( t, hc + (ofs) );             \
    stbir__simdf_0123to0011( t, t );                 \
    stbir__simdf_mult_mem( t, t, decode+(ofs)*2 );   \
    stbir__simdf8_add4( tot0, tot0, t ); }

#define stbir__3_coeff_remnant( ofs )                \
    { stbir__simdf8 d;                               \
    stbir__simdf8_load4b( cs, hc + (ofs) );          \
    stbir__simdf8_0123to00112233( c, cs );           \
    stbir__simdf8_load6z( d, decode+(ofs)*2 );       \
    stbir__simdf8_madd( tot0, tot0, c, d ); }

#define stbir__store_output()                     \
    { stbir__simdf t,d;                           \
    stbir__simdf8_add4halves( t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0 );    \
    stbir__simdf_0123to2301( d, t );              \
    stbir__simdf_add( t, t, d );                  \
    stbir__simdf_store2( output, t );             \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 2; }

#else

#define stbir__4_coeff_start()                   \
    stbir__simdf tot0,tot1,c,cs;                 \
    STBIR_SIMD_NO_UNROLL(decode);                \
    stbir__simdf_load( cs, hc );                 \
    stbir__simdf_0123to0011( c, cs );            \
    stbir__simdf_mult_mem( tot0, c, decode );    \
    stbir__simdf_0123to2233( c, cs );            \
    stbir__simdf_mult_mem( tot1, c, decode+4 );

#define stbir__4_coeff_continue_from_4( ofs )                \
    STBIR_SIMD_NO_UNROLL(decode);                            \
    stbir__simdf_load( cs, hc + (ofs) );                     \
    stbir__simdf_0123to0011( c, cs );                        \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 );  \
    stbir__simdf_0123to2233( c, cs );                        \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*2+4 );

#define stbir__1_coeff_remnant( ofs )            \
    { stbir__simdf d;                            \
    stbir__simdf_load1z( cs, hc + (ofs) );       \
    stbir__simdf_0123to0011( c, cs );            \
    stbir__simdf_load2( d, decode + (ofs) * 2 ); \
    stbir__simdf_madd( tot0, tot0, d, c ); }

#define stbir__2_coeff_remnant( ofs )                      \
    stbir__simdf_load2( cs, hc + (ofs) );                  \
    stbir__simdf_0123to0011( c, cs );                      \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 );

#define stbir__3_coeff_remnant( ofs )                       \
    { stbir__simdf d;                                       \
    stbir__simdf_load( cs, hc + (ofs) );                    \
    stbir__simdf_0123to0011( c, cs );                       \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); \
    stbir__simdf_0123to2222( c, cs );                       \
    stbir__simdf_load2z( d, decode + (ofs) * 2 + 4 );       \
    stbir__simdf_madd( tot1, tot1, d, c ); }

#define stbir__store_output()                     \
    stbir__simdf_add( tot0, tot0, tot1 );         \
    stbir__simdf_0123to2301( c, tot0 );           \
    stbir__simdf_add( tot0, tot0, c );            \
    stbir__simdf_store2( output, tot0 );          \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 2;

#endif

4911#else
4912
4913#define stbir__1_coeff_only()  \
4914    float tota,totb,c;         \
4915    c = hc[0];                 \
4916    tota = decode[0]*c;        \
4917    totb = decode[1]*c;
4918
4919#define stbir__2_coeff_only()  \
4920    float tota,totb,c;         \
4921    c = hc[0];                 \
4922    tota = decode[0]*c;        \
4923    totb = decode[1]*c;        \
4924    c = hc[1];                 \
4925    tota += decode[2]*c;       \
4926    totb += decode[3]*c;
4927
4928// this weird order of add matches the simd
#define stbir__3_coeff_only()  \
    float tota,totb,c;         \
    c = hc[0];                 \
    tota = decode[0]*c;        \
    totb = decode[1]*c;        \
    c = hc[2];                 \
    tota += decode[4]*c;       \
    totb += decode[5]*c;       \
    c = hc[1];                 \
    tota += decode[2]*c;       \
    totb += decode[3]*c;

#define stbir__store_output_tiny()                \
    output[0] = tota;                             \
    output[1] = totb;                             \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 2;

#define stbir__4_coeff_start()      \
    float tota0,tota1,tota2,tota3,totb0,totb1,totb2,totb3,c;  \
    c = hc[0];                      \
    tota0 = decode[0]*c;            \
    totb0 = decode[1]*c;            \
    c = hc[1];                      \
    tota1 = decode[2]*c;            \
    totb1 = decode[3]*c;            \
    c = hc[2];                      \
    tota2 = decode[4]*c;            \
    totb2 = decode[5]*c;            \
    c = hc[3];                      \
    tota3 = decode[6]*c;            \
    totb3 = decode[7]*c;

#define stbir__4_coeff_continue_from_4( ofs )  \
    c = hc[0+(ofs)];                           \
    tota0 += decode[0+(ofs)*2]*c;              \
    totb0 += decode[1+(ofs)*2]*c;              \
    c = hc[1+(ofs)];                           \
    tota1 += decode[2+(ofs)*2]*c;              \
    totb1 += decode[3+(ofs)*2]*c;              \
    c = hc[2+(ofs)];                           \
    tota2 += decode[4+(ofs)*2]*c;              \
    totb2 += decode[5+(ofs)*2]*c;              \
    c = hc[3+(ofs)];                           \
    tota3 += decode[6+(ofs)*2]*c;              \
    totb3 += decode[7+(ofs)*2]*c;

#define stbir__1_coeff_remnant( ofs )  \
    c = hc[0+(ofs)];                   \
    tota0 += decode[0+(ofs)*2] * c;    \
    totb0 += decode[1+(ofs)*2] * c;

#define stbir__2_coeff_remnant( ofs )  \
    c = hc[0+(ofs)];                   \
    tota0 += decode[0+(ofs)*2] * c;    \
    totb0 += decode[1+(ofs)*2] * c;    \
    c = hc[1+(ofs)];                   \
    tota1 += decode[2+(ofs)*2] * c;    \
    totb1 += decode[3+(ofs)*2] * c;

#define stbir__3_coeff_remnant( ofs )  \
    c = hc[0+(ofs)];                   \
    tota0 += decode[0+(ofs)*2] * c;    \
    totb0 += decode[1+(ofs)*2] * c;    \
    c = hc[1+(ofs)];                   \
    tota1 += decode[2+(ofs)*2] * c;    \
    totb1 += decode[3+(ofs)*2] * c;    \
    c = hc[2+(ofs)];                   \
    tota2 += decode[4+(ofs)*2] * c;    \
    totb2 += decode[5+(ofs)*2] * c;

#define stbir__store_output()                     \
    output[0] = (tota0+tota2)+(tota1+tota3);      \
    output[1] = (totb0+totb2)+(totb1+totb3);      \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 2;

#endif

#define STBIR__horizontal_channels 2
#define STB_IMAGE_RESIZE_DO_HORIZONTALS
#include STBIR__HEADER_FILENAME


//=================
// Do 3 channel horizontal routines

#ifdef STBIR_SIMD

#define stbir__1_coeff_only()         \
    stbir__simdf tot,c,d;             \
    STBIR_SIMD_NO_UNROLL(decode);     \
    stbir__simdf_load1z( c, hc );     \
    stbir__simdf_0123to0001( c, c );  \
    stbir__simdf_load( d, decode );   \
    stbir__simdf_mult( tot, d, c );

#define stbir__2_coeff_only()         \
    stbir__simdf tot,c,cs,d;          \
    STBIR_SIMD_NO_UNROLL(decode);     \
    stbir__simdf_load2( cs, hc );     \
    stbir__simdf_0123to0000( c, cs ); \
    stbir__simdf_load( d, decode );   \
    stbir__simdf_mult( tot, d, c );   \
    stbir__simdf_0123to1111( c, cs ); \
    stbir__simdf_load( d, decode+3 ); \
    stbir__simdf_madd( tot, tot, d, c );

#define stbir__3_coeff_only()            \
    stbir__simdf tot,c,d,cs;             \
    STBIR_SIMD_NO_UNROLL(decode);        \
    stbir__simdf_load( cs, hc );         \
    stbir__simdf_0123to0000( c, cs );    \
    stbir__simdf_load( d, decode );      \
    stbir__simdf_mult( tot, d, c );      \
    stbir__simdf_0123to1111( c, cs );    \
    stbir__simdf_load( d, decode+3 );    \
    stbir__simdf_madd( tot, tot, d, c ); \
    stbir__simdf_0123to2222( c, cs );    \
    stbir__simdf_load( d, decode+6 );    \
    stbir__simdf_madd( tot, tot, d, c );

#define stbir__store_output_tiny()                \
    stbir__simdf_store2( output, tot );           \
    stbir__simdf_0123to2301( tot, tot );          \
    stbir__simdf_store1( output+2, tot );         \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 3;

#ifdef STBIR_SIMD8

// note: we load from decode - 1 so that the XXX and YYY pixels of the packed XXXYYY data land in different halves of the AVX register
#define stbir__4_coeff_start()                     \
    stbir__simdf8 tot0,tot1,c,cs; stbir__simdf t;  \
    STBIR_SIMD_NO_UNROLL(decode);                  \
    stbir__simdf8_load4b( cs, hc );                \
    stbir__simdf8_0123to00001111( c, cs );         \
    stbir__simdf8_mult_mem( tot0, c, decode - 1 ); \
    stbir__simdf8_0123to22223333( c, cs );         \
    stbir__simdf8_mult_mem( tot1, c, decode+6 - 1 );

#define stbir__4_coeff_continue_from_4( ofs )      \
    STBIR_SIMD_NO_UNROLL(decode);                  \
    stbir__simdf8_load4b( cs, hc + (ofs) );        \
    stbir__simdf8_0123to00001111( c, cs );         \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); \
    stbir__simdf8_0123to22223333( c, cs );         \
    stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*3 + 6 - 1 );

#define stbir__1_coeff_remnant( ofs )                          \
    STBIR_SIMD_NO_UNROLL(decode);                              \
    stbir__simdf_load1rep4( t, hc + (ofs) );                   \
    stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*3 - 1 );

#define stbir__2_coeff_remnant( ofs )                          \
    STBIR_SIMD_NO_UNROLL(decode);                              \
    stbir__simdf8_load4b( cs, hc + (ofs) - 2 );                \
    stbir__simdf8_0123to22223333( c, cs );                     \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 );

#define stbir__3_coeff_remnant( ofs )                            \
    STBIR_SIMD_NO_UNROLL(decode);                                \
    stbir__simdf8_load4b( cs, hc + (ofs) );                      \
    stbir__simdf8_0123to00001111( c, cs );                       \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); \
    stbir__simdf8_0123to2222( t, cs );                           \
    stbir__simdf8_madd_mem4( tot1, tot1, t, decode+(ofs)*3 + 6 - 1 );

#define stbir__store_output()                       \
    stbir__simdf8_add( tot0, tot0, tot1 );          \
    stbir__simdf_0123to1230( t, stbir__if_simdf8_cast_to_simdf4( tot0 ) ); \
    stbir__simdf8_add4halves( t, t, tot0 );         \
    horizontal_coefficients += coefficient_width;   \
    ++horizontal_contributors;                      \
    output += 3;                                    \
    if ( output < output_end )                      \
    {                                               \
      stbir__simdf_store( output-3, t );            \
      continue;                                     \
    }                                               \
    { stbir__simdf tt; stbir__simdf_0123to2301( tt, t ); \
    stbir__simdf_store2( output-3, t );             \
    stbir__simdf_store1( output+2-3, tt ); }        \
    break;


#else

#define stbir__4_coeff_start()                  \
    stbir__simdf tot0,tot1,tot2,c,cs;           \
    STBIR_SIMD_NO_UNROLL(decode);               \
    stbir__simdf_load( cs, hc );                \
    stbir__simdf_0123to0001( c, cs );           \
    stbir__simdf_mult_mem( tot0, c, decode );   \
    stbir__simdf_0123to1122( c, cs );           \
    stbir__simdf_mult_mem( tot1, c, decode+4 ); \
    stbir__simdf_0123to2333( c, cs );           \
    stbir__simdf_mult_mem( tot2, c, decode+8 );

#define stbir__4_coeff_continue_from_4( ofs )                 \
    STBIR_SIMD_NO_UNROLL(decode);                             \
    stbir__simdf_load( cs, hc + (ofs) );                      \
    stbir__simdf_0123to0001( c, cs );                         \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 );   \
    stbir__simdf_0123to1122( c, cs );                         \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*3+4 ); \
    stbir__simdf_0123to2333( c, cs );                         \
    stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*3+8 );

#define stbir__1_coeff_remnant( ofs )         \
    STBIR_SIMD_NO_UNROLL(decode);             \
    stbir__simdf_load1z( c, hc + (ofs) );     \
    stbir__simdf_0123to0001( c, c );          \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 );

#define stbir__2_coeff_remnant( ofs )                       \
    { stbir__simdf d;                                       \
    STBIR_SIMD_NO_UNROLL(decode);                           \
    stbir__simdf_load2z( cs, hc + (ofs) );                  \
    stbir__simdf_0123to0001( c, cs );                       \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); \
    stbir__simdf_0123to1122( c, cs );                       \
    stbir__simdf_load2z( d, decode+(ofs)*3+4 );             \
    stbir__simdf_madd( tot1, tot1, c, d ); }

#define stbir__3_coeff_remnant( ofs )                         \
    { stbir__simdf d;                                         \
    STBIR_SIMD_NO_UNROLL(decode);                             \
    stbir__simdf_load( cs, hc + (ofs) );                      \
    stbir__simdf_0123to0001( c, cs );                         \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 );   \
    stbir__simdf_0123to1122( c, cs );                         \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*3+4 ); \
    stbir__simdf_0123to2222( c, cs );                         \
    stbir__simdf_load1z( d, decode+(ofs)*3+8 );               \
    stbir__simdf_madd( tot2, tot2, c, d );  }

#define stbir__store_output()                       \
    stbir__simdf_0123ABCDto3ABx( c, tot0, tot1 );   \
    stbir__simdf_0123ABCDto23Ax( cs, tot1, tot2 );  \
    stbir__simdf_0123to1230( tot2, tot2 );          \
    stbir__simdf_add( tot0, tot0, cs );             \
    stbir__simdf_add( c, c, tot2 );                 \
    stbir__simdf_add( tot0, tot0, c );              \
    horizontal_coefficients += coefficient_width;   \
    ++horizontal_contributors;                      \
    output += 3;                                    \
    if ( output < output_end )                      \
    {                                               \
      stbir__simdf_store( output-3, tot0 );         \
      continue;                                     \
    }                                               \
    stbir__simdf_0123to2301( tot1, tot0 );          \
    stbir__simdf_store2( output-3, tot0 );          \
    stbir__simdf_store1( output+2-3, tot1 );        \
    break;

#endif

#else

#define stbir__1_coeff_only()  \
    float tot0, tot1, tot2, c; \
    c = hc[0];                 \
    tot0 = decode[0]*c;        \
    tot1 = decode[1]*c;        \
    tot2 = decode[2]*c;

#define stbir__2_coeff_only()  \
    float tot0, tot1, tot2, c; \
    c = hc[0];                 \
    tot0 = decode[0]*c;        \
    tot1 = decode[1]*c;        \
    tot2 = decode[2]*c;        \
    c = hc[1];                 \
    tot0 += decode[3]*c;       \
    tot1 += decode[4]*c;       \
    tot2 += decode[5]*c;

#define stbir__3_coeff_only()  \
    float tot0, tot1, tot2, c; \
    c = hc[0];                 \
    tot0 = decode[0]*c;        \
    tot1 = decode[1]*c;        \
    tot2 = decode[2]*c;        \
    c = hc[1];                 \
    tot0 += decode[3]*c;       \
    tot1 += decode[4]*c;       \
    tot2 += decode[5]*c;       \
    c = hc[2];                 \
    tot0 += decode[6]*c;       \
    tot1 += decode[7]*c;       \
    tot2 += decode[8]*c;

#define stbir__store_output_tiny()                \
    output[0] = tot0;                             \
    output[1] = tot1;                             \
    output[2] = tot2;                             \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 3;

#define stbir__4_coeff_start()      \
    float tota0,tota1,tota2,totb0,totb1,totb2,totc0,totc1,totc2,totd0,totd1,totd2,c;  \
    c = hc[0];                      \
    tota0 = decode[0]*c;            \
    tota1 = decode[1]*c;            \
    tota2 = decode[2]*c;            \
    c = hc[1];                      \
    totb0 = decode[3]*c;            \
    totb1 = decode[4]*c;            \
    totb2 = decode[5]*c;            \
    c = hc[2];                      \
    totc0 = decode[6]*c;            \
    totc1 = decode[7]*c;            \
    totc2 = decode[8]*c;            \
    c = hc[3];                      \
    totd0 = decode[9]*c;            \
    totd1 = decode[10]*c;           \
    totd2 = decode[11]*c;

#define stbir__4_coeff_continue_from_4( ofs )  \
    c = hc[0+(ofs)];                           \
    tota0 += decode[0+(ofs)*3]*c;              \
    tota1 += decode[1+(ofs)*3]*c;              \
    tota2 += decode[2+(ofs)*3]*c;              \
    c = hc[1+(ofs)];                           \
    totb0 += decode[3+(ofs)*3]*c;              \
    totb1 += decode[4+(ofs)*3]*c;              \
    totb2 += decode[5+(ofs)*3]*c;              \
    c = hc[2+(ofs)];                           \
    totc0 += decode[6+(ofs)*3]*c;              \
    totc1 += decode[7+(ofs)*3]*c;              \
    totc2 += decode[8+(ofs)*3]*c;              \
    c = hc[3+(ofs)];                           \
    totd0 += decode[9+(ofs)*3]*c;              \
    totd1 += decode[10+(ofs)*3]*c;             \
    totd2 += decode[11+(ofs)*3]*c;

#define stbir__1_coeff_remnant( ofs )  \
    c = hc[0+(ofs)];                   \
    tota0 += decode[0+(ofs)*3]*c;      \
    tota1 += decode[1+(ofs)*3]*c;      \
    tota2 += decode[2+(ofs)*3]*c;

#define stbir__2_coeff_remnant( ofs )  \
    c = hc[0+(ofs)];                   \
    tota0 += decode[0+(ofs)*3]*c;      \
    tota1 += decode[1+(ofs)*3]*c;      \
    tota2 += decode[2+(ofs)*3]*c;      \
    c = hc[1+(ofs)];                   \
    totb0 += decode[3+(ofs)*3]*c;      \
    totb1 += decode[4+(ofs)*3]*c;      \
    totb2 += decode[5+(ofs)*3]*c;

#define stbir__3_coeff_remnant( ofs )  \
    c = hc[0+(ofs)];                   \
    tota0 += decode[0+(ofs)*3]*c;      \
    tota1 += decode[1+(ofs)*3]*c;      \
    tota2 += decode[2+(ofs)*3]*c;      \
    c = hc[1+(ofs)];                   \
    totb0 += decode[3+(ofs)*3]*c;      \
    totb1 += decode[4+(ofs)*3]*c;      \
    totb2 += decode[5+(ofs)*3]*c;      \
    c = hc[2+(ofs)];                   \
    totc0 += decode[6+(ofs)*3]*c;      \
    totc1 += decode[7+(ofs)*3]*c;      \
    totc2 += decode[8+(ofs)*3]*c;

#define stbir__store_output()                     \
    output[0] = (tota0+totc0)+(totb0+totd0);      \
    output[1] = (tota1+totc1)+(totb1+totd1);      \
    output[2] = (tota2+totc2)+(totb2+totd2);      \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 3;

#endif

#define STBIR__horizontal_channels 3
#define STB_IMAGE_RESIZE_DO_HORIZONTALS
#include STBIR__HEADER_FILENAME

//=================
// Do 4 channel horizontal routines

#ifdef STBIR_SIMD

#define stbir__1_coeff_only()             \
    stbir__simdf tot,c;                   \
    STBIR_SIMD_NO_UNROLL(decode);         \
    stbir__simdf_load1( c, hc );          \
    stbir__simdf_0123to0000( c, c );      \
    stbir__simdf_mult_mem( tot, c, decode );

#define stbir__2_coeff_only()                       \
    stbir__simdf tot,c,cs;                          \
    STBIR_SIMD_NO_UNROLL(decode);                   \
    stbir__simdf_load2( cs, hc );                   \
    stbir__simdf_0123to0000( c, cs );               \
    stbir__simdf_mult_mem( tot, c, decode );        \
    stbir__simdf_0123to1111( c, cs );               \
    stbir__simdf_madd_mem( tot, tot, c, decode+4 );

#define stbir__3_coeff_only()                       \
    stbir__simdf tot,c,cs;                          \
    STBIR_SIMD_NO_UNROLL(decode);                   \
    stbir__simdf_load( cs, hc );                    \
    stbir__simdf_0123to0000( c, cs );               \
    stbir__simdf_mult_mem( tot, c, decode );        \
    stbir__simdf_0123to1111( c, cs );               \
    stbir__simdf_madd_mem( tot, tot, c, decode+4 ); \
    stbir__simdf_0123to2222( c, cs );               \
    stbir__simdf_madd_mem( tot, tot, c, decode+8 );

#define stbir__store_output_tiny()                \
    stbir__simdf_store( output, tot );            \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 4;

#ifdef STBIR_SIMD8

#define stbir__4_coeff_start()                     \
    stbir__simdf8 tot0,c,cs; stbir__simdf t;       \
    STBIR_SIMD_NO_UNROLL(decode);                  \
    stbir__simdf8_load4b( cs, hc );                \
    stbir__simdf8_0123to00001111( c, cs );         \
    stbir__simdf8_mult_mem( tot0, c, decode );     \
    stbir__simdf8_0123to22223333( c, cs );         \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+8 );

#define stbir__4_coeff_continue_from_4( ofs )                  \
    STBIR_SIMD_NO_UNROLL(decode);                              \
    stbir__simdf8_load4b( cs, hc + (ofs) );                    \
    stbir__simdf8_0123to00001111( c, cs );                     \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 );   \
    stbir__simdf8_0123to22223333( c, cs );                     \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 );

#define stbir__1_coeff_remnant( ofs )                          \
    STBIR_SIMD_NO_UNROLL(decode);                              \
    stbir__simdf_load1rep4( t, hc + (ofs) );                   \
    stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4 );

#define stbir__2_coeff_remnant( ofs )                          \
    STBIR_SIMD_NO_UNROLL(decode);                              \
    stbir__simdf8_load4b( cs, hc + (ofs) - 2 );                \
    stbir__simdf8_0123to22223333( c, cs );                     \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 );

#define stbir__3_coeff_remnant( ofs )                          \
    STBIR_SIMD_NO_UNROLL(decode);                              \
    stbir__simdf8_load4b( cs, hc + (ofs) );                    \
    stbir__simdf8_0123to00001111( c, cs );                     \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 );   \
    stbir__simdf8_0123to2222( t, cs );                         \
    stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4+8 );

#define stbir__store_output()                      \
    stbir__simdf8_add4halves( t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0 );     \
    stbir__simdf_store( output, t );               \
    horizontal_coefficients += coefficient_width;  \
    ++horizontal_contributors;                     \
    output += 4;

#else

#define stbir__4_coeff_start()                        \
    stbir__simdf tot0,tot1,c,cs;                      \
    STBIR_SIMD_NO_UNROLL(decode);                     \
    stbir__simdf_load( cs, hc );                      \
    stbir__simdf_0123to0000( c, cs );                 \
    stbir__simdf_mult_mem( tot0, c, decode );         \
    stbir__simdf_0123to1111( c, cs );                 \
    stbir__simdf_mult_mem( tot1, c, decode+4 );       \
    stbir__simdf_0123to2222( c, cs );                 \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+8 ); \
    stbir__simdf_0123to3333( c, cs );                 \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+12 );

#define stbir__4_coeff_continue_from_4( ofs )                  \
    STBIR_SIMD_NO_UNROLL(decode);                              \
    stbir__simdf_load( cs, hc + (ofs) );                       \
    stbir__simdf_0123to0000( c, cs );                          \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 );    \
    stbir__simdf_0123to1111( c, cs );                          \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 );  \
    stbir__simdf_0123to2222( c, cs );                          \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 );  \
    stbir__simdf_0123to3333( c, cs );                          \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+12 );

#define stbir__1_coeff_remnant( ofs )                       \
    STBIR_SIMD_NO_UNROLL(decode);                           \
    stbir__simdf_load1( c, hc + (ofs) );                    \
    stbir__simdf_0123to0000( c, c );                        \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 );

#define stbir__2_coeff_remnant( ofs )                         \
    STBIR_SIMD_NO_UNROLL(decode);                             \
    stbir__simdf_load2( cs, hc + (ofs) );                     \
    stbir__simdf_0123to0000( c, cs );                         \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 );   \
    stbir__simdf_0123to1111( c, cs );                         \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 );

#define stbir__3_coeff_remnant( ofs )                          \
    STBIR_SIMD_NO_UNROLL(decode);                              \
    stbir__simdf_load( cs, hc + (ofs) );                       \
    stbir__simdf_0123to0000( c, cs );                          \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 );    \
    stbir__simdf_0123to1111( c, cs );                          \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 );  \
    stbir__simdf_0123to2222( c, cs );                          \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 );

#define stbir__store_output()                     \
    stbir__simdf_add( tot0, tot0, tot1 );         \
    stbir__simdf_store( output, tot0 );           \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 4;

#endif

#else

#define stbir__1_coeff_only()         \
    float p0,p1,p2,p3,c;              \
    STBIR_SIMD_NO_UNROLL(decode);     \
    c = hc[0];                        \
    p0 = decode[0] * c;               \
    p1 = decode[1] * c;               \
    p2 = decode[2] * c;               \
    p3 = decode[3] * c;

#define stbir__2_coeff_only()         \
    float p0,p1,p2,p3,c;              \
    STBIR_SIMD_NO_UNROLL(decode);     \
    c = hc[0];                        \
    p0 = decode[0] * c;               \
    p1 = decode[1] * c;               \
    p2 = decode[2] * c;               \
    p3 = decode[3] * c;               \
    c = hc[1];                        \
    p0 += decode[4] * c;              \
    p1 += decode[5] * c;              \
    p2 += decode[6] * c;              \
    p3 += decode[7] * c;

#define stbir__3_coeff_only()         \
    float p0,p1,p2,p3,c;              \
    STBIR_SIMD_NO_UNROLL(decode);     \
    c = hc[0];                        \
    p0 = decode[0] * c;               \
    p1 = decode[1] * c;               \
    p2 = decode[2] * c;               \
    p3 = decode[3] * c;               \
    c = hc[1];                        \
    p0 += decode[4] * c;              \
    p1 += decode[5] * c;              \
    p2 += decode[6] * c;              \
    p3 += decode[7] * c;              \
    c = hc[2];                        \
    p0 += decode[8] * c;              \
    p1 += decode[9] * c;              \
    p2 += decode[10] * c;             \
    p3 += decode[11] * c;

#define stbir__store_output_tiny()                \
    output[0] = p0;                               \
    output[1] = p1;                               \
    output[2] = p2;                               \
    output[3] = p3;                               \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 4;

#define stbir__4_coeff_start()        \
    float x0,x1,x2,x3,y0,y1,y2,y3,c;  \
    STBIR_SIMD_NO_UNROLL(decode);     \
    c = hc[0];                        \
    x0 = decode[0] * c;               \
    x1 = decode[1] * c;               \
    x2 = decode[2] * c;               \
    x3 = decode[3] * c;               \
    c = hc[1];                        \
    y0 = decode[4] * c;               \
    y1 = decode[5] * c;               \
    y2 = decode[6] * c;               \
    y3 = decode[7] * c;               \
    c = hc[2];                        \
    x0 += decode[8] * c;              \
    x1 += decode[9] * c;              \
    x2 += decode[10] * c;             \
    x3 += decode[11] * c;             \
    c = hc[3];                        \
    y0 += decode[12] * c;             \
    y1 += decode[13] * c;             \
    y2 += decode[14] * c;             \
    y3 += decode[15] * c;

#define stbir__4_coeff_continue_from_4( ofs ) \
    STBIR_SIMD_NO_UNROLL(decode);     \
    c = hc[0+(ofs)];                  \
    x0 += decode[0+(ofs)*4] * c;      \
    x1 += decode[1+(ofs)*4] * c;      \
    x2 += decode[2+(ofs)*4] * c;      \
    x3 += decode[3+(ofs)*4] * c;      \
    c = hc[1+(ofs)];                  \
    y0 += decode[4+(ofs)*4] * c;      \
    y1 += decode[5+(ofs)*4] * c;      \
    y2 += decode[6+(ofs)*4] * c;      \
    y3 += decode[7+(ofs)*4] * c;      \
    c = hc[2+(ofs)];                  \
    x0 += decode[8+(ofs)*4] * c;      \
    x1 += decode[9+(ofs)*4] * c;      \
    x2 += decode[10+(ofs)*4] * c;     \
    x3 += decode[11+(ofs)*4] * c;     \
    c = hc[3+(ofs)];                  \
    y0 += decode[12+(ofs)*4] * c;     \
    y1 += decode[13+(ofs)*4] * c;     \
    y2 += decode[14+(ofs)*4] * c;     \
    y3 += decode[15+(ofs)*4] * c;

#define stbir__1_coeff_remnant( ofs ) \
    STBIR_SIMD_NO_UNROLL(decode);     \
    c = hc[0+(ofs)];                  \
    x0 += decode[0+(ofs)*4] * c;      \
    x1 += decode[1+(ofs)*4] * c;      \
    x2 += decode[2+(ofs)*4] * c;      \
    x3 += decode[3+(ofs)*4] * c;

#define stbir__2_coeff_remnant( ofs ) \
    STBIR_SIMD_NO_UNROLL(decode);     \
    c = hc[0+(ofs)];                  \
    x0 += decode[0+(ofs)*4] * c;      \
    x1 += decode[1+(ofs)*4] * c;      \
    x2 += decode[2+(ofs)*4] * c;      \
    x3 += decode[3+(ofs)*4] * c;      \
    c = hc[1+(ofs)];                  \
    y0 += decode[4+(ofs)*4] * c;      \
    y1 += decode[5+(ofs)*4] * c;      \
    y2 += decode[6+(ofs)*4] * c;      \
    y3 += decode[7+(ofs)*4] * c;

#define stbir__3_coeff_remnant( ofs ) \
    STBIR_SIMD_NO_UNROLL(decode);     \
    c = hc[0+(ofs)];                  \
    x0 += decode[0+(ofs)*4] * c;      \
    x1 += decode[1+(ofs)*4] * c;      \
    x2 += decode[2+(ofs)*4] * c;      \
    x3 += decode[3+(ofs)*4] * c;      \
    c = hc[1+(ofs)];                  \
    y0 += decode[4+(ofs)*4] * c;      \
    y1 += decode[5+(ofs)*4] * c;      \
    y2 += decode[6+(ofs)*4] * c;      \
    y3 += decode[7+(ofs)*4] * c;      \
    c = hc[2+(ofs)];                  \
    x0 += decode[8+(ofs)*4] * c;      \
    x1 += decode[9+(ofs)*4] * c;      \
    x2 += decode[10+(ofs)*4] * c;     \
    x3 += decode[11+(ofs)*4] * c;

#define stbir__store_output()                     \
    output[0] = x0 + y0;                          \
    output[1] = x1 + y1;                          \
    output[2] = x2 + y2;                          \
    output[3] = x3 + y3;                          \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 4;

#endif

#define STBIR__horizontal_channels 4
5609#define STB_IMAGE_RESIZE_DO_HORIZONTALS
5610#include STBIR__HEADER_FILENAME
5611
5612
5613
//=================
// Do 7 channel horizontal routines

#ifdef STBIR_SIMD

#define stbir__1_coeff_only()                   \
    stbir__simdf tot0,tot1,c;                   \
    STBIR_SIMD_NO_UNROLL(decode);               \
    stbir__simdf_load1( c, hc );                \
    stbir__simdf_0123to0000( c, c );            \
    stbir__simdf_mult_mem( tot0, c, decode );   \
    stbir__simdf_mult_mem( tot1, c, decode+3 );

#define stbir__2_coeff_only()                         \
    stbir__simdf tot0,tot1,c,cs;                      \
    STBIR_SIMD_NO_UNROLL(decode);                     \
    stbir__simdf_load2( cs, hc );                     \
    stbir__simdf_0123to0000( c, cs );                 \
    stbir__simdf_mult_mem( tot0, c, decode );         \
    stbir__simdf_mult_mem( tot1, c, decode+3 );       \
    stbir__simdf_0123to1111( c, cs );                 \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+7 ); \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+10 );

#define stbir__3_coeff_only()                           \
    stbir__simdf tot0,tot1,c,cs;                        \
    STBIR_SIMD_NO_UNROLL(decode);                       \
    stbir__simdf_load( cs, hc );                        \
    stbir__simdf_0123to0000( c, cs );                   \
    stbir__simdf_mult_mem( tot0, c, decode );           \
    stbir__simdf_mult_mem( tot1, c, decode+3 );         \
    stbir__simdf_0123to1111( c, cs );                   \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+7 );   \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+10 );  \
    stbir__simdf_0123to2222( c, cs );                   \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+14 );  \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+17 );

#define stbir__store_output_tiny()                \
    stbir__simdf_store( output+3, tot1 );         \
    stbir__simdf_store( output, tot0 );           \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 7;

#ifdef STBIR_SIMD8

#define stbir__4_coeff_start()                     \
    stbir__simdf8 tot0,tot1,c,cs;                  \
    STBIR_SIMD_NO_UNROLL(decode);                  \
    stbir__simdf8_load4b( cs, hc );                \
    stbir__simdf8_0123to00000000( c, cs );         \
    stbir__simdf8_mult_mem( tot0, c, decode );     \
    stbir__simdf8_0123to11111111( c, cs );         \
    stbir__simdf8_mult_mem( tot1, c, decode+7 );   \
    stbir__simdf8_0123to22222222( c, cs );         \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+14 );  \
    stbir__simdf8_0123to33333333( c, cs );         \
    stbir__simdf8_madd_mem( tot1, tot1, c, decode+21 );

#define stbir__4_coeff_continue_from_4( ofs )                   \
    STBIR_SIMD_NO_UNROLL(decode);                               \
    stbir__simdf8_load4b( cs, hc + (ofs) );                     \
    stbir__simdf8_0123to00000000( c, cs );                      \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 );    \
    stbir__simdf8_0123to11111111( c, cs );                      \
    stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 );  \
    stbir__simdf8_0123to22222222( c, cs );                      \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); \
    stbir__simdf8_0123to33333333( c, cs );                      \
    stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+21 );

#define stbir__1_coeff_remnant( ofs )                           \
    STBIR_SIMD_NO_UNROLL(decode);                               \
    stbir__simdf8_load1b( c, hc + (ofs) );                      \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 );

#define stbir__2_coeff_remnant( ofs )                           \
    STBIR_SIMD_NO_UNROLL(decode);                               \
    stbir__simdf8_load1b( c, hc + (ofs) );                      \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 );    \
    stbir__simdf8_load1b( c, hc + (ofs)+1 );                    \
    stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 );

#define stbir__3_coeff_remnant( ofs )                           \
    STBIR_SIMD_NO_UNROLL(decode);                               \
    stbir__simdf8_load4b( cs, hc + (ofs) );                     \
    stbir__simdf8_0123to00000000( c, cs );                      \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 );    \
    stbir__simdf8_0123to11111111( c, cs );                      \
    stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 );  \
    stbir__simdf8_0123to22222222( c, cs );                      \
    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 );

#define stbir__store_output()                     \
    stbir__simdf8_add( tot0, tot0, tot1 );        \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 7;                                  \
    if ( output < output_end )                    \
    {                                             \
      stbir__simdf8_store( output-7, tot0 );      \
      continue;                                   \
    }                                             \
    stbir__simdf_store( output-7+3, stbir__simdf_swiz(stbir__simdf8_gettop4(tot0),0,0,1,2) ); \
    stbir__simdf_store( output-7, stbir__if_simdf8_cast_to_simdf4(tot0) );           \
    break;

#else

#define stbir__4_coeff_start()                    \
    stbir__simdf tot0,tot1,tot2,tot3,c,cs;        \
    STBIR_SIMD_NO_UNROLL(decode);                 \
    stbir__simdf_load( cs, hc );                  \
    stbir__simdf_0123to0000( c, cs );             \
    stbir__simdf_mult_mem( tot0, c, decode );     \
    stbir__simdf_mult_mem( tot1, c, decode+3 );   \
    stbir__simdf_0123to1111( c, cs );             \
    stbir__simdf_mult_mem( tot2, c, decode+7 );   \
    stbir__simdf_mult_mem( tot3, c, decode+10 );  \
    stbir__simdf_0123to2222( c, cs );             \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+14 );  \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+17 );  \
    stbir__simdf_0123to3333( c, cs );                   \
    stbir__simdf_madd_mem( tot2, tot2, c, decode+21 );  \
    stbir__simdf_madd_mem( tot3, tot3, c, decode+24 );

#define stbir__4_coeff_continue_from_4( ofs )                   \
    STBIR_SIMD_NO_UNROLL(decode);                               \
    stbir__simdf_load( cs, hc + (ofs) );                        \
    stbir__simdf_0123to0000( c, cs );                           \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 );     \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 );   \
    stbir__simdf_0123to1111( c, cs );                           \
    stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 );   \
    stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 );  \
    stbir__simdf_0123to2222( c, cs );                           \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 );  \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 );  \
    stbir__simdf_0123to3333( c, cs );                           \
    stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+21 );  \
    stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+24 );

#define stbir__1_coeff_remnant( ofs )                           \
    STBIR_SIMD_NO_UNROLL(decode);                               \
    stbir__simdf_load1( c, hc + (ofs) );                        \
    stbir__simdf_0123to0000( c, c );                            \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 );     \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 );

#define stbir__2_coeff_remnant( ofs )                           \
    STBIR_SIMD_NO_UNROLL(decode);                               \
    stbir__simdf_load2( cs, hc + (ofs) );                       \
    stbir__simdf_0123to0000( c, cs );                           \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 );     \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 );   \
    stbir__simdf_0123to1111( c, cs );                           \
    stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 );   \
    stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 );

#define stbir__3_coeff_remnant( ofs )                           \
    STBIR_SIMD_NO_UNROLL(decode);                               \
    stbir__simdf_load( cs, hc + (ofs) );                        \
    stbir__simdf_0123to0000( c, cs );                           \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 );     \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 );   \
    stbir__simdf_0123to1111( c, cs );                           \
    stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 );   \
    stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 );  \
    stbir__simdf_0123to2222( c, cs );                           \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 );  \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 );

#define stbir__store_output()                     \
    stbir__simdf_add( tot0, tot0, tot2 );         \
    stbir__simdf_add( tot1, tot1, tot3 );         \
    stbir__simdf_store( output+3, tot1 );         \
    stbir__simdf_store( output, tot0 );           \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 7;

#endif

#else

#define stbir__1_coeff_only()        \
    float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \
    c = hc[0];                       \
    tot0 = decode[0]*c;              \
    tot1 = decode[1]*c;              \
    tot2 = decode[2]*c;              \
    tot3 = decode[3]*c;              \
    tot4 = decode[4]*c;              \
    tot5 = decode[5]*c;              \
    tot6 = decode[6]*c;

#define stbir__2_coeff_only()        \
    float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \
    c = hc[0];                       \
    tot0 = decode[0]*c;              \
    tot1 = decode[1]*c;              \
    tot2 = decode[2]*c;              \
    tot3 = decode[3]*c;              \
    tot4 = decode[4]*c;              \
    tot5 = decode[5]*c;              \
    tot6 = decode[6]*c;              \
    c = hc[1];                       \
    tot0 += decode[7]*c;             \
    tot1 += decode[8]*c;             \
    tot2 += decode[9]*c;             \
    tot3 += decode[10]*c;            \
    tot4 += decode[11]*c;            \
    tot5 += decode[12]*c;            \
    tot6 += decode[13]*c;

#define stbir__3_coeff_only()        \
    float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \
    c = hc[0];                       \
    tot0 = decode[0]*c;              \
    tot1 = decode[1]*c;              \
    tot2 = decode[2]*c;              \
    tot3 = decode[3]*c;              \
    tot4 = decode[4]*c;              \
    tot5 = decode[5]*c;              \
    tot6 = decode[6]*c;              \
    c = hc[1];                       \
    tot0 += decode[7]*c;             \
    tot1 += decode[8]*c;             \
    tot2 += decode[9]*c;             \
    tot3 += decode[10]*c;            \
    tot4 += decode[11]*c;            \
    tot5 += decode[12]*c;            \
    tot6 += decode[13]*c;            \
    c = hc[2];                       \
    tot0 += decode[14]*c;            \
    tot1 += decode[15]*c;            \
    tot2 += decode[16]*c;            \
    tot3 += decode[17]*c;            \
    tot4 += decode[18]*c;            \
    tot5 += decode[19]*c;            \
    tot6 += decode[20]*c;

#define stbir__store_output_tiny()                \
    output[0] = tot0;                             \
    output[1] = tot1;                             \
    output[2] = tot2;                             \
    output[3] = tot3;                             \
    output[4] = tot4;                             \
    output[5] = tot5;                             \
    output[6] = tot6;                             \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 7;

#define stbir__4_coeff_start()    \
    float x0,x1,x2,x3,x4,x5,x6,y0,y1,y2,y3,y4,y5,y6,c; \
    STBIR_SIMD_NO_UNROLL(decode); \
    c = hc[0];                    \
    x0 = decode[0] * c;           \
    x1 = decode[1] * c;           \
    x2 = decode[2] * c;           \
    x3 = decode[3] * c;           \
    x4 = decode[4] * c;           \
    x5 = decode[5] * c;           \
    x6 = decode[6] * c;           \
    c = hc[1];                    \
    y0 = decode[7] * c;           \
    y1 = decode[8] * c;           \
    y2 = decode[9] * c;           \
    y3 = decode[10] * c;          \
    y4 = decode[11] * c;          \
    y5 = decode[12] * c;          \
    y6 = decode[13] * c;          \
    c = hc[2];                    \
    x0 += decode[14] * c;         \
    x1 += decode[15] * c;         \
    x2 += decode[16] * c;         \
    x3 += decode[17] * c;         \
    x4 += decode[18] * c;         \
    x5 += decode[19] * c;         \
    x6 += decode[20] * c;         \
    c = hc[3];                    \
    y0 += decode[21] * c;         \
    y1 += decode[22] * c;         \
    y2 += decode[23] * c;         \
    y3 += decode[24] * c;         \
    y4 += decode[25] * c;         \
    y5 += decode[26] * c;         \
    y6 += decode[27] * c;

#define stbir__4_coeff_continue_from_4( ofs ) \
    STBIR_SIMD_NO_UNROLL(decode);  \
    c = hc[0+(ofs)];               \
    x0 += decode[0+(ofs)*7] * c;   \
    x1 += decode[1+(ofs)*7] * c;   \
    x2 += decode[2+(ofs)*7] * c;   \
    x3 += decode[3+(ofs)*7] * c;   \
    x4 += decode[4+(ofs)*7] * c;   \
    x5 += decode[5+(ofs)*7] * c;   \
    x6 += decode[6+(ofs)*7] * c;   \
    c = hc[1+(ofs)];               \
    y0 += decode[7+(ofs)*7] * c;   \
    y1 += decode[8+(ofs)*7] * c;   \
    y2 += decode[9+(ofs)*7] * c;   \
    y3 += decode[10+(ofs)*7] * c;  \
    y4 += decode[11+(ofs)*7] * c;  \
    y5 += decode[12+(ofs)*7] * c;  \
    y6 += decode[13+(ofs)*7] * c;  \
    c = hc[2+(ofs)];               \
    x0 += decode[14+(ofs)*7] * c;  \
    x1 += decode[15+(ofs)*7] * c;  \
    x2 += decode[16+(ofs)*7] * c;  \
    x3 += decode[17+(ofs)*7] * c;  \
    x4 += decode[18+(ofs)*7] * c;  \
    x5 += decode[19+(ofs)*7] * c;  \
    x6 += decode[20+(ofs)*7] * c;  \
    c = hc[3+(ofs)];               \
    y0 += decode[21+(ofs)*7] * c;  \
    y1 += decode[22+(ofs)*7] * c;  \
    y2 += decode[23+(ofs)*7] * c;  \
    y3 += decode[24+(ofs)*7] * c;  \
    y4 += decode[25+(ofs)*7] * c;  \
    y5 += decode[26+(ofs)*7] * c;  \
    y6 += decode[27+(ofs)*7] * c;

#define stbir__1_coeff_remnant( ofs ) \
    STBIR_SIMD_NO_UNROLL(decode);  \
    c = hc[0+(ofs)];               \
    x0 += decode[0+(ofs)*7] * c;   \
    x1 += decode[1+(ofs)*7] * c;   \
    x2 += decode[2+(ofs)*7] * c;   \
    x3 += decode[3+(ofs)*7] * c;   \
    x4 += decode[4+(ofs)*7] * c;   \
    x5 += decode[5+(ofs)*7] * c;   \
    x6 += decode[6+(ofs)*7] * c;

#define stbir__2_coeff_remnant( ofs ) \
    STBIR_SIMD_NO_UNROLL(decode);  \
    c = hc[0+(ofs)];               \
    x0 += decode[0+(ofs)*7] * c;   \
    x1 += decode[1+(ofs)*7] * c;   \
    x2 += decode[2+(ofs)*7] * c;   \
    x3 += decode[3+(ofs)*7] * c;   \
    x4 += decode[4+(ofs)*7] * c;   \
    x5 += decode[5+(ofs)*7] * c;   \
    x6 += decode[6+(ofs)*7] * c;   \
    c = hc[1+(ofs)];               \
    y0 += decode[7+(ofs)*7] * c;   \
    y1 += decode[8+(ofs)*7] * c;   \
    y2 += decode[9+(ofs)*7] * c;   \
    y3 += decode[10+(ofs)*7] * c;  \
    y4 += decode[11+(ofs)*7] * c;  \
    y5 += decode[12+(ofs)*7] * c;  \
    y6 += decode[13+(ofs)*7] * c;

#define stbir__3_coeff_remnant( ofs ) \
    STBIR_SIMD_NO_UNROLL(decode);  \
    c = hc[0+(ofs)];               \
    x0 += decode[0+(ofs)*7] * c;   \
    x1 += decode[1+(ofs)*7] * c;   \
    x2 += decode[2+(ofs)*7] * c;   \
    x3 += decode[3+(ofs)*7] * c;   \
    x4 += decode[4+(ofs)*7] * c;   \
    x5 += decode[5+(ofs)*7] * c;   \
    x6 += decode[6+(ofs)*7] * c;   \
    c = hc[1+(ofs)];               \
    y0 += decode[7+(ofs)*7] * c;   \
    y1 += decode[8+(ofs)*7] * c;   \
    y2 += decode[9+(ofs)*7] * c;   \
    y3 += decode[10+(ofs)*7] * c;  \
    y4 += decode[11+(ofs)*7] * c;  \
    y5 += decode[12+(ofs)*7] * c;  \
    y6 += decode[13+(ofs)*7] * c;  \
    c = hc[2+(ofs)];               \
    x0 += decode[14+(ofs)*7] * c;  \
    x1 += decode[15+(ofs)*7] * c;  \
    x2 += decode[16+(ofs)*7] * c;  \
    x3 += decode[17+(ofs)*7] * c;  \
    x4 += decode[18+(ofs)*7] * c;  \
    x5 += decode[19+(ofs)*7] * c;  \
    x6 += decode[20+(ofs)*7] * c;

#define stbir__store_output()                     \
    output[0] = x0 + y0;                          \
    output[1] = x1 + y1;                          \
    output[2] = x2 + y2;                          \
    output[3] = x3 + y3;                          \
    output[4] = x4 + y4;                          \
    output[5] = x5 + y5;                          \
    output[6] = x6 + y6;                          \
    horizontal_coefficients += coefficient_width; \
    ++horizontal_contributors;                    \
    output += 7;

#endif

#define STBIR__horizontal_channels 7
#define STB_IMAGE_RESIZE_DO_HORIZONTALS
#include STBIR__HEADER_FILENAME


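/*
   Illustrative sketch only, not part of the library. The scalar fallback macros
   above unroll a 7-channel weighted sum by hand and keep two accumulator sets
   (x0..x6 and y0..y6) that alternate between coefficients and are added together
   in stbir__store_output. The loop form below restates that idea for any number
   of coefficients. The name example_gather_pixel7 is hypothetical, and the sketch
   assumes decode holds n pixels of 7 packed floats and hc holds n coefficients.

```c
// One 7-channel output pixel from n coefficients. Even-numbered coefficients
// accumulate into x, odd-numbered ones into y, and the two are summed at the end.
static void example_gather_pixel7( float * output, float const * decode, float const * hc, int n )
{
  float x[ 7 ] = { 0 }, y[ 7 ] = { 0 };
  int i, ch;
  for( i = 0 ; i < n ; i++ )
  {
    float * acc = ( i & 1 ) ? y : x;
    for( ch = 0 ; ch < 7 ; ch++ )
      acc[ ch ] += decode[ i * 7 + ch ] * hc[ i ];
  }
  for( ch = 0 ; ch < 7 ; ch++ )
    output[ ch ] = x[ ch ] + y[ ch ];
}
```

   Splitting the sum across two accumulators shortens the dependency chain between
   consecutive multiply-adds, which is presumably why the macros are written this way.
*/
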
// include all of the vertical resamplers (both scatter and gather versions)

#define STBIR__vertical_channels 1
#define STB_IMAGE_RESIZE_DO_VERTICALS
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 1
#define STB_IMAGE_RESIZE_DO_VERTICALS
#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 2
#define STB_IMAGE_RESIZE_DO_VERTICALS
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 2
#define STB_IMAGE_RESIZE_DO_VERTICALS
#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 3
#define STB_IMAGE_RESIZE_DO_VERTICALS
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 3
#define STB_IMAGE_RESIZE_DO_VERTICALS
#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 4
#define STB_IMAGE_RESIZE_DO_VERTICALS
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 4
#define STB_IMAGE_RESIZE_DO_VERTICALS
#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 5
#define STB_IMAGE_RESIZE_DO_VERTICALS
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 5
#define STB_IMAGE_RESIZE_DO_VERTICALS
#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 6
#define STB_IMAGE_RESIZE_DO_VERTICALS
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 6
#define STB_IMAGE_RESIZE_DO_VERTICALS
#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 7
#define STB_IMAGE_RESIZE_DO_VERTICALS
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 7
#define STB_IMAGE_RESIZE_DO_VERTICALS
#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 8
#define STB_IMAGE_RESIZE_DO_VERTICALS
#include STBIR__HEADER_FILENAME

#define STBIR__vertical_channels 8
#define STB_IMAGE_RESIZE_DO_VERTICALS
#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#include STBIR__HEADER_FILENAME

typedef void STBIR_VERTICAL_GATHERFUNC( float * output, float const * coeffs, float const ** inputs, float const * input0_end );

static STBIR_VERTICAL_GATHERFUNC * stbir__vertical_gathers[ 8 ] =
{
  stbir__vertical_gather_with_1_coeffs,
  stbir__vertical_gather_with_2_coeffs,
  stbir__vertical_gather_with_3_coeffs,
  stbir__vertical_gather_with_4_coeffs,
  stbir__vertical_gather_with_5_coeffs,
  stbir__vertical_gather_with_6_coeffs,
  stbir__vertical_gather_with_7_coeffs,
  stbir__vertical_gather_with_8_coeffs
};

static STBIR_VERTICAL_GATHERFUNC * stbir__vertical_gathers_continues[ 8 ] =
{
  stbir__vertical_gather_with_1_coeffs_cont,
  stbir__vertical_gather_with_2_coeffs_cont,
  stbir__vertical_gather_with_3_coeffs_cont,
  stbir__vertical_gather_with_4_coeffs_cont,
  stbir__vertical_gather_with_5_coeffs_cont,
  stbir__vertical_gather_with_6_coeffs_cont,
  stbir__vertical_gather_with_7_coeffs_cont,
  stbir__vertical_gather_with_8_coeffs_cont
};

typedef void STBIR_VERTICAL_SCATTERFUNC( float ** outputs, float const * coeffs, float const * input, float const * input_end );

static STBIR_VERTICAL_SCATTERFUNC * stbir__vertical_scatter_sets[ 8 ] =
{
  stbir__vertical_scatter_with_1_coeffs,
  stbir__vertical_scatter_with_2_coeffs,
  stbir__vertical_scatter_with_3_coeffs,
  stbir__vertical_scatter_with_4_coeffs,
  stbir__vertical_scatter_with_5_coeffs,
  stbir__vertical_scatter_with_6_coeffs,
  stbir__vertical_scatter_with_7_coeffs,
  stbir__vertical_scatter_with_8_coeffs
};

static STBIR_VERTICAL_SCATTERFUNC * stbir__vertical_scatter_blends[ 8 ] =
{
  stbir__vertical_scatter_with_1_coeffs_cont,
  stbir__vertical_scatter_with_2_coeffs_cont,
  stbir__vertical_scatter_with_3_coeffs_cont,
  stbir__vertical_scatter_with_4_coeffs_cont,
  stbir__vertical_scatter_with_5_coeffs_cont,
  stbir__vertical_scatter_with_6_coeffs_cont,
  stbir__vertical_scatter_with_7_coeffs_cont,
  stbir__vertical_scatter_with_8_coeffs_cont
};


static void stbir__encode_scanline( stbir__info const * stbir_info, void *output_buffer_data, float * encode_buffer, int row  STBIR_ONLY_PROFILE_GET_SPLIT_INFO )
{
  int num_pixels = stbir_info->horizontal.scale_info.output_sub_size;
  int channels = stbir_info->channels;
  int width_times_channels = num_pixels * channels;
  void * output_buffer;

  // un-alpha weight if we need to
  if ( stbir_info->alpha_unweight )
  {
    STBIR_PROFILE_START( unalpha );
    stbir_info->alpha_unweight( encode_buffer, width_times_channels );
    STBIR_PROFILE_END( unalpha );
  }

  // write directly into output by default
  output_buffer = output_buffer_data;

  // if we have an output callback, we first convert the encode buffer in place (and then hand that to the callback)
  if ( stbir_info->out_pixels_cb )
    output_buffer = encode_buffer;

  STBIR_PROFILE_START( encode );
  // convert into the output buffer
  stbir_info->encode_pixels( output_buffer, width_times_channels, encode_buffer );
  STBIR_PROFILE_END( encode );

  // if we have an output callback, call it to send the data
  if ( stbir_info->out_pixels_cb )
    stbir_info->out_pixels_cb( output_buffer, num_pixels, row, stbir_info->user_data );
}


// Get the ring buffer pointer for an index
static float* stbir__get_ring_buffer_entry(stbir__info const * stbir_info, stbir__per_split_info const * split_info, int index )
{
  STBIR_ASSERT( index < stbir_info->ring_buffer_num_entries );

  #ifdef STBIR__SEPARATE_ALLOCATIONS
    return split_info->ring_buffers[ index ];
  #else
    return (float*) ( ( (char*) split_info->ring_buffer ) + ( index * stbir_info->ring_buffer_length_bytes ) );
  #endif
}

// Get the specified scan line from the ring buffer
static float* stbir__get_ring_buffer_scanline(stbir__info const * stbir_info, stbir__per_split_info const * split_info, int get_scanline)
{
  int ring_buffer_index = (split_info->ring_buffer_begin_index + (get_scanline - split_info->ring_buffer_first_scanline)) % stbir_info->ring_buffer_num_entries;
  return stbir__get_ring_buffer_entry( stbir_info, split_info, ring_buffer_index );
}

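/*
   Illustrative sketch only, not part of the library. The two helpers above map a
   scanline number to a ring buffer slot by offsetting from ring_buffer_begin_index
   and wrapping modulo ring_buffer_num_entries. The standalone version below restates
   that arithmetic with plain ints. The names example_ring and example_ring_slot are
   hypothetical, and the sketch assumes the requested scanline is not older than the
   oldest one still held.

```c
#include <assert.h>

// Hypothetical stand-in for the ring buffer fields used by the library.
typedef struct
{
  int begin_index;     // slot that holds first_scanline
  int first_scanline;  // oldest scanline still resident
  int num_entries;     // number of slots in the ring
} example_ring;

// Map a scanline number to its slot, wrapping around the end of the ring.
static int example_ring_slot( example_ring const * r, int scanline )
{
  assert( scanline >= r->first_scanline );
  return ( r->begin_index + ( scanline - r->first_scanline ) ) % r->num_entries;
}
```

   With begin_index 2, first_scanline 10 and 4 entries, scanline 10 maps to slot 2
   and scanline 13 wraps around to slot 1.
*/
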
static void stbir__resample_horizontal_gather(stbir__info const * stbir_info, float* output_buffer, float const * input_buffer STBIR_ONLY_PROFILE_GET_SPLIT_INFO )
{
  float const * decode_buffer = input_buffer - ( stbir_info->scanline_extents.conservative.n0 * stbir_info->effective_channels );

  STBIR_PROFILE_START( horizontal );
  if ( ( stbir_info->horizontal.filter_enum == STBIR_FILTER_POINT_SAMPLE ) && ( stbir_info->horizontal.scale_info.scale == 1.0f ) )
    STBIR_MEMCPY( output_buffer, input_buffer, stbir_info->horizontal.scale_info.output_sub_size * sizeof( float ) * stbir_info->effective_channels );
  else
    stbir_info->horizontal_gather_channels( output_buffer, stbir_info->horizontal.scale_info.output_sub_size, decode_buffer, stbir_info->horizontal.contributors, stbir_info->horizontal.coefficients, stbir_info->horizontal.coefficient_width );
  STBIR_PROFILE_END( horizontal );
}

static void stbir__resample_vertical_gather(stbir__info const * stbir_info, stbir__per_split_info* split_info, int n, int contrib_n0, int contrib_n1, float const * vertical_coefficients )
{
  float* encode_buffer = split_info->vertical_buffer;
  float* decode_buffer = split_info->decode_buffer;
  int vertical_first = stbir_info->vertical_first;
  int width = (vertical_first) ? ( stbir_info->scanline_extents.conservative.n1-stbir_info->scanline_extents.conservative.n0+1 ) : stbir_info->horizontal.scale_info.output_sub_size;
  int width_times_channels = stbir_info->effective_channels * width;

  STBIR_ASSERT( stbir_info->vertical.is_gather );

  // loop over the contributing scanlines and scale into the buffer
  STBIR_PROFILE_START( vertical );
  {
    int k = 0, total = contrib_n1 - contrib_n0 + 1;
    STBIR_ASSERT( total > 0 );
    do {
      float const * inputs[8];
      int i, cnt = total; if ( cnt > 8 ) cnt = 8;
      for( i = 0 ; i < cnt ; i++ )
        inputs[ i ] = stbir__get_ring_buffer_scanline(stbir_info, split_info, k+i+contrib_n0 );

      // call the N scanlines at a time function (up to 8 scanlines of blending at once)
      ((k==0)?stbir__vertical_gathers:stbir__vertical_gathers_continues)[cnt-1]( (vertical_first) ? decode_buffer : encode_buffer, vertical_coefficients + k, inputs, inputs[0] + width_times_channels );
      k += cnt;
      total -= cnt;
    } while ( total );
  }
  STBIR_PROFILE_END( vertical );

  if ( vertical_first )
  {
    // Now resample the gathered vertical data in the horizontal axis into the encode buffer
    stbir__resample_horizontal_gather(stbir_info, encode_buffer, decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
  }

  stbir__encode_scanline( stbir_info, ( (char *) stbir_info->output_data ) + ((size_t)n * (size_t)stbir_info->output_stride_bytes),
                          encode_buffer, n  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
}

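/*
   Illustrative sketch only, not part of the library. The loop in
   stbir__resample_vertical_gather consumes the contributing scanlines in groups of
   at most 8: the first group goes through a function that overwrites the output,
   and every later group goes through a "continue" variant that adds to it. The
   plain C version below mimics that control flow without the per-count
   specialization. The name example_gather_rows and its parameters are hypothetical.

```c
#include <assert.h>

// Weighted sum of `total` input rows into `out`, processed in groups of up to 8 rows.
// The first group overwrites `out`; later groups accumulate into it.
static void example_gather_rows( float * out, float const * coeffs, float const ** rows, int total, int width )
{
  int k = 0;
  assert( total > 0 );
  do {
    int i, x, cnt = ( total > 8 ) ? 8 : total;
    for( x = 0 ; x < width ; x++ )
    {
      float sum = ( k == 0 ) ? 0.0f : out[ x ];  // "set" on the first group, "continue" afterwards
      for( i = 0 ; i < cnt ; i++ )
        sum += rows[ k + i ][ x ] * coeffs[ k + i ];
      out[ x ] = sum;
    }
    k += cnt;
    total -= cnt;
  } while ( total );
}
```

   Ten rows are handled as one group of 8 followed by one group of 2, which matches
   how the library indexes its tables with cnt-1.
*/
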
static void stbir__decode_and_resample_for_vertical_gather_loop(stbir__info const * stbir_info, stbir__per_split_info* split_info, int n)
{
  int ring_buffer_index;
  float* ring_buffer;

  // Decode the nth scanline from the source image into the decode buffer.
  stbir__decode_scanline( stbir_info, n, split_info->decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );

  // update new end scanline
  split_info->ring_buffer_last_scanline = n;

  // get ring buffer
  ring_buffer_index = (split_info->ring_buffer_begin_index + (split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline)) % stbir_info->ring_buffer_num_entries;
  ring_buffer = stbir__get_ring_buffer_entry(stbir_info, split_info, ring_buffer_index);

  // Now resample it into the ring buffer.
  stbir__resample_horizontal_gather( stbir_info, ring_buffer, split_info->decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );

  // Now it's sitting in the ring buffer ready to be used as source for the vertical sampling.
}

static void stbir__vertical_gather_loop( stbir__info const * stbir_info, stbir__per_split_info* split_info, int split_count )
{
  int y, start_output_y, end_output_y;
  stbir__contributors* vertical_contributors = stbir_info->vertical.contributors;
  float const * vertical_coefficients = stbir_info->vertical.coefficients;

  STBIR_ASSERT( stbir_info->vertical.is_gather );

  start_output_y = split_info->start_output_y;
  end_output_y = split_info[split_count-1].end_output_y;

  vertical_contributors += start_output_y;
  vertical_coefficients += start_output_y * stbir_info->vertical.coefficient_width;

  // initialize the ring buffer for gathering
  split_info->ring_buffer_begin_index = 0;
  split_info->ring_buffer_first_scanline = vertical_contributors->n0;
  split_info->ring_buffer_last_scanline = split_info->ring_buffer_first_scanline - 1; // means "empty"

  for (y = start_output_y; y < end_output_y; y++)
  {
    int in_first_scanline, in_last_scanline;

    in_first_scanline = vertical_contributors->n0;
    in_last_scanline = vertical_contributors->n1;

    // make sure the indexing hasn't broken
    STBIR_ASSERT( in_first_scanline >= split_info->ring_buffer_first_scanline );

    // Load in new scanlines
    while (in_last_scanline > split_info->ring_buffer_last_scanline)
    {
      STBIR_ASSERT( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) <= stbir_info->ring_buffer_num_entries );

      // if the ring buffer is full, drop the oldest scanline to make room for the new one
      if ( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) == stbir_info->ring_buffer_num_entries )
      {
        split_info->ring_buffer_first_scanline++;
        split_info->ring_buffer_begin_index++;
      }

      if ( stbir_info->vertical_first )
      {
        float * ring_buffer = stbir__get_ring_buffer_scanline( stbir_info, split_info, ++split_info->ring_buffer_last_scanline );
        // Decode the next scanline from the source image directly into its ring buffer entry.
        stbir__decode_scanline( stbir_info, split_info->ring_buffer_last_scanline, ring_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
      }
      else
      {
        stbir__decode_and_resample_for_vertical_gather_loop(stbir_info, split_info, split_info->ring_buffer_last_scanline + 1);
      }
    }

    // Now all buffers should be ready to write a row of vertical sampling, so do it.
    stbir__resample_vertical_gather(stbir_info, split_info, y, in_first_scanline, in_last_scanline, vertical_coefficients );

    ++vertical_contributors;
    vertical_coefficients += stbir_info->vertical.coefficient_width;
  }
}

#define STBIR__FLOAT_EMPTY_MARKER 3.0e+38F
#define STBIR__FLOAT_BUFFER_IS_EMPTY(ptr) ((ptr)[0]==STBIR__FLOAT_EMPTY_MARKER)

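/*
   Illustrative sketch only, not part of the library. The marker defined above is
   written into the first float of a ring buffer entry to flag it as empty. In the
   scatter path that follows, the flag picks between a function that overwrites the
   entry and one that blends into it. The standalone version below shows the same
   idea for a single row. EXAMPLE_EMPTY_MARKER and example_scatter_row are
   hypothetical names, and the sketch assumes width is at least 1.

```c
#define EXAMPLE_EMPTY_MARKER 3.0e+38F  // same value as STBIR__FLOAT_EMPTY_MARKER

// Scatter one weighted input row into `out`: overwrite when `out` is flagged
// empty, otherwise add to what is already there.
static void example_scatter_row( float * out, float const * in, float coeff, int width )
{
  int x, empty = ( out[ 0 ] == EXAMPLE_EMPTY_MARKER );
  for( x = 0 ; x < width ; x++ )
    out[ x ] = ( empty ? 0.0f : out[ x ] ) + in[ x ] * coeff;
}
```

   Using an in-band sentinel avoids a separate "is this row initialized" array, at
   the cost of reserving one float value that real pixel data is assumed never to hit.
*/
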
6303static void stbir__encode_first_scanline_from_scatter(stbir__info const * stbir_info, stbir__per_split_info* split_info)
6304{
6305  // evict a scanline out into the output buffer
6306  float* ring_buffer_entry = stbir__get_ring_buffer_entry(stbir_info, split_info, split_info->ring_buffer_begin_index );
6307
6308  // dump the scanline out
6309  stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (size_t)split_info->ring_buffer_first_scanline * (size_t)stbir_info->output_stride_bytes ), ring_buffer_entry, split_info->ring_buffer_first_scanline  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
6310
6311  // mark it as empty
6312  ring_buffer_entry[ 0 ] = STBIR__FLOAT_EMPTY_MARKER;
6313
6314  // advance the first scanline
6315  split_info->ring_buffer_first_scanline++;
6316  if ( ++split_info->ring_buffer_begin_index == stbir_info->ring_buffer_num_entries )
6317    split_info->ring_buffer_begin_index = 0;
6318}
6319
6320static void stbir__horizontal_resample_and_encode_first_scanline_from_scatter(stbir__info const * stbir_info, stbir__per_split_info* split_info)
6321{
6322  // evict a scanline out into the output buffer
6323
6324  float* ring_buffer_entry = stbir__get_ring_buffer_entry(stbir_info, split_info, split_info->ring_buffer_begin_index );
6325
6326  // Now resample it into the buffer.
6327  stbir__resample_horizontal_gather( stbir_info, split_info->vertical_buffer, ring_buffer_entry  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
6328
6329  // dump the scanline out
6330  stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (size_t)split_info->ring_buffer_first_scanline * (size_t)stbir_info->output_stride_bytes ), split_info->vertical_buffer, split_info->ring_buffer_first_scanline  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
6331
6332  // mark it as empty
6333  ring_buffer_entry[ 0 ] = STBIR__FLOAT_EMPTY_MARKER;
6334
6335  // advance the first scanline
6336  split_info->ring_buffer_first_scanline++;
6337  if ( ++split_info->ring_buffer_begin_index == stbir_info->ring_buffer_num_entries )
6338    split_info->ring_buffer_begin_index = 0;
6339}
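
/* Illustrative sketch, not part of the library: the two eviction helpers above share one
   pattern, which is to write the oldest ring buffer row to its output row, stamp the slot
   with the empty marker, and advance with wrap-around. The sketch_ names, the fixed ring
   size and the float output buffer are assumptions made for this example only; the real
   code encodes through stbir__encode_scanline into the caller's pixel format.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_EMPTY_MARKER 3.0e+38F   // same sentinel value as STBIR__FLOAT_EMPTY_MARKER
#define SKETCH_ENTRIES 3               // hypothetical number of ring buffer slots
#define SKETCH_WIDTH   4               // hypothetical floats per scanline

typedef struct sketch_ring
{
  float rows[SKETCH_ENTRIES][SKETCH_WIDTH];
  int begin_index;      // slot that holds the oldest scanline
  int first_scanline;   // output y of the oldest scanline
} sketch_ring;

// Copy the oldest scanline to its output row, mark the slot empty, then advance.
static void sketch_evict_first( sketch_ring * ring, float * output, size_t output_stride_floats )
{
  float * entry;
  float * dest;
  int x;

  assert( ( ring->begin_index >= 0 ) && ( ring->begin_index < SKETCH_ENTRIES ) );

  entry = ring->rows[ ring->begin_index ];
  dest = output + (size_t)ring->first_scanline * output_stride_floats;

  for( x = 0 ; x < SKETCH_WIDTH ; x++ )
    dest[ x ] = entry[ x ];

  // the next scatter into this slot has to overwrite ("set") rather than accumulate ("blend")
  entry[ 0 ] = SKETCH_EMPTY_MARKER;

  ring->first_scanline++;
  if ( ++ring->begin_index == SKETCH_ENTRIES )
    ring->begin_index = 0;   // wrap around to the first slot
}
```

   Only the first float of a row carries the marker, so testing a row for emptiness is a
   single compare, which is what STBIR__FLOAT_BUFFER_IS_EMPTY does.
*/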
6340
6341static void stbir__resample_vertical_scatter(stbir__info const * stbir_info, stbir__per_split_info* split_info, int n0, int n1, float const * vertical_coefficients, float const * vertical_buffer, float const * vertical_buffer_end )
6342{
6343  STBIR_ASSERT( !stbir_info->vertical.is_gather );
6344
6345  STBIR_PROFILE_START( vertical );
6346  {
6347    int k = 0, total = n1 - n0 + 1;
6348    STBIR_ASSERT( total > 0 );
6349    do {
6350      float * outputs[8];
6351      int i, n = total; if ( n > 8 ) n = 8;
6352      for( i = 0 ; i < n ; i++ )
6353      {
6354        outputs[ i ] = stbir__get_ring_buffer_scanline(stbir_info, split_info, k+i+n0 );
6355        if ( ( i ) && ( STBIR__FLOAT_BUFFER_IS_EMPTY( outputs[i] ) != STBIR__FLOAT_BUFFER_IS_EMPTY( outputs[0] ) ) ) // make sure runs are of the same type
6356        {
6357          n = i;
6358          break;
6359        }
6360      }
      // call the function that scatters to N scanlines at a time (up to 8 scanlines of scattering at once)
6362      ((STBIR__FLOAT_BUFFER_IS_EMPTY( outputs[0] ))?stbir__vertical_scatter_sets:stbir__vertical_scatter_blends)[n-1]( outputs, vertical_coefficients + k, vertical_buffer, vertical_buffer_end );
6363      k += n;
6364      total -= n;
6365    } while ( total );
6366  }
6367
6368  STBIR_PROFILE_END( vertical );
6369}
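
/* Illustrative sketch, not part of the library: stbir__resample_vertical_scatter walks its
   output rows in runs of at most 8 rows that are either all empty (use a "set" function)
   or all non-empty (use a "blend" function). The sketch below reproduces only that run
   splitting on an array of flags; the sketch_ names and the int flag array are assumptions
   made for this example.

```c
#include <assert.h>

// Length of the next run starting at `start`: at most 8 rows, all sharing the same flag.
static int sketch_next_run( int const * is_empty, int start, int total )
{
  int n = total - start, i;
  assert( n > 0 );
  if ( n > 8 ) n = 8;
  for( i = 1 ; i < n ; i++ )
    if ( is_empty[ start + i ] != is_empty[ start ] )
      return i;   // the flag changed, so the run ends just before row start+i
  return n;
}

// Walk the whole range and count how many rows would be "set" versus "blended".
static void sketch_count_rows( int const * is_empty, int total, int * set_rows, int * blend_rows )
{
  int k = 0;
  *set_rows = 0;
  *blend_rows = 0;
  while( k < total )
  {
    int n = sketch_next_run( is_empty, k, total );
    if ( is_empty[ k ] )
      *set_rows += n;
    else
      *blend_rows += n;
    k += n;
  }
}
```

   With flags { 0, 0, 0, 1, 1 } the first run is 3 blended rows and the second is 2 set rows.
*/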
6370
6371typedef void stbir__handle_scanline_for_scatter_func(stbir__info const * stbir_info, stbir__per_split_info* split_info);
6372
6373static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir__per_split_info* split_info, int split_count )
6374{
6375  int y, start_output_y, end_output_y, start_input_y, end_input_y;
6376  stbir__contributors* vertical_contributors = stbir_info->vertical.contributors;
6377  float const * vertical_coefficients = stbir_info->vertical.coefficients;
6378  stbir__handle_scanline_for_scatter_func * handle_scanline_for_scatter;
6379  void * scanline_scatter_buffer;
6380  void * scanline_scatter_buffer_end;
6381  int on_first_input_y, last_input_y;
6382
6383  STBIR_ASSERT( !stbir_info->vertical.is_gather );
6384
6385  start_output_y = split_info->start_output_y;
6386  end_output_y = split_info[split_count-1].end_output_y;  // may do multiple split counts
6387
6388  start_input_y = split_info->start_input_y;
6389  end_input_y = split_info[split_count-1].end_input_y;
6390
6391  // adjust for starting offset start_input_y
6392  y = start_input_y + stbir_info->vertical.filter_pixel_margin;
6393  vertical_contributors += y ;
6394  vertical_coefficients += stbir_info->vertical.coefficient_width * y;
6395
6396  if ( stbir_info->vertical_first )
6397  {
6398    handle_scanline_for_scatter = stbir__horizontal_resample_and_encode_first_scanline_from_scatter;
6399    scanline_scatter_buffer = split_info->decode_buffer;
6400    scanline_scatter_buffer_end = ( (char*) scanline_scatter_buffer ) + sizeof( float ) * stbir_info->effective_channels * (stbir_info->scanline_extents.conservative.n1-stbir_info->scanline_extents.conservative.n0+1);
6401  }
6402  else
6403  {
6404    handle_scanline_for_scatter = stbir__encode_first_scanline_from_scatter;
6405    scanline_scatter_buffer = split_info->vertical_buffer;
6406    scanline_scatter_buffer_end = ( (char*) scanline_scatter_buffer ) + sizeof( float ) * stbir_info->effective_channels * stbir_info->horizontal.scale_info.output_sub_size;
6407  }
6408
6409  // initialize the ring buffer for scattering
6410  split_info->ring_buffer_first_scanline = start_output_y;
6411  split_info->ring_buffer_last_scanline = -1;
6412  split_info->ring_buffer_begin_index = -1;
6413
6414  // mark all the buffers as empty to start
6415  for( y = 0 ; y < stbir_info->ring_buffer_num_entries ; y++ )
6416    stbir__get_ring_buffer_entry( stbir_info, split_info, y )[0] = STBIR__FLOAT_EMPTY_MARKER; // only used on scatter
6417
6418  // do the loop in input space
6419  on_first_input_y = 1; last_input_y = start_input_y;
6420  for (y = start_input_y ; y < end_input_y; y++)
6421  {
6422    int out_first_scanline, out_last_scanline;
6423
6424    out_first_scanline = vertical_contributors->n0;
6425    out_last_scanline = vertical_contributors->n1;
6426
6427    STBIR_ASSERT(out_last_scanline - out_first_scanline + 1 <= stbir_info->ring_buffer_num_entries);
6428
6429    if ( ( out_last_scanline >= out_first_scanline ) && ( ( ( out_first_scanline >= start_output_y ) && ( out_first_scanline < end_output_y ) ) || ( ( out_last_scanline >= start_output_y ) && ( out_last_scanline < end_output_y ) ) ) )
6430    {
6431      float const * vc = vertical_coefficients;
6432
6433      // keep track of the range actually seen for the next resize
6434      last_input_y = y;
6435      if ( ( on_first_input_y ) && ( y > start_input_y ) )
6436        split_info->start_input_y = y;
6437      on_first_input_y = 0;
6438
6439      // clip the region
6440      if ( out_first_scanline < start_output_y )
6441      {
6442        vc += start_output_y - out_first_scanline;
6443        out_first_scanline = start_output_y;
6444      }
6445
6446      if ( out_last_scanline >= end_output_y )
6447        out_last_scanline = end_output_y - 1;
6448
6449      // if very first scanline, init the index
6450      if (split_info->ring_buffer_begin_index < 0)
6451        split_info->ring_buffer_begin_index = out_first_scanline - start_output_y;
6452
6453      STBIR_ASSERT( split_info->ring_buffer_begin_index <= out_first_scanline );
6454
6455      // Decode the nth scanline from the source image into the decode buffer.
6456      stbir__decode_scanline( stbir_info, y, split_info->decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
6457
6458      // When horizontal first, we resample horizontally into the vertical buffer before we scatter it out
6459      if ( !stbir_info->vertical_first )
6460        stbir__resample_horizontal_gather( stbir_info, split_info->vertical_buffer, split_info->decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
6461
6462      // Now it's sitting in the buffer ready to be distributed into the ring buffers.
6463
      // evict from the ring buffer, if we are full
6465      if ( ( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) == stbir_info->ring_buffer_num_entries ) &&
6466           ( out_last_scanline > split_info->ring_buffer_last_scanline ) )
6467        handle_scanline_for_scatter( stbir_info, split_info );
6468
6469      // Now the horizontal buffer is ready to write to all ring buffer rows, so do it.
6470      stbir__resample_vertical_scatter(stbir_info, split_info, out_first_scanline, out_last_scanline, vc, (float*)scanline_scatter_buffer, (float*)scanline_scatter_buffer_end );
6471
6472      // update the end of the buffer
6473      if ( out_last_scanline > split_info->ring_buffer_last_scanline )
6474        split_info->ring_buffer_last_scanline = out_last_scanline;
6475    }
6476    ++vertical_contributors;
6477    vertical_coefficients += stbir_info->vertical.coefficient_width;
6478  }
6479
6480  // now evict the scanlines that are left over in the ring buffer
6481  while ( split_info->ring_buffer_first_scanline < end_output_y )
6482    handle_scanline_for_scatter(stbir_info, split_info);
6483
6484  // update the end_input_y if we do multiple resizes with the same data
6485  ++last_input_y;
6486  for( y = 0 ; y < split_count; y++ )
6487    if ( split_info[y].end_input_y > last_input_y )
6488      split_info[y].end_input_y = last_input_y;
6489}
6490
6491
6492static stbir__kernel_callback * stbir__builtin_kernels[] =   { 0, stbir__filter_trapezoid,  stbir__filter_triangle, stbir__filter_cubic, stbir__filter_catmullrom, stbir__filter_mitchell, stbir__filter_point };
6493static stbir__support_callback * stbir__builtin_supports[] = { 0, stbir__support_trapezoid, stbir__support_one,     stbir__support_two,  stbir__support_two,       stbir__support_two,     stbir__support_zeropoint5 };
6494
6495static void stbir__set_sampler(stbir__sampler * samp, stbir_filter filter, stbir__kernel_callback * kernel, stbir__support_callback * support, stbir_edge edge, stbir__scale_info * scale_info, int always_gather, void * user_data )
6496{
6497  // set filter
6498  if (filter == 0)
6499  {
6500    filter = STBIR_DEFAULT_FILTER_DOWNSAMPLE; // default to downsample
6501    if (scale_info->scale >= ( 1.0f - stbir__small_float ) )
6502    {
6503      if ( (scale_info->scale <= ( 1.0f + stbir__small_float ) ) && ( STBIR_CEILF(scale_info->pixel_shift) == scale_info->pixel_shift ) )
6504        filter = STBIR_FILTER_POINT_SAMPLE;
6505      else
6506        filter = STBIR_DEFAULT_FILTER_UPSAMPLE;
6507    }
6508  }
6509  samp->filter_enum = filter;
6510
6511  STBIR_ASSERT(samp->filter_enum != 0);
6512  STBIR_ASSERT((unsigned)samp->filter_enum < STBIR_FILTER_OTHER);
6513  samp->filter_kernel = stbir__builtin_kernels[ filter ];
6514  samp->filter_support = stbir__builtin_supports[ filter ];
6515
6516  if ( kernel && support )
6517  {
6518    samp->filter_kernel = kernel;
6519    samp->filter_support = support;
6520    samp->filter_enum = STBIR_FILTER_OTHER;
6521  }
6522
6523  samp->edge = edge;
6524  samp->filter_pixel_width  = stbir__get_filter_pixel_width (samp->filter_support, scale_info->scale, user_data );
  // Gather is always better, but in extreme downsamples, you have to have most or all of the data in memory
6526  //    For horizontal, we always have all the pixels, so we always use gather here (always_gather==1).
6527  //    For vertical, we use gather if scaling up (which means we will have samp->filter_pixel_width
6528  //    scanlines in memory at once).
6529  samp->is_gather = 0;
6530  if ( scale_info->scale >= ( 1.0f - stbir__small_float ) )
6531    samp->is_gather = 1;
6532  else if ( ( always_gather ) || ( samp->filter_pixel_width <= STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT ) )
6533    samp->is_gather = 2;
6534
6535  // pre calculate stuff based on the above
6536  samp->coefficient_width = stbir__get_coefficient_width(samp, samp->is_gather, user_data);
6537
  // filter_pixel_width is the conservative size in pixels of input that affects an output pixel.
  //   In rare cases (only with 2 pix to 1 pix with the default filters), it's possible that the
  //   filter will extend before or after the scanline beyond just one extra entire copy of the
  //   scanline (we would hit the edge twice). We don't let you do that, so we clamp the total
  //   width to 3x the total input pixels (once for the scanline, once for the left side
  //   overhang, and once for the right side). We only do this for wrap edge mode, since the
  //   other modes can just re-edge clamp back in again.
6545  if ( edge == STBIR_EDGE_WRAP )
6546    if ( samp->filter_pixel_width > ( scale_info->input_full_size * 3 ) )
6547      samp->filter_pixel_width = scale_info->input_full_size * 3;
6548
6549  // This is how much to expand buffers to account for filters seeking outside
6550  // the image boundaries.
6551  samp->filter_pixel_margin = samp->filter_pixel_width / 2;
6552  
6553  // filter_pixel_margin is the amount that this filter can overhang on just one side of either 
6554  //   end of the scanline (left or the right). Since we only allow you to overhang 1 scanline's 
6555  //   worth of pixels, we clamp this one side of overhang to the input scanline size. Again, 
6556  //   this clamping only happens in rare cases with the default filters (2 pix to 1 pix). 
6557  if ( edge == STBIR_EDGE_WRAP )
6558    if ( samp->filter_pixel_margin > scale_info->input_full_size )
6559      samp->filter_pixel_margin = scale_info->input_full_size;
6560
6561  samp->num_contributors = stbir__get_contributors(samp, samp->is_gather);
6562
6563  samp->contributors_size = samp->num_contributors * sizeof(stbir__contributors);
6564  samp->coefficients_size = samp->num_contributors * samp->coefficient_width * sizeof(float) + sizeof(float); // extra sizeof(float) is padding
6565
6566  samp->gather_prescatter_contributors = 0;
6567  samp->gather_prescatter_coefficients = 0;
6568  if ( samp->is_gather == 0 )
6569  {
6570    samp->gather_prescatter_coefficient_width = samp->filter_pixel_width;
6571    samp->gather_prescatter_num_contributors  = stbir__get_contributors(samp, 2);
6572    samp->gather_prescatter_contributors_size = samp->gather_prescatter_num_contributors * sizeof(stbir__contributors);
6573    samp->gather_prescatter_coefficients_size = samp->gather_prescatter_num_contributors * samp->gather_prescatter_coefficient_width * sizeof(float);
6574  }
6575}
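
/* Illustrative sketch, not part of the library: the is_gather decision in stbir__set_sampler
   reduces to three outcomes. The function below mirrors the order of those checks;
   small_float and force_gather_amount are stand-in parameters for stbir__small_float and
   STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT, and the sketch_ name is an assumption made
   for this example.

```c
#include <assert.h>

// Returns 1 for an upsample gather, 2 for a downsample gather, and 0 for scatter.
static int sketch_pick_gather_mode( float scale, int filter_pixel_width, int always_gather, float small_float, int force_gather_amount )
{
  assert( scale > 0.0f );

  // scaling up (or 1:1 within tolerance): gather, keeping filter_pixel_width scanlines around
  if ( scale >= ( 1.0f - small_float ) )
    return 1;

  // scaling down: gather only when forced (horizontal) or when the filter is narrow enough
  if ( always_gather || ( filter_pixel_width <= force_gather_amount ) )
    return 2;

  // extreme vertical downsample: scatter each input scanline into the ring buffer instead
  return 0;
}
```

   For example, with a hypothetical threshold of 32 scanlines, a 0.5x downsample with an
   8 pixel wide filter still gathers (mode 2), while a 0.01x downsample with a 400 pixel
   wide filter falls back to scatter (mode 0) unless always_gather is set.
*/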
6576
6577static void stbir__get_conservative_extents( stbir__sampler * samp, stbir__contributors * range, void * user_data )
6578{
6579  float scale = samp->scale_info.scale;
6580  float out_shift = samp->scale_info.pixel_shift;
6581  stbir__support_callback * support = samp->filter_support;
6582  int input_full_size = samp->scale_info.input_full_size;
6583  stbir_edge edge = samp->edge;
6584  float inv_scale = samp->scale_info.inv_scale;
6585
6586  STBIR_ASSERT( samp->is_gather != 0 );
6587
6588  if ( samp->is_gather == 1 )
6589  {
6590    int in_first_pixel, in_last_pixel;
6591    float out_filter_radius = support(inv_scale, user_data) * scale;
6592
6593    stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, 0.5, out_filter_radius, inv_scale, out_shift, input_full_size, edge );
6594    range->n0 = in_first_pixel;
6595    stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, ( (float)(samp->scale_info.output_sub_size-1) ) + 0.5f, out_filter_radius, inv_scale, out_shift, input_full_size, edge );
6596    range->n1 = in_last_pixel;
6597  }
6598  else if ( samp->is_gather == 2 ) // downsample gather, refine
6599  {
6600    float in_pixels_radius = support(scale, user_data) * inv_scale;
6601    int filter_pixel_margin = samp->filter_pixel_margin;
6602    int output_sub_size = samp->scale_info.output_sub_size;
6603    int input_end;
6604    int n;
6605    int in_first_pixel, in_last_pixel;
6606
6607    // get a conservative area of the input range
6608    stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, 0, 0, inv_scale, out_shift, input_full_size, edge );
6609    range->n0 = in_first_pixel;
6610    stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, (float)output_sub_size, 0, inv_scale, out_shift, input_full_size, edge );
6611    range->n1 = in_last_pixel;
6612
6613    // now go through the margin to the start of area to find bottom
6614    n = range->n0 + 1;
6615    input_end = -filter_pixel_margin;
6616    while( n >= input_end )
6617    {
6618      int out_first_pixel, out_last_pixel;
6619      stbir__calculate_out_pixel_range( &out_first_pixel, &out_last_pixel, ((float)n)+0.5f, in_pixels_radius, scale, out_shift, output_sub_size );
6620      if ( out_first_pixel > out_last_pixel )
6621        break;
6622
6623      if ( ( out_first_pixel < output_sub_size ) || ( out_last_pixel >= 0 ) )
6624        range->n0 = n;
6625      --n;
6626    }
6627
6628    // now go through the end of the area through the margin to find top
6629    n = range->n1 - 1;
6630    input_end = n + 1 + filter_pixel_margin;
6631    while( n <= input_end )
6632    {
6633      int out_first_pixel, out_last_pixel;
6634      stbir__calculate_out_pixel_range( &out_first_pixel, &out_last_pixel, ((float)n)+0.5f, in_pixels_radius, scale, out_shift, output_sub_size );
6635      if ( out_first_pixel > out_last_pixel )
6636        break;
6637      if ( ( out_first_pixel < output_sub_size ) || ( out_last_pixel >= 0 ) )
6638        range->n1 = n;
6639      ++n;
6640    }
6641  }
6642
6643  if ( samp->edge == STBIR_EDGE_WRAP )
6644  {
6645    // if we are wrapping, and we are very close to the image size (so the edges might merge), just use the scanline up to the edge
6646    if ( ( range->n0 > 0 ) && ( range->n1 >= input_full_size ) )
6647    {
6648      int marg = range->n1 - input_full_size + 1;
6649      if ( ( marg + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= range->n0 )
6650        range->n0 = 0;
6651    }
6652    if ( ( range->n0 < 0 ) && ( range->n1 < (input_full_size-1) ) )
6653    {
6654      int marg = -range->n0;
6655      if ( ( input_full_size - marg - STBIR__MERGE_RUNS_PIXEL_THRESHOLD - 1 ) <= range->n1 )
6656        range->n1 = input_full_size - 1;
6657    }
6658  }
6659  else
6660  {
6661    // for non-edge-wrap modes, we never read over the edge, so clamp
6662    if ( range->n0 < 0 )
6663      range->n0 = 0;
6664    if ( range->n1 >= input_full_size )
6665      range->n1 = input_full_size - 1;
6666  }
6667}
6668
6669static void stbir__get_split_info( stbir__per_split_info* split_info, int splits, int output_height, int vertical_pixel_margin, int input_full_height )
6670{
6671  int i, cur;
6672  int left = output_height;
6673
6674  cur = 0;
6675  for( i = 0 ; i < splits ; i++ )
6676  {
6677    int each;
6678    split_info[i].start_output_y = cur;
6679    each = left / ( splits - i );
6680    split_info[i].end_output_y = cur + each;
6681    cur += each;
6682    left -= each;
6683
6684    // scatter range (updated to minimum as you run it)
6685    split_info[i].start_input_y = -vertical_pixel_margin;
6686    split_info[i].end_input_y = input_full_height + vertical_pixel_margin;
6687  }
6688}
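
/* Illustrative sketch, not part of the library: the loop in stbir__get_split_info divides the
   remaining rows by the remaining number of splits on every iteration, so any remainder
   ends up in the later splits. The function below reproduces that arithmetic on plain
   arrays and also returns the largest split, which is the same quantity that
   stbir__get_max_split computes. The sketch_ name and the start/end arrays are assumptions
   made for this example.

```c
#include <assert.h>

// Fill start[i] and end[i] (end is exclusive) for each split; return the tallest split.
static int sketch_split_rows( int output_height, int splits, int * start, int * end )
{
  int i, cur = 0, left = output_height, max = 0;

  assert( splits > 0 );

  for( i = 0 ; i < splits ; i++ )
  {
    int each = left / ( splits - i );   // integer division over the splits still to assign
    start[ i ] = cur;
    end[ i ] = cur + each;
    if ( each > max )
      max = each;
    cur += each;
    left -= each;
  }
  return max;
}
```

   For 10 output rows and 3 splits this gives the ranges [0,3), [3,6) and [6,10), with a
   largest split of 4 rows.
*/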
6689
6690static void stbir__free_internal_mem( stbir__info *info )
6691{
6692  #define STBIR__FREE_AND_CLEAR( ptr ) { if ( ptr ) { void * p = (ptr); (ptr) = 0; STBIR_FREE( p, info->user_data); } }
6693
6694  if ( info )
6695  {
6696  #ifndef STBIR__SEPARATE_ALLOCATIONS
6697    STBIR__FREE_AND_CLEAR( info->alloced_mem );
6698  #else
6699    int i,j;
6700
6701    if ( ( info->vertical.gather_prescatter_contributors ) && ( (void*)info->vertical.gather_prescatter_contributors != (void*)info->split_info[0].decode_buffer ) )
6702    {
6703      STBIR__FREE_AND_CLEAR( info->vertical.gather_prescatter_coefficients );
6704      STBIR__FREE_AND_CLEAR( info->vertical.gather_prescatter_contributors );
6705    }
6706    for( i = 0 ; i < info->splits ; i++ )
6707    {
6708      for( j = 0 ; j < info->alloc_ring_buffer_num_entries ; j++ )
6709      {
6710        #ifdef STBIR_SIMD8
6711        if ( info->effective_channels == 3 )
6712          --info->split_info[i].ring_buffers[j]; // avx in 3 channel mode needs one float at the start of the buffer
6713        #endif
6714        STBIR__FREE_AND_CLEAR( info->split_info[i].ring_buffers[j] );
6715      }
6716
6717      #ifdef STBIR_SIMD8
6718      if ( info->effective_channels == 3 )
6719        --info->split_info[i].decode_buffer; // avx in 3 channel mode needs one float at the start of the buffer
6720      #endif
6721      STBIR__FREE_AND_CLEAR( info->split_info[i].decode_buffer );
6722      STBIR__FREE_AND_CLEAR( info->split_info[i].ring_buffers );
6723      STBIR__FREE_AND_CLEAR( info->split_info[i].vertical_buffer );
6724    }
6725    STBIR__FREE_AND_CLEAR( info->split_info );
6726    if ( info->vertical.coefficients != info->horizontal.coefficients )
6727    {
6728      STBIR__FREE_AND_CLEAR( info->vertical.coefficients );
6729      STBIR__FREE_AND_CLEAR( info->vertical.contributors );
6730    }
6731    STBIR__FREE_AND_CLEAR( info->horizontal.coefficients );
6732    STBIR__FREE_AND_CLEAR( info->horizontal.contributors );
6733    STBIR__FREE_AND_CLEAR( info->alloced_mem );
6734    STBIR_FREE( info, info->user_data );
6735  #endif
6736  }
6737
6738  #undef STBIR__FREE_AND_CLEAR
6739}
6740
6741static int stbir__get_max_split( int splits, int height )
6742{
6743  int i;
6744  int max = 0;
6745
6746  for( i = 0 ; i < splits ; i++ )
6747  {
6748    int each = height / ( splits - i );
6749    if ( each > max )
6750      max = each;
6751    height -= each;
6752  }
6753  return max;
6754}
6755
6756static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_n_coeffs_funcs[8] =
6757{
6758  0, stbir__horizontal_gather_1_channels_with_n_coeffs_funcs, stbir__horizontal_gather_2_channels_with_n_coeffs_funcs, stbir__horizontal_gather_3_channels_with_n_coeffs_funcs, stbir__horizontal_gather_4_channels_with_n_coeffs_funcs, 0,0, stbir__horizontal_gather_7_channels_with_n_coeffs_funcs
6759};
6760
6761static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_channels_funcs[8] =
6762{
6763  0, stbir__horizontal_gather_1_channels_funcs, stbir__horizontal_gather_2_channels_funcs, stbir__horizontal_gather_3_channels_funcs, stbir__horizontal_gather_4_channels_funcs, 0,0, stbir__horizontal_gather_7_channels_funcs
6764};
6765
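// the weights table has eight resize classification rows. As assigned by stbir__should_do_vertical_first:
//   0 == vertical scatter, 1 == vertical gather <= 1x scale, 2 == vertical gather 1x-2x scale,
//   3 == vertical gather 2x-3x scale, 5 == vertical gather 3x-4x scale, 6 == vertical gather > 4x scale.
//   When either output dimension is <= 4 pixels, 6 is used if the output height is less than the
//   output width, and 7 otherwise. Row 4 is not selected by the current bucketing code.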
6767#define STBIR_RESIZE_CLASSIFICATIONS 8
6768
6769static float stbir__compute_weights[5][STBIR_RESIZE_CLASSIFICATIONS][4]=  // 5 = 0=1chan, 1=2chan, 2=3chan, 3=4chan, 4=7chan
6770{
6771  {
6772    { 1.00000f, 1.00000f, 0.31250f, 1.00000f },
6773    { 0.56250f, 0.59375f, 0.00000f, 0.96875f },
6774    { 1.00000f, 0.06250f, 0.00000f, 1.00000f },
6775    { 0.00000f, 0.09375f, 1.00000f, 1.00000f },
6776    { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
6777    { 0.03125f, 0.12500f, 1.00000f, 1.00000f },
6778    { 0.06250f, 0.12500f, 0.00000f, 1.00000f },
6779    { 0.00000f, 1.00000f, 0.00000f, 0.03125f },
6780  }, {
6781    { 0.00000f, 0.84375f, 0.00000f, 0.03125f },
6782    { 0.09375f, 0.93750f, 0.00000f, 0.78125f },
6783    { 0.87500f, 0.21875f, 0.00000f, 0.96875f },
6784    { 0.09375f, 0.09375f, 1.00000f, 1.00000f },
6785    { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
6786    { 0.03125f, 0.12500f, 1.00000f, 1.00000f },
6787    { 0.06250f, 0.12500f, 0.00000f, 1.00000f },
6788    { 0.00000f, 1.00000f, 0.00000f, 0.53125f },
6789  }, {
6790    { 0.00000f, 0.53125f, 0.00000f, 0.03125f },
6791    { 0.06250f, 0.96875f, 0.00000f, 0.53125f },
6792    { 0.87500f, 0.18750f, 0.00000f, 0.93750f },
6793    { 0.00000f, 0.09375f, 1.00000f, 1.00000f },
6794    { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
6795    { 0.03125f, 0.12500f, 1.00000f, 1.00000f },
6796    { 0.06250f, 0.12500f, 0.00000f, 1.00000f },
6797    { 0.00000f, 1.00000f, 0.00000f, 0.56250f },
6798  }, {
6799    { 0.00000f, 0.50000f, 0.00000f, 0.71875f },
6800    { 0.06250f, 0.84375f, 0.00000f, 0.87500f },
6801    { 1.00000f, 0.50000f, 0.50000f, 0.96875f },
6802    { 1.00000f, 0.09375f, 0.31250f, 0.50000f },
6803    { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
6804    { 1.00000f, 0.03125f, 0.03125f, 0.53125f },
6805    { 0.18750f, 0.12500f, 0.00000f, 1.00000f },
6806    { 0.00000f, 1.00000f, 0.03125f, 0.18750f },
6807  }, {
6808    { 0.00000f, 0.59375f, 0.00000f, 0.96875f },
6809    { 0.06250f, 0.81250f, 0.06250f, 0.59375f },
6810    { 0.75000f, 0.43750f, 0.12500f, 0.96875f },
6811    { 0.87500f, 0.06250f, 0.18750f, 0.43750f },
6812    { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
6813    { 0.15625f, 0.12500f, 1.00000f, 1.00000f },
6814    { 0.06250f, 0.12500f, 0.00000f, 1.00000f },
6815    { 0.00000f, 1.00000f, 0.03125f, 0.34375f },
6816  }
6817};
6818
// structure that allows us to query and override info for training the costs
6820typedef struct STBIR__V_FIRST_INFO
6821{
6822  double v_cost, h_cost;
6823  int control_v_first; // 0 = no control, 1 = force hori, 2 = force vert
6824  int v_first;
6825  int v_resize_classification;
6826  int is_gather;
6827} STBIR__V_FIRST_INFO;
6828
6829#ifdef STBIR__V_FIRST_INFO_BUFFER
6830static STBIR__V_FIRST_INFO STBIR__V_FIRST_INFO_BUFFER = {0};
6831#define STBIR__V_FIRST_INFO_POINTER &STBIR__V_FIRST_INFO_BUFFER
6832#else
6833#define STBIR__V_FIRST_INFO_POINTER 0
6834#endif
6835
// Figure out whether to scale along the horizontal or vertical first.
//   This is only *super* important when you are scaling by a massively
//   different amount in the vertical vs the horizontal (for example, if
//   you are scaling by 2x in the width, and 0.5x in the height, then you
//   want to do the vertical scale first, because it's around 3x faster
//   in that order).
//
//   In more normal circumstances, this makes a 20-40% difference, so
//     it's good to get right, but not critical. The normal way that you
//     decide which direction goes first is just figuring out which
//     direction does more multiplies. But modern CPUs, with their
//     fancy caches and SIMD and high IPC abilities, mean there's just
//     a lot more that goes into it.
//
//   My handwavy sort of solution is to have an app that does a whole
//     bunch of timing for both vertical and horizontal first modes,
//     and then another app that can read lots of these timing files
//     and try to search for the best weights to use. dotimings.c
//     is the app that does a bunch of timings, and vf_train.c is the
//     app that solves for the best weights (and shows how well it
//     does currently).
6857
6858static int stbir__should_do_vertical_first( float weights_table[STBIR_RESIZE_CLASSIFICATIONS][4], int horizontal_filter_pixel_width, float horizontal_scale, int horizontal_output_size, int vertical_filter_pixel_width, float vertical_scale, int vertical_output_size, int is_gather, STBIR__V_FIRST_INFO * info )
6859{
6860  double v_cost, h_cost;
6861  float * weights;
6862  int vertical_first;
6863  int v_classification;
6864
6865  // categorize the resize into buckets
6866  if ( ( vertical_output_size <= 4 ) || ( horizontal_output_size <= 4 ) )
6867    v_classification = ( vertical_output_size < horizontal_output_size ) ? 6 : 7;
6868  else if ( vertical_scale <= 1.0f )
6869    v_classification = ( is_gather ) ? 1 : 0;
6870  else if ( vertical_scale <= 2.0f)
6871    v_classification = 2;
6872  else if ( vertical_scale <= 3.0f)
6873    v_classification = 3;
6874  else if ( vertical_scale <= 4.0f)
6875    v_classification = 5;
6876  else
6877    v_classification = 6;
6878
6879  // use the right weights
6880  weights = weights_table[ v_classification ];
6881
  // these are the costs when you don't take into account modern CPUs with high IPC and SIMD and caches - wish we had a better estimate
6883  h_cost = (float)horizontal_filter_pixel_width * weights[0] + horizontal_scale * (float)vertical_filter_pixel_width * weights[1];
6884  v_cost = (float)vertical_filter_pixel_width  * weights[2] + vertical_scale * (float)horizontal_filter_pixel_width * weights[3];
6885
6886  // use computation estimate to decide vertical first or not
6887  vertical_first = ( v_cost <= h_cost ) ? 1 : 0;
6888
6889  // save these, if requested
6890  if ( info )
6891  {
6892    info->h_cost = h_cost;
6893    info->v_cost = v_cost;
6894    info->v_resize_classification = v_classification;
6895    info->v_first = vertical_first;
6896    info->is_gather = is_gather;
6897  }
6898
  // and this allows us to override everything for testing (see dotimings.c)
6900  if ( ( info ) && ( info->control_v_first ) )
6901    vertical_first = ( info->control_v_first == 2 ) ? 1 : 0;
6902
6903  return vertical_first;
6904}
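
/* Illustrative sketch, not part of the library: the decision above has two steps, bucketing
   the resize into a classification row and then comparing two weighted cost estimates. The
   functions below repeat both steps in isolation. The sketch_ names are assumptions made
   for this example, and the all-ones weights and the filter widths of 4 and 8 pixels used
   in the note that follows are hypothetical values, not entries from stbir__compute_weights.

```c
#include <assert.h>

// Bucket a resize with the same if/else chain used by stbir__should_do_vertical_first.
static int sketch_classify( int v_out, int h_out, float v_scale, int is_gather )
{
  if ( ( v_out <= 4 ) || ( h_out <= 4 ) )
    return ( v_out < h_out ) ? 6 : 7;
  if ( v_scale <= 1.0f ) return is_gather ? 1 : 0;
  if ( v_scale <= 2.0f ) return 2;
  if ( v_scale <= 3.0f ) return 3;
  if ( v_scale <= 4.0f ) return 5;
  return 6;
}

// Evaluate the same two cost expressions for one row of four weights; 1 means vertical first.
static int sketch_vertical_first( float const weights[4], int h_filter_width, float h_scale, int v_filter_width, float v_scale )
{
  double h_cost, v_cost;

  assert( ( h_filter_width > 0 ) && ( v_filter_width > 0 ) );

  h_cost = (float)h_filter_width * weights[0] + h_scale * (float)v_filter_width * weights[1];
  v_cost = (float)v_filter_width * weights[2] + v_scale * (float)h_filter_width * weights[3];

  return ( v_cost <= h_cost ) ? 1 : 0;
}
```

   With all four weights at 1.0, a 2x horizontal upsample (width 4) combined with a 0.5x
   vertical downsample (width 8) gives h_cost = 4 + 2*8 = 20 and v_cost = 8 + 0.5*4 = 10,
   so vertical goes first. Swapping the two axes gives 10 and 20, so horizontal goes first.
*/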
6905
6906// layout lookups - must match stbir_internal_pixel_layout
6907static unsigned char stbir__pixel_channels[] = {
6908  1,2,3,3,4,   // 1ch, 2ch, rgb, bgr, 4ch
6909  4,4,4,4,2,2, // RGBA,BGRA,ARGB,ABGR,RA,AR
6910  4,4,4,4,2,2, // RGBA_PM,BGRA_PM,ARGB_PM,ABGR_PM,RA_PM,AR_PM
6911};
6912
6913// the internal pixel layout enums are in a different order, so we can easily do range comparisons of types
6914//   the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible
6915static stbir_internal_pixel_layout stbir__pixel_layout_convert_public_to_internal[] = {
6916  STBIRI_BGR, STBIRI_1CHANNEL, STBIRI_2CHANNEL, STBIRI_RGB, STBIRI_RGBA,
6917  STBIRI_4CHANNEL, STBIRI_BGRA, STBIRI_ARGB, STBIRI_ABGR, STBIRI_RA, STBIRI_AR,
6918  STBIRI_RGBA_PM, STBIRI_BGRA_PM, STBIRI_ARGB_PM, STBIRI_ABGR_PM, STBIRI_RA_PM, STBIRI_AR_PM,
6919};
6920
6921static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sampler * horizontal, stbir__sampler * vertical, stbir__contributors * conservative, stbir_pixel_layout input_pixel_layout_public, stbir_pixel_layout output_pixel_layout_public, int splits, int new_x, int new_y, int fast_alpha, void * user_data STBIR_ONLY_PROFILE_BUILD_GET_INFO )
6922{
6923  static char stbir_channel_count_index[8]={ 9,0,1,2, 3,9,9,4 };
6924
6925  stbir__info * info = 0;
6926  void * alloced = 0;
6927  size_t alloced_total = 0;
6928  int vertical_first;
6929  int decode_buffer_size, ring_buffer_length_bytes, ring_buffer_size, vertical_buffer_size, alloc_ring_buffer_num_entries;
6930
  int alpha_weighting_type = 0; // 0=none, 1=non-premult in/premult out, 2=fancy, 3=premult in/non-premult out, 4=fast alpha
6932  int conservative_split_output_size = stbir__get_max_split( splits, vertical->scale_info.output_sub_size );
6933  stbir_internal_pixel_layout input_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ input_pixel_layout_public ];
6934  stbir_internal_pixel_layout output_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ output_pixel_layout_public ];
6935  int channels = stbir__pixel_channels[ input_pixel_layout ];
6936  int effective_channels = channels;
6937
6938  // first figure out what type of alpha weighting to use (if any)
6939  if ( ( horizontal->filter_enum != STBIR_FILTER_POINT_SAMPLE ) || ( vertical->filter_enum != STBIR_FILTER_POINT_SAMPLE ) ) // no alpha weighting on point sampling
6940  {
6941    if ( ( input_pixel_layout >= STBIRI_RGBA ) && ( input_pixel_layout <= STBIRI_AR ) && ( output_pixel_layout >= STBIRI_RGBA ) && ( output_pixel_layout <= STBIRI_AR ) )
6942    {
6943      if ( fast_alpha )
6944      {
6945        alpha_weighting_type = 4;
6946      }
6947      else
6948      {
6949        static int fancy_alpha_effective_cnts[6] = { 7, 7, 7, 7, 3, 3 };
6950        alpha_weighting_type = 2;
6951        effective_channels = fancy_alpha_effective_cnts[ input_pixel_layout - STBIRI_RGBA ];
6952      }
6953    }
6954    else if ( ( input_pixel_layout >= STBIRI_RGBA_PM ) && ( input_pixel_layout <= STBIRI_AR_PM ) && ( output_pixel_layout >= STBIRI_RGBA ) && ( output_pixel_layout <= STBIRI_AR ) )
6955    {
6956      // input premult, output non-premult
6957      alpha_weighting_type = 3;
6958    }
6959    else if ( ( input_pixel_layout >= STBIRI_RGBA ) && ( input_pixel_layout <= STBIRI_AR ) && ( output_pixel_layout >= STBIRI_RGBA_PM ) && ( output_pixel_layout <= STBIRI_AR_PM ) )
6960    {
6961      // input non-premult, output premult
6962      alpha_weighting_type = 1;
6963    }
6964  }
6965
6966  // channel in and out count must match currently
6967  if ( channels != stbir__pixel_channels[ output_pixel_layout ] )
6968    return 0;
6969
6970  // get vertical first
6971  vertical_first = stbir__should_do_vertical_first( stbir__compute_weights[ (int)stbir_channel_count_index[ effective_channels ] ], horizontal->filter_pixel_width, horizontal->scale_info.scale, horizontal->scale_info.output_sub_size, vertical->filter_pixel_width, vertical->scale_info.scale, vertical->scale_info.output_sub_size, vertical->is_gather, STBIR__V_FIRST_INFO_POINTER );
6972
  // some of the unrolled loops sometimes read one float past the end (with a zero coefficient, so it has no effect)
6974  decode_buffer_size = ( conservative->n1 - conservative->n0 + 1 ) * effective_channels * sizeof(float) + sizeof(float); // extra float for padding
6975
6976#if defined( STBIR__SEPARATE_ALLOCATIONS ) && defined(STBIR_SIMD8)
6977  if ( effective_channels == 3 )
6978    decode_buffer_size += sizeof(float); // avx in 3 channel mode needs one float at the start of the buffer (only with separate allocations)
6979#endif
6980
6981  ring_buffer_length_bytes = horizontal->scale_info.output_sub_size * effective_channels * sizeof(float) + sizeof(float); // extra float for padding
6982
6983  // if we do vertical first, the ring buffer holds a whole decoded line
6984  if ( vertical_first )
6985    ring_buffer_length_bytes = ( decode_buffer_size + 15 ) & ~15;
6986
6987  if ( ( ring_buffer_length_bytes & 4095 ) == 0 ) ring_buffer_length_bytes += 64*3; // avoid 4k alias
6988
6989  // One extra entry because floating point precision problems sometimes make an extra entry necessary.
6990  alloc_ring_buffer_num_entries = vertical->filter_pixel_width + 1;
6991
6992  // we never need more ring buffer entries than the scanlines we're outputting when in scatter mode
6993  if ( ( !vertical->is_gather ) && ( alloc_ring_buffer_num_entries > conservative_split_output_size ) )
6994    alloc_ring_buffer_num_entries = conservative_split_output_size;
6995
6996  ring_buffer_size = alloc_ring_buffer_num_entries * ring_buffer_length_bytes;
6997
6998  // The vertical buffer is used differently, depending on whether we are scattering
6999  //   the vertical scanlines, or gathering them.
7000  //   If scattering, it's used as the temp buffer to accumulate each output.
7001  //   If gathering, it's just the output buffer.
7002  vertical_buffer_size = horizontal->scale_info.output_sub_size * effective_channels * sizeof(float) + sizeof(float);  // extra float for padding
7003
7004  // we make two passes through this loop, 1st to add everything up, 2nd to allocate and init
7005  for(;;)
7006  {
7007    int i;
7008    void * advance_mem = alloced;
7009    int copy_horizontal = 0;
7010    stbir__sampler * possibly_use_horizontal_for_pivot = 0;
7011
7012#ifdef STBIR__SEPARATE_ALLOCATIONS
7013    #define STBIR__NEXT_PTR( ptr, size, ntype ) if ( alloced ) { void * p = STBIR_MALLOC( size, user_data); if ( p == 0 ) { stbir__free_internal_mem( info ); return 0; } (ptr) = (ntype*)p; }
7014#else
7015    #define STBIR__NEXT_PTR( ptr, size, ntype ) advance_mem = (void*) ( ( ((size_t)advance_mem) + 15 ) & ~15 ); if ( alloced ) ptr = (ntype*)advance_mem; advance_mem = ((char*)advance_mem) + (size);
7016#endif
7017
7018    STBIR__NEXT_PTR( info, sizeof( stbir__info ), stbir__info );
7019
7020    STBIR__NEXT_PTR( info->split_info, sizeof( stbir__per_split_info ) * splits, stbir__per_split_info );
7021
7022    if ( info )
7023    {
7024      static stbir__alpha_weight_func * fancy_alpha_weights[6]  =    { stbir__fancy_alpha_weight_4ch,   stbir__fancy_alpha_weight_4ch,   stbir__fancy_alpha_weight_4ch,   stbir__fancy_alpha_weight_4ch,   stbir__fancy_alpha_weight_2ch,   stbir__fancy_alpha_weight_2ch };
7025      static stbir__alpha_unweight_func * fancy_alpha_unweights[6] = { stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_2ch, stbir__fancy_alpha_unweight_2ch };
7026      static stbir__alpha_weight_func * simple_alpha_weights[6] = { stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_2ch, stbir__simple_alpha_weight_2ch };
7027      static stbir__alpha_unweight_func * simple_alpha_unweights[6] = { stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_2ch, stbir__simple_alpha_unweight_2ch };
7028
7029      // initialize info fields
7030      info->alloced_mem = alloced;
7031      info->alloced_total = alloced_total;
7032
7033      info->channels = channels;
7034      info->effective_channels = effective_channels;
7035
7036      info->offset_x = new_x;
7037      info->offset_y = new_y;
7038      info->alloc_ring_buffer_num_entries = alloc_ring_buffer_num_entries;
7039      info->ring_buffer_num_entries = 0;
7040      info->ring_buffer_length_bytes = ring_buffer_length_bytes;
7041      info->splits = splits;
7042      info->vertical_first = vertical_first;
7043
7044      info->input_pixel_layout_internal = input_pixel_layout;
7045      info->output_pixel_layout_internal = output_pixel_layout;
7046
7047      // setup alpha weight functions
7048      info->alpha_weight = 0;
7049      info->alpha_unweight = 0;
7050
7051      // handle alpha weighting functions and overrides
7052      if ( alpha_weighting_type == 2 )
7053      {
7054        // high quality alpha multiplying on the way in, dividing on the way out
7055        info->alpha_weight = fancy_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];
7056        info->alpha_unweight = fancy_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ];
7057      }
7058      else if ( alpha_weighting_type == 4 )
7059      {
7060        // fast alpha multiplying on the way in, dividing on the way out
7061        info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];
7062        info->alpha_unweight = simple_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ];
7063      }
7064      else if ( alpha_weighting_type == 1 )
7065      {
7066        // fast alpha on the way in, leave in premultiplied form on way out
7067        info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];
7068      }
7069      else if ( alpha_weighting_type == 3 )
7070      {
7071        // incoming is premultiplied, fast alpha dividing on the way out - non-premultiplied output
7072        info->alpha_unweight = simple_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ];
7073      }
7074
7075      // handle 3-chan color flipping, using the alpha weight path
7076      if ( ( ( input_pixel_layout == STBIRI_RGB ) && ( output_pixel_layout == STBIRI_BGR ) ) ||
7077           ( ( input_pixel_layout == STBIRI_BGR ) && ( output_pixel_layout == STBIRI_RGB ) ) )
7078      {
7079        // do the flipping on the smaller of the two ends
7080        if ( horizontal->scale_info.scale < 1.0f )
7081          info->alpha_unweight = stbir__simple_flip_3ch;
7082        else
7083          info->alpha_weight = stbir__simple_flip_3ch;
7084      }
7085
7086    }
7087
7088    // get all the per-split buffers
7089    for( i = 0 ; i < splits ; i++ )
7090    {
7091      STBIR__NEXT_PTR( info->split_info[i].decode_buffer, decode_buffer_size, float );
7092
7093#ifdef STBIR__SEPARATE_ALLOCATIONS
7094
7095      #ifdef STBIR_SIMD8
7096      if ( ( info ) && ( effective_channels == 3 ) )
7097        ++info->split_info[i].decode_buffer; // avx in 3 channel mode needs one float at the start of the buffer
7098      #endif
7099
7100      STBIR__NEXT_PTR( info->split_info[i].ring_buffers, alloc_ring_buffer_num_entries * sizeof(float*), float* );
7101      {
7102        int j;
7103        for( j = 0 ; j < alloc_ring_buffer_num_entries ; j++ )
7104        {
7105          STBIR__NEXT_PTR( info->split_info[i].ring_buffers[j], ring_buffer_length_bytes, float );
7106          #ifdef STBIR_SIMD8
7107          if ( ( info ) && ( effective_channels == 3 ) )
7108            ++info->split_info[i].ring_buffers[j]; // avx in 3 channel mode needs one float at the start of the buffer
7109          #endif
7110        }
7111      }
7112#else
7113      STBIR__NEXT_PTR( info->split_info[i].ring_buffer, ring_buffer_size, float );
7114#endif
7115      STBIR__NEXT_PTR( info->split_info[i].vertical_buffer, vertical_buffer_size, float );
7116    }
7117
7118    // alloc memory for to-be-pivoted coeffs (if necessary)
7119    if ( vertical->is_gather == 0 )
7120    {
7121      int both;
7122      int temp_mem_amt;
7123
7124      // when in vertical scatter mode, we first build the coefficients in gather mode, and then pivot after,
7125      //   that means we need two buffers, so we try to use the decode buffer and ring buffer for this. if that
7126      //   is too small, we just allocate extra memory to use as this temp.
7127
7128      both = vertical->gather_prescatter_contributors_size + vertical->gather_prescatter_coefficients_size;
7129
7130#ifdef STBIR__SEPARATE_ALLOCATIONS
7131      temp_mem_amt = decode_buffer_size;
7132
7133      #ifdef STBIR_SIMD8
7134      if ( effective_channels == 3 )
7135        temp_mem_amt -= (int)sizeof(float); // avx in 3 channel mode needs one float at the start of the buffer (sizes here are in bytes)
7136      #endif
7137#else
7138      temp_mem_amt = ( decode_buffer_size + ring_buffer_size + vertical_buffer_size ) * splits;
7139#endif
7140      if ( temp_mem_amt >= both )
7141      {
7142        if ( info )
7143        {
7144          vertical->gather_prescatter_contributors = (stbir__contributors*)info->split_info[0].decode_buffer;
7145          vertical->gather_prescatter_coefficients = (float*) ( ( (char*)info->split_info[0].decode_buffer ) + vertical->gather_prescatter_contributors_size );
7146        }
7147      }
7148      else
7149      {
7150        // ring+decode memory is too small, so allocate temp memory
7151        STBIR__NEXT_PTR( vertical->gather_prescatter_contributors, vertical->gather_prescatter_contributors_size, stbir__contributors );
7152        STBIR__NEXT_PTR( vertical->gather_prescatter_coefficients, vertical->gather_prescatter_coefficients_size, float );
7153      }
7154    }
7155
7156    STBIR__NEXT_PTR( horizontal->contributors, horizontal->contributors_size, stbir__contributors );
7157    STBIR__NEXT_PTR( horizontal->coefficients, horizontal->coefficients_size, float );
7158
7159    // are the two filters identical? (happens a lot with mipmap generation)
7160    if ( ( horizontal->filter_kernel == vertical->filter_kernel ) && ( horizontal->filter_support == vertical->filter_support ) && ( horizontal->edge == vertical->edge ) && ( horizontal->scale_info.output_sub_size == vertical->scale_info.output_sub_size ) )
7161    {
7162      float diff_scale = horizontal->scale_info.scale - vertical->scale_info.scale;
7163      float diff_shift = horizontal->scale_info.pixel_shift - vertical->scale_info.pixel_shift;
7164      if ( diff_scale < 0.0f ) diff_scale = -diff_scale;
7165      if ( diff_shift < 0.0f ) diff_shift = -diff_shift;
7166      if ( ( diff_scale <= stbir__small_float ) && ( diff_shift <= stbir__small_float ) )
7167      {
7168        if ( horizontal->is_gather == vertical->is_gather )
7169        {
7170          copy_horizontal = 1;
7171          goto no_vert_alloc;
7172        }
7173        // everything matches, but vertical is scatter, horizontal is gather, use horizontal coeffs for vertical pivot coeffs
7174        possibly_use_horizontal_for_pivot = horizontal;
7175      }
7176    }
7177
7178    STBIR__NEXT_PTR( vertical->contributors, vertical->contributors_size, stbir__contributors );
7179    STBIR__NEXT_PTR( vertical->coefficients, vertical->coefficients_size, float );
7180
7181   no_vert_alloc:
7182
7183    if ( info )
7184    {
7185      STBIR_PROFILE_BUILD_START( horizontal );
7186
7187      stbir__calculate_filters( horizontal, 0, user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO );
7188
7189      // setup the horizontal gather functions
7190      // start with defaulting to the n_coeffs functions (specialized on channels and remnant leftover)
7191      info->horizontal_gather_channels = stbir__horizontal_gather_n_coeffs_funcs[ effective_channels ][ horizontal->extent_info.widest & 3 ];
7192      // but if the number of coeffs <= 12, use another set of special cases. <=12 coeffs is any enlarging resize, or shrinking resize down to about 1/3 size
7193      if ( horizontal->extent_info.widest <= 12 )
7194        info->horizontal_gather_channels = stbir__horizontal_gather_channels_funcs[ effective_channels ][ horizontal->extent_info.widest - 1 ];
7195
7196      info->scanline_extents.conservative.n0 = conservative->n0;
7197      info->scanline_extents.conservative.n1 = conservative->n1;
7198
7199      // get exact extents
7200      stbir__get_extents( horizontal, &info->scanline_extents );
7201
7202      // pack the horizontal coeffs
7203      horizontal->coefficient_width = stbir__pack_coefficients(horizontal->num_contributors, horizontal->contributors, horizontal->coefficients, horizontal->coefficient_width, horizontal->extent_info.widest, info->scanline_extents.conservative.n0, info->scanline_extents.conservative.n1 );
7204
7205      STBIR_MEMCPY( &info->horizontal, horizontal, sizeof( stbir__sampler ) );
7206
7207      STBIR_PROFILE_BUILD_END( horizontal );
7208
7209      if ( copy_horizontal )
7210      {
7211        STBIR_MEMCPY( &info->vertical, horizontal, sizeof( stbir__sampler ) );
7212      }
7213      else
7214      {
7215        STBIR_PROFILE_BUILD_START( vertical );
7216
7217        stbir__calculate_filters( vertical, possibly_use_horizontal_for_pivot, user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO );
7218        STBIR_MEMCPY( &info->vertical, vertical, sizeof( stbir__sampler ) );
7219
7220        STBIR_PROFILE_BUILD_END( vertical );
7221      }
7222
7223      // setup the vertical split ranges
7224      stbir__get_split_info( info->split_info, info->splits, info->vertical.scale_info.output_sub_size, info->vertical.filter_pixel_margin, info->vertical.scale_info.input_full_size );
7225
7226      // now we know precisely how many entries we need
7227      info->ring_buffer_num_entries = info->vertical.extent_info.widest;
7228
7229      // we never need more ring buffer entries than the scanlines we're outputting
7230      if ( ( !info->vertical.is_gather ) && ( info->ring_buffer_num_entries > conservative_split_output_size ) )
7231        info->ring_buffer_num_entries = conservative_split_output_size;
7232      STBIR_ASSERT( info->ring_buffer_num_entries <= info->alloc_ring_buffer_num_entries );
7233
7234      // a few of the horizontal gather functions read past the end of the decode buffer (but mask it out),
7235      //   so put in normal values so no SNaNs or denormals accidentally sneak in (also, in the ring
7236      //   buffer for vertical first)
7237      for( i = 0 ; i < splits ; i++ )
7238      {
7239        int t, ofs, start;
7240
7241        ofs = decode_buffer_size / 4;
7242
7243        #if defined( STBIR__SEPARATE_ALLOCATIONS ) && defined(STBIR_SIMD8)
7244        if ( effective_channels == 3 ) 
7245          --ofs; // avx in 3 channel mode needs one float at the start of the buffer, so we snap back for clearing
7246        #endif
7247
7248        start = ofs - 4;
7249        if ( start < 0 ) start = 0;
7250
7251        for( t = start ; t < ofs; t++ )
7252          info->split_info[i].decode_buffer[ t ] = 9999.0f;
7253
7254        if ( vertical_first )
7255        {
7256          int j;
7257          for( j = 0; j < info->ring_buffer_num_entries ; j++ )
7258          {
7259            for( t = start ; t < ofs; t++ )
7260              stbir__get_ring_buffer_entry( info, info->split_info + i, j )[ t ] = 9999.0f;
7261          }
7262        }
7263      }
7264    }
7265
7266    #undef STBIR__NEXT_PTR
7267
7268
7269    // is this the first time through the loop?
7270    if ( info == 0 )
7271    {
7272      alloced_total = ( 15 + (size_t)advance_mem );
7273      alloced = STBIR_MALLOC( alloced_total, user_data );
7274      if ( alloced == 0 )
7275        return 0;
7276    }
7277    else
7278      return info;  // success
7279  }
7280}
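/* Illustration only, not part of the library: a minimal sketch of the two-pass
   "measure, then allocate" pattern that the loop above implements with
   STBIR__NEXT_PTR. The first pass walks a cursor that starts at zero to total up
   the 16-byte-aligned sizes; the second pass repeats the same walk from the
   malloc'd base and hands out pointers. The names demo_bufs and demo_alloc are
   hypothetical, and the sketch assumes the combined (non-separate) allocation mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Two hypothetical sub-buffers carved out of a single allocation.
typedef struct { float * a; int * b; } demo_bufs;

// Returns the malloc'd block (release it with free()) or NULL on failure.
// *total_out receives the number of bytes that were requested from malloc.
static void * demo_alloc( demo_bufs * out, size_t na, size_t nb, size_t * total_out )
{
  void * block = NULL;
  size_t total = 0;

  assert( out != NULL && total_out != NULL );

  for(;;)
  {
    uintptr_t cur = (uintptr_t)block;            // zero on the sizing pass

    cur = ( cur + 15 ) & ~(uintptr_t)15;         // align the cursor to 16 bytes
    if ( block ) out->a = (float*)cur;           // only assign on the real pass
    cur += na * sizeof(float);

    cur = ( cur + 15 ) & ~(uintptr_t)15;
    if ( block ) out->b = (int*)cur;
    cur += nb * sizeof(int);

    if ( block ) { *total_out = total; return block; }

    // sizing pass finished: the cursor is the byte total, plus 15 bytes of slack
    // in case malloc returns a block that is not 16-byte aligned
    total = (size_t)cur + 15;
    block = malloc( total );
    if ( block == NULL ) return NULL;
  }
}
```

   Because both passes execute exactly the same sequence of steps, the sizes
   added up on the first pass cannot drift from the pointers handed out on the
   second one.
*/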
7281
7282static int stbir__perform_resize( stbir__info const * info, int split_start, int split_count )
7283{
7284  stbir__per_split_info * split_info = info->split_info + split_start;
7285
7286  STBIR_PROFILE_CLEAR_EXTRAS();
7287
7288  STBIR_PROFILE_FIRST_START( looping );
7289  if (info->vertical.is_gather)
7290    stbir__vertical_gather_loop( info, split_info, split_count );
7291  else
7292    stbir__vertical_scatter_loop( info, split_info, split_count );
7293  STBIR_PROFILE_END( looping );
7294
7295  return 1;
7296}
7297
7298static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * resize )
7299{
7300  static stbir__decode_pixels_func * decode_simple[STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
7301  {
7302    /* 1ch-4ch */ stbir__decode_uint8_srgb, stbir__decode_uint8_srgb, 0, stbir__decode_float_linear, stbir__decode_half_float_linear,
7303  };
7304
7305  static stbir__decode_pixels_func * decode_alphas[STBIRI_AR-STBIRI_RGBA+1][STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
7306  {
7307    { /* RGBA */ stbir__decode_uint8_srgb4_linearalpha,      stbir__decode_uint8_srgb,      0, stbir__decode_float_linear,      stbir__decode_half_float_linear },
7308    { /* BGRA */ stbir__decode_uint8_srgb4_linearalpha_BGRA, stbir__decode_uint8_srgb_BGRA, 0, stbir__decode_float_linear_BGRA, stbir__decode_half_float_linear_BGRA },
7309    { /* ARGB */ stbir__decode_uint8_srgb4_linearalpha_ARGB, stbir__decode_uint8_srgb_ARGB, 0, stbir__decode_float_linear_ARGB, stbir__decode_half_float_linear_ARGB },
7310    { /* ABGR */ stbir__decode_uint8_srgb4_linearalpha_ABGR, stbir__decode_uint8_srgb_ABGR, 0, stbir__decode_float_linear_ABGR, stbir__decode_half_float_linear_ABGR },
7311    { /* RA   */ stbir__decode_uint8_srgb2_linearalpha,      stbir__decode_uint8_srgb,      0, stbir__decode_float_linear,      stbir__decode_half_float_linear },
7312    { /* AR   */ stbir__decode_uint8_srgb2_linearalpha_AR,   stbir__decode_uint8_srgb_AR,   0, stbir__decode_float_linear_AR,   stbir__decode_half_float_linear_AR },
7313  };
7314
7315  static stbir__decode_pixels_func * decode_simple_scaled_or_not[2][2]=
7316  {
7317    { stbir__decode_uint8_linear_scaled,  stbir__decode_uint8_linear }, { stbir__decode_uint16_linear_scaled, stbir__decode_uint16_linear },
7318  };
7319
7320  static stbir__decode_pixels_func * decode_alphas_scaled_or_not[STBIRI_AR-STBIRI_RGBA+1][2][2]=
7321  {
7322    { /* RGBA */ { stbir__decode_uint8_linear_scaled,       stbir__decode_uint8_linear },      { stbir__decode_uint16_linear_scaled,      stbir__decode_uint16_linear } },
7323    { /* BGRA */ { stbir__decode_uint8_linear_scaled_BGRA,  stbir__decode_uint8_linear_BGRA }, { stbir__decode_uint16_linear_scaled_BGRA, stbir__decode_uint16_linear_BGRA } },
7324    { /* ARGB */ { stbir__decode_uint8_linear_scaled_ARGB,  stbir__decode_uint8_linear_ARGB }, { stbir__decode_uint16_linear_scaled_ARGB, stbir__decode_uint16_linear_ARGB } },
7325    { /* ABGR */ { stbir__decode_uint8_linear_scaled_ABGR,  stbir__decode_uint8_linear_ABGR }, { stbir__decode_uint16_linear_scaled_ABGR, stbir__decode_uint16_linear_ABGR } },
7326    { /* RA   */ { stbir__decode_uint8_linear_scaled,       stbir__decode_uint8_linear },      { stbir__decode_uint16_linear_scaled,      stbir__decode_uint16_linear } },
7327    { /* AR   */ { stbir__decode_uint8_linear_scaled_AR,    stbir__decode_uint8_linear_AR },   { stbir__decode_uint16_linear_scaled_AR,   stbir__decode_uint16_linear_AR } }
7328  };
7329
7330  static stbir__encode_pixels_func * encode_simple[STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
7331  {
7332    /* 1ch-4ch */ stbir__encode_uint8_srgb, stbir__encode_uint8_srgb, 0, stbir__encode_float_linear, stbir__encode_half_float_linear,
7333  };
7334
7335  static stbir__encode_pixels_func * encode_alphas[STBIRI_AR-STBIRI_RGBA+1][STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
7336  {
7337    { /* RGBA */ stbir__encode_uint8_srgb4_linearalpha,      stbir__encode_uint8_srgb,      0, stbir__encode_float_linear,      stbir__encode_half_float_linear },
7338    { /* BGRA */ stbir__encode_uint8_srgb4_linearalpha_BGRA, stbir__encode_uint8_srgb_BGRA, 0, stbir__encode_float_linear_BGRA, stbir__encode_half_float_linear_BGRA },
7339    { /* ARGB */ stbir__encode_uint8_srgb4_linearalpha_ARGB, stbir__encode_uint8_srgb_ARGB, 0, stbir__encode_float_linear_ARGB, stbir__encode_half_float_linear_ARGB },
7340    { /* ABGR */ stbir__encode_uint8_srgb4_linearalpha_ABGR, stbir__encode_uint8_srgb_ABGR, 0, stbir__encode_float_linear_ABGR, stbir__encode_half_float_linear_ABGR },
7341    { /* RA   */ stbir__encode_uint8_srgb2_linearalpha,      stbir__encode_uint8_srgb,      0, stbir__encode_float_linear,      stbir__encode_half_float_linear },
7342    { /* AR   */ stbir__encode_uint8_srgb2_linearalpha_AR,   stbir__encode_uint8_srgb_AR,   0, stbir__encode_float_linear_AR,   stbir__encode_half_float_linear_AR }
7343  };
7344
7345  static stbir__encode_pixels_func * encode_simple_scaled_or_not[2][2]=
7346  {
7347    { stbir__encode_uint8_linear_scaled,  stbir__encode_uint8_linear }, { stbir__encode_uint16_linear_scaled, stbir__encode_uint16_linear },
7348  };
7349
7350  static stbir__encode_pixels_func * encode_alphas_scaled_or_not[STBIRI_AR-STBIRI_RGBA+1][2][2]=
7351  {
7352    { /* RGBA */ { stbir__encode_uint8_linear_scaled,       stbir__encode_uint8_linear },       { stbir__encode_uint16_linear_scaled,      stbir__encode_uint16_linear } },
7353    { /* BGRA */ { stbir__encode_uint8_linear_scaled_BGRA,  stbir__encode_uint8_linear_BGRA },  { stbir__encode_uint16_linear_scaled_BGRA, stbir__encode_uint16_linear_BGRA } },
7354    { /* ARGB */ { stbir__encode_uint8_linear_scaled_ARGB,  stbir__encode_uint8_linear_ARGB },  { stbir__encode_uint16_linear_scaled_ARGB, stbir__encode_uint16_linear_ARGB } },
7355    { /* ABGR */ { stbir__encode_uint8_linear_scaled_ABGR,  stbir__encode_uint8_linear_ABGR },  { stbir__encode_uint16_linear_scaled_ABGR, stbir__encode_uint16_linear_ABGR } },
7356    { /* RA   */ { stbir__encode_uint8_linear_scaled,       stbir__encode_uint8_linear },       { stbir__encode_uint16_linear_scaled,      stbir__encode_uint16_linear } },
7357    { /* AR   */ { stbir__encode_uint8_linear_scaled_AR,    stbir__encode_uint8_linear_AR },    { stbir__encode_uint16_linear_scaled_AR,   stbir__encode_uint16_linear_AR } }
7358  };
7359
7360  stbir__decode_pixels_func * decode_pixels = 0;
7361  stbir__encode_pixels_func * encode_pixels = 0;
7362  stbir_datatype input_type, output_type;
7363
7364  input_type = resize->input_data_type;
7365  output_type = resize->output_data_type;
7366  info->input_data = resize->input_pixels;
7367  info->input_stride_bytes = resize->input_stride_in_bytes;
7368  info->output_stride_bytes = resize->output_stride_in_bytes;
7369
7370  // if we're completely point sampling, then we can turn off SRGB
7371  if ( ( info->horizontal.filter_enum == STBIR_FILTER_POINT_SAMPLE ) && ( info->vertical.filter_enum == STBIR_FILTER_POINT_SAMPLE ) )
7372  {
7373    if ( ( ( input_type  == STBIR_TYPE_UINT8_SRGB ) || ( input_type  == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) &&
7374         ( ( output_type == STBIR_TYPE_UINT8_SRGB ) || ( output_type == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) )
7375    {
7376      input_type = STBIR_TYPE_UINT8;
7377      output_type = STBIR_TYPE_UINT8;
7378    }
7379  }
7380
7381  // recalc the output and input strides
7382  if ( info->input_stride_bytes == 0 )
7383    info->input_stride_bytes = info->channels * info->horizontal.scale_info.input_full_size * stbir__type_size[input_type];
7384
7385  if ( info->output_stride_bytes == 0 )
7386    info->output_stride_bytes = info->channels * info->horizontal.scale_info.output_sub_size * stbir__type_size[output_type];
7387
7388  // calc offset
7389  info->output_data = ( (char*) resize->output_pixels ) + ( (size_t) info->offset_y * (size_t) resize->output_stride_in_bytes ) + ( info->offset_x * info->channels * stbir__type_size[output_type] );
7390
7391  info->in_pixels_cb = resize->input_cb;
7392  info->user_data = resize->user_data;
7393  info->out_pixels_cb = resize->output_cb;
7394
7395  // setup the input format converters
7396  if ( ( input_type == STBIR_TYPE_UINT8 ) || ( input_type == STBIR_TYPE_UINT16 ) )
7397  {
7398    int non_scaled = 0;
7399
7400    // check if we can run unscaled - 0-255.0/0-65535.0 instead of 0-1.0 (which is a tiny bit faster when doing linear 8->8 or 16->16)
7401    if ( ( !info->alpha_weight ) && ( !info->alpha_unweight )  ) // don't short circuit when alpha weighting (get everything to 0-1.0 as usual)
7402      if ( ( ( input_type == STBIR_TYPE_UINT8 ) && ( output_type == STBIR_TYPE_UINT8 ) ) || ( ( input_type == STBIR_TYPE_UINT16 ) && ( output_type == STBIR_TYPE_UINT16 ) ) )
7403        non_scaled = 1;
7404
7405    if ( info->input_pixel_layout_internal <= STBIRI_4CHANNEL )
7406      decode_pixels = decode_simple_scaled_or_not[ input_type == STBIR_TYPE_UINT16 ][ non_scaled ];
7407    else
7408      decode_pixels = decode_alphas_scaled_or_not[ ( info->input_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ input_type == STBIR_TYPE_UINT16 ][ non_scaled ];
7409  }
7410  else
7411  {
7412    if ( info->input_pixel_layout_internal <= STBIRI_4CHANNEL )
7413      decode_pixels = decode_simple[ input_type - STBIR_TYPE_UINT8_SRGB ];
7414    else
7415      decode_pixels = decode_alphas[ ( info->input_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ input_type - STBIR_TYPE_UINT8_SRGB ];
7416  }
7417
7418  // setup the output format converters
7419  if ( ( output_type == STBIR_TYPE_UINT8 ) || ( output_type == STBIR_TYPE_UINT16 ) )
7420  {
7421    int non_scaled = 0;
7422
7423    // check if we can run unscaled - 0-255.0/0-65535.0 instead of 0-1.0 (which is a tiny bit faster when doing linear 8->8 or 16->16)
7424    if ( ( !info->alpha_weight ) && ( !info->alpha_unweight ) ) // don't short circuit when alpha weighting (get everything to 0-1.0 as usual)
7425      if ( ( ( input_type == STBIR_TYPE_UINT8 ) && ( output_type == STBIR_TYPE_UINT8 ) ) || ( ( input_type == STBIR_TYPE_UINT16 ) && ( output_type == STBIR_TYPE_UINT16 ) ) )
7426        non_scaled = 1;
7427
7428    if ( info->output_pixel_layout_internal <= STBIRI_4CHANNEL )
7429      encode_pixels = encode_simple_scaled_or_not[ output_type == STBIR_TYPE_UINT16 ][ non_scaled ];
7430    else
7431      encode_pixels = encode_alphas_scaled_or_not[ ( info->output_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ output_type == STBIR_TYPE_UINT16 ][ non_scaled ];
7432  }
7433  else
7434  {
7435    if ( info->output_pixel_layout_internal <= STBIRI_4CHANNEL )
7436      encode_pixels = encode_simple[ output_type - STBIR_TYPE_UINT8_SRGB ];
7437    else
7438      encode_pixels = encode_alphas[ ( info->output_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ output_type - STBIR_TYPE_UINT8_SRGB ];
7439  }
7440
7441  info->input_type = input_type;
7442  info->output_type = output_type;
7443  info->decode_pixels = decode_pixels;
7444  info->encode_pixels = encode_pixels;
7445}
7446
7447static void stbir__clip( int * outx, int * outsubw, int outw, double * u0, double * u1 )
7448{
7449  double per, adj;
7450  int over;
7451
7452  // do left/top edge
7453  if ( *outx < 0 )
7454  {
7455    per = ( (double)*outx ) / ( (double)*outsubw ); // is negative
7456    adj = per * ( *u1 - *u0 );
7457    *u0 -= adj; // increases u0
7458    *outx = 0;
7459  }
7460
7461  // do right/bot edge
7462  over = outw - ( *outx + *outsubw );
7463  if ( over < 0 )
7464  {
7465    per = ( (double)over ) / ( (double)*outsubw ); // is negative
7466    adj = per * ( *u1 - *u0 );
7467    *u1 += adj; // decreases u1
7468    *outsubw = outw - *outx;
7469  }
7470}
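/* Illustration only, not part of the library: a small sketch of the right/bottom
   edge half of the clipping step above, with a worked case. The helper name
   demo_clip_right is hypothetical; it mirrors only the second branch of
   stbir__clip and does not cover the left/top edge.

```c
#include <assert.h>

// Trim an output span [outx, outx + *outsubw) so it ends at outw, and shrink the
// upper input coordinate *u1 by the same fraction of the span that was cut off.
static void demo_clip_right( int outx, int * outsubw, int outw, double u0, double * u1 )
{
  int over;

  assert( *outsubw > 0 );

  over = outw - ( outx + *outsubw );   // negative when the span overhangs the edge
  if ( over < 0 )
  {
    double per = (double)over / (double)*outsubw;   // overhanging fraction (negative)
    *u1 += per * ( *u1 - u0 );                      // pull u1 in by that fraction
    *outsubw = outw - outx;                         // span now ends at the edge
  }
}
```

   For a span starting at x = 20 with width 100 on a 70 pixel wide output, half of
   the span overhangs, so with u0 = 0 and u1 = 1 the call leaves *outsubw == 50
   and *u1 == 0.5.
*/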
7471
7472// converts a double to a rational that has less than one float bit of error (returns 0 if unable to do so)
7473static int stbir__double_to_rational(double f, stbir_uint32 limit, stbir_uint32 *numer, stbir_uint32 *denom, int limit_denom ) // limit_denom (1) or limit numer (0)
7474{
7475  double err;
7476  stbir_uint64 top, bot;
7477  stbir_uint64 numer_last = 0;
7478  stbir_uint64 denom_last = 1;
7479  stbir_uint64 numer_estimate = 1;
7480  stbir_uint64 denom_estimate = 0;
7481
7482  // scale to past float error range
7483  top = (stbir_uint64)( f * (double)(1 << 25) );
7484  bot = 1 << 25;
7485
7486  // keep refining; this usually stops within a few iterations (about 5 for bad cases)
7487  for(;;)
7488  {
7489    stbir_uint64 est, temp;
7490
7491    // hit limit, break out and do best full range estimate
7492    if ( ( ( limit_denom ) ? denom_estimate : numer_estimate ) >= limit )
7493      break;
7494
7495    // is the current error less than 1 bit of a float? if so, we're done
7496    if ( denom_estimate )
7497    {
7498      err = ( (double)numer_estimate / (double)denom_estimate ) - f;
7499      if ( err < 0.0 ) err = -err;
7500      if ( err < ( 1.0 / (double)(1<<24) ) )
7501      {
7502        // yup, found it
7503        *numer = (stbir_uint32) numer_estimate;
7504        *denom = (stbir_uint32) denom_estimate;
7505        return 1;
7506      }
7507    }
7508
7509    // no more refinement bits left? break out and do full range estimate
7510    if ( bot == 0 )
7511      break;
7512
7513    // gcd the estimate bits
7514    est = top / bot;
7515    temp = top % bot;
7516    top = bot;
7517    bot = temp;
7518
7519    // advance the denominator of the continued-fraction estimate
7520    temp = est * denom_estimate + denom_last;
7521    denom_last = denom_estimate;
7522    denom_estimate = temp;
7523
7524    // advance the numerator of the continued-fraction estimate
7525    temp = est * numer_estimate + numer_last;
7526    numer_last = numer_estimate;
7527    numer_estimate = temp;
7528  }
7529
7530  // we didn't find anything good enough for float, use a full range estimate
7531  if ( limit_denom )
7532  {
7533    numer_estimate= (stbir_uint64)( f * (double)limit + 0.5 );
7534    denom_estimate = limit;
7535  }
7536  else
7537  {
7538    numer_estimate = limit;
7539    denom_estimate = (stbir_uint64)( ( (double)limit / f ) + 0.5 );
7540  }
7541
7542  *numer = (stbir_uint32) numer_estimate;
7543  *denom = (stbir_uint32) denom_estimate;
7544
7545  err = ( denom_estimate ) ? ( ( (double)(stbir_uint32)numer_estimate / (double)(stbir_uint32)denom_estimate ) - f ) : 1.0;
7546  if ( err < 0.0 ) err = -err;
7547  return ( err < ( 1.0 / (double)(1<<24) ) ) ? 1 : 0;
7548}
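/* Illustration only, not part of the library: a simplified sketch of the
   continued-fraction refinement used above. It runs the same Euclidean steps on
   a 25-bit fixed-point copy of f and builds successive fraction estimates, but
   it is not a drop-in replacement: the name demo_to_rational is hypothetical, it
   only limits the denominator, it checks the limit after each step rather than
   before, and it simply fails instead of falling back to a full range estimate.

```c
#include <assert.h>
#include <stdint.h>

// Returns 1 and fills *numer and *denom when a fraction within 2^-24 of f is found
// with a denominator no larger than max_denom; returns 0 otherwise. Requires f >= 0.
static int demo_to_rational( double f, uint64_t max_denom, uint64_t * numer, uint64_t * denom )
{
  uint64_t top, bot;
  uint64_t n_last = 0, d_last = 1;   // previous estimate
  uint64_t n_est = 1, d_est = 0;     // current estimate

  assert( f >= 0.0 );

  top = (uint64_t)( f * (double)( 1 << 25 ) );   // f as a fixed-point fraction top/bot
  bot = 1 << 25;

  while ( bot != 0 )
  {
    uint64_t q = top / bot;   // next continued-fraction term
    uint64_t r = top % bot;
    uint64_t t;
    double err;

    top = bot;
    bot = r;

    t = q * n_est + n_last; n_last = n_est; n_est = t;
    t = q * d_est + d_last; d_last = d_est; d_est = t;

    if ( d_est > max_denom )
      return 0;

    err = (double)n_est / (double)d_est - f;
    if ( err < 0.0 ) err = -err;
    if ( err < 1.0 / (double)( 1 << 24 ) )
    {
      *numer = n_est;
      *denom = d_est;
      return 1;
    }
  }
  return 0;
}
```

   With f = 0.75 the estimates run 0/1, 1/1, 3/4 and the last one is accepted;
   an irrational-looking input such as 1.41421356237 with max_denom = 10 runs
   1/1, 3/2, 7/5 and then fails when the next denominator (12) exceeds the limit.
*/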
7549
7550static int stbir__calculate_region_transform( stbir__scale_info * scale_info, int output_full_range, int * output_offset, int output_sub_range, int input_full_range, double input_s0, double input_s1 )
7551{
7552  double output_range, input_range, output_s, input_s, ratio, scale;
7553
7554  input_s = input_s1 - input_s0;
7555
7556  // null area
7557  if ( ( output_full_range == 0 ) || ( input_full_range == 0 ) ||
7558       ( output_sub_range == 0 ) || ( input_s <= stbir__small_float ) )
7559    return 0;
7560
7561  // is either of the ranges completely out of bounds?
7562  if ( ( *output_offset >= output_full_range ) || ( ( *output_offset + output_sub_range ) <= 0 ) || ( input_s0 >= (1.0f-stbir__small_float) ) || ( input_s1 <= stbir__small_float ) )
7563    return 0;
7564
7565  output_range = (double)output_full_range;
7566  input_range = (double)input_full_range;
7567
7568  output_s = ( (double)output_sub_range) / output_range;
7569
7570  // figure out the scaling to use
7571  ratio = output_s / input_s;
7572
7573  // save scale before clipping
7574  scale = ( output_range / input_range ) * ratio;
7575  scale_info->scale = (float)scale;
7576  scale_info->inv_scale = (float)( 1.0 / scale );
7577
7578  // clip output area to left/right output edges (and adjust input area)
7579  stbir__clip( output_offset, &output_sub_range, output_full_range, &input_s0, &input_s1 );
7580
7581  // recalc input area
7582  input_s = input_s1 - input_s0;
7583
7584  // after clipping do we have zero input area?
7585  if ( input_s <= stbir__small_float )
7586    return 0;
7587
7588  // calculate and store the starting source offsets in output pixel space
7589  scale_info->pixel_shift = (float) ( input_s0 * ratio * output_range );
7590
7591  scale_info->scale_is_rational = stbir__double_to_rational( scale, ( scale <= 1.0 ) ? output_full_range : input_full_range, &scale_info->scale_numerator, &scale_info->scale_denominator, ( scale >= 1.0 ) );
7592
7593  scale_info->input_full_size = input_full_range;
7594  scale_info->output_sub_size = output_sub_range;
7595
7596  return 1;
7597}
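/*
  Illustration only (not part of the library): a minimal standalone sketch of the
  arithmetic in stbir__calculate_region_transform for the case where nothing is
  clipped. The sketch_* names are hypothetical, the out-of-bounds checks are left
  out, and the null-area test uses a plain zero instead of stbir__small_float.

```c
#include <assert.h>

// Hypothetical result type for this sketch; the library keeps these values
// (as floats) in stbir__scale_info.
typedef struct { double scale, inv_scale, pixel_shift; } sketch_region;

// output_sub / output_full is the fraction of the output axis being written,
// s0..s1 is the fraction (0..1) of the input axis being read.
// Returns 0 for a null area, 1 otherwise.
static int sketch_region_transform( sketch_region * r, int output_full, int output_sub,
                                    int input_full, double s0, double s1 )
{
  double input_s = s1 - s0;
  double ratio;

  if ( ( output_full == 0 ) || ( input_full == 0 ) || ( output_sub == 0 ) || ( input_s <= 0.0 ) )
    return 0;

  // how much the selected input fraction is stretched relative to the output fraction
  ratio = ( (double)output_sub / (double)output_full ) / input_s;

  r->scale = ( (double)output_full / (double)input_full ) * ratio;
  r->inv_scale = 1.0 / r->scale;

  // start of the source region, measured in output pixels
  r->pixel_shift = s0 * ratio * (double)output_full;
  return 1;
}
```

  For example, stretching a 100 pixel axis to 200 pixels over the full input range
  gives scale 2 and pixel_shift 0. Reading only s0=0.25..s1=0.75 of a 100 pixel axis
  into 100 output pixels also gives scale 2, with pixel_shift 50.
*/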
7598
7599
7600static void stbir__init_and_set_layout( STBIR_RESIZE * resize, stbir_pixel_layout pixel_layout, stbir_datatype data_type )
7601{
7602  resize->input_cb = 0;
7603  resize->output_cb = 0;
7604  resize->user_data = resize;
7605  resize->samplers = 0;
7606  resize->called_alloc = 0;
7607  resize->horizontal_filter = STBIR_FILTER_DEFAULT;
7608  resize->horizontal_filter_kernel = 0; resize->horizontal_filter_support = 0;
7609  resize->vertical_filter = STBIR_FILTER_DEFAULT;
7610  resize->vertical_filter_kernel = 0; resize->vertical_filter_support = 0;
7611  resize->horizontal_edge = STBIR_EDGE_CLAMP;
7612  resize->vertical_edge = STBIR_EDGE_CLAMP;
7613  resize->input_s0 = 0; resize->input_t0 = 0; resize->input_s1 = 1; resize->input_t1 = 1;
7614  resize->output_subx = 0; resize->output_suby = 0; resize->output_subw = resize->output_w; resize->output_subh = resize->output_h;
7615  resize->input_data_type = data_type;
7616  resize->output_data_type = data_type;
7617  resize->input_pixel_layout_public = pixel_layout;
7618  resize->output_pixel_layout_public = pixel_layout;
7619  resize->needs_rebuild = 1;
7620}
7621
7622STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize,
7623                                 const void *input_pixels,  int input_w,  int input_h, int input_stride_in_bytes, // stride can be zero
7624                                       void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, // stride can be zero
7625                                 stbir_pixel_layout pixel_layout, stbir_datatype data_type )
7626{
7627  resize->input_pixels = input_pixels;
7628  resize->input_w = input_w;
7629  resize->input_h = input_h;
7630  resize->input_stride_in_bytes = input_stride_in_bytes;
7631  resize->output_pixels = output_pixels;
7632  resize->output_w = output_w;
7633  resize->output_h = output_h;
7634  resize->output_stride_in_bytes = output_stride_in_bytes;
7635  resize->fast_alpha = 0;
7636
7637  stbir__init_and_set_layout( resize, pixel_layout, data_type );
7638}
7639
7640// You can update parameters any time after resize_init
7641STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type )  // by default, datatype from resize_init
7642{
7643  resize->input_data_type = input_type;
7644  resize->output_data_type = output_type;
7645  if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
7646    stbir__update_info_from_resize( resize->samplers, resize );
7647}
7648
7649STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb )   // no callbacks by default
7650{
7651  resize->input_cb = input_cb;
7652  resize->output_cb = output_cb;
7653
7654  if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
7655  {
7656    resize->samplers->in_pixels_cb = input_cb;
7657    resize->samplers->out_pixels_cb = output_cb;
7658  }
7659}
7660
7661STBIRDEF void stbir_set_user_data( STBIR_RESIZE * resize, void * user_data )                                     // pass back STBIR_RESIZE* by default
7662{
7663  resize->user_data = user_data;
7664  if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
7665    resize->samplers->user_data = user_data;
7666}
7667
7668STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_pixels, int input_stride_in_bytes, void * output_pixels, int output_stride_in_bytes )
7669{
7670  resize->input_pixels = input_pixels;
7671  resize->input_stride_in_bytes = input_stride_in_bytes;
7672  resize->output_pixels = output_pixels;
7673  resize->output_stride_in_bytes = output_stride_in_bytes;
7674  if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
7675    stbir__update_info_from_resize( resize->samplers, resize );
7676}
7677
7678
7679STBIRDEF int stbir_set_edgemodes( STBIR_RESIZE * resize, stbir_edge horizontal_edge, stbir_edge vertical_edge )       // CLAMP by default
7680{
7681  resize->horizontal_edge = horizontal_edge;
7682  resize->vertical_edge = vertical_edge;
7683  resize->needs_rebuild = 1;
7684  return 1;
7685}
7686
7687STBIRDEF int stbir_set_filters( STBIR_RESIZE * resize, stbir_filter horizontal_filter, stbir_filter vertical_filter ) // STBIR_DEFAULT_FILTER_UPSAMPLE/DOWNSAMPLE by default
7688{
7689  resize->horizontal_filter = horizontal_filter;
7690  resize->vertical_filter = vertical_filter;
7691  resize->needs_rebuild = 1;
7692  return 1;
7693}
7694
7695STBIRDEF int stbir_set_filter_callbacks( STBIR_RESIZE * resize, stbir__kernel_callback * horizontal_filter, stbir__support_callback * horizontal_support, stbir__kernel_callback * vertical_filter, stbir__support_callback * vertical_support )
7696{
7697  resize->horizontal_filter_kernel = horizontal_filter; resize->horizontal_filter_support = horizontal_support;
7698  resize->vertical_filter_kernel = vertical_filter; resize->vertical_filter_support = vertical_support;
7699  resize->needs_rebuild = 1;
7700  return 1;
7701}
7702
7703STBIRDEF int stbir_set_pixel_layouts( STBIR_RESIZE * resize, stbir_pixel_layout input_pixel_layout, stbir_pixel_layout output_pixel_layout )   // sets new pixel layouts
7704{
7705  resize->input_pixel_layout_public = input_pixel_layout;
7706  resize->output_pixel_layout_public = output_pixel_layout;
7707  resize->needs_rebuild = 1;
7708  return 1;
7709}
7710
7711
7712STBIRDEF int stbir_set_non_pm_alpha_speed_over_quality( STBIR_RESIZE * resize, int non_pma_alpha_speed_over_quality )   // nonzero favors speed over quality for non-premultiplied alpha
7713{
7714  resize->fast_alpha = non_pma_alpha_speed_over_quality;
7715  resize->needs_rebuild = 1;
7716  return 1;
7717}
7718
7719STBIRDEF int stbir_set_input_subrect( STBIR_RESIZE * resize, double s0, double t0, double s1, double t1 )                 // sets input region (full region by default)
7720{
7721  resize->input_s0 = s0;
7722  resize->input_t0 = t0;
7723  resize->input_s1 = s1;
7724  resize->input_t1 = t1;
7725  resize->needs_rebuild = 1;
7726
7727  // are we inbounds?
7728  if ( ( s1 < stbir__small_float ) || ( (s1-s0) < stbir__small_float ) ||
7729       ( t1 < stbir__small_float ) || ( (t1-t0) < stbir__small_float ) ||
7730       ( s0 > (1.0f-stbir__small_float) ) ||
7731       ( t0 > (1.0f-stbir__small_float) ) )
7732    return 0;
7733
7734  return 1;
7735}
7736
7737STBIRDEF int stbir_set_output_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh )          // sets output region (full region by default)
7738{
7739  resize->output_subx = subx;
7740  resize->output_suby = suby;
7741  resize->output_subw = subw;
7742  resize->output_subh = subh;
7743  resize->needs_rebuild = 1;
7744
7745  // are we inbounds?
7746  if ( ( subx >= resize->output_w ) || ( ( subx + subw ) <= 0 ) || ( suby >= resize->output_h ) || ( ( suby + subh ) <= 0 ) || ( subw == 0 ) || ( subh == 0 ) )
7747    return 0;
7748
7749  return 1;
7750}
7751
7752STBIRDEF int stbir_set_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh )                 // sets both regions (full regions by default)
7753{
7754  double s0, t0, s1, t1;
7755
7756  s0 = ( (double)subx ) / ( (double)resize->output_w );
7757  t0 = ( (double)suby ) / ( (double)resize->output_h );
7758  s1 = ( (double)(subx+subw) ) / ( (double)resize->output_w );
7759  t1 = ( (double)(suby+subh) ) / ( (double)resize->output_h );
7760
7761  resize->input_s0 = s0;
7762  resize->input_t0 = t0;
7763  resize->input_s1 = s1;
7764  resize->input_t1 = t1;
7765  resize->output_subx = subx;
7766  resize->output_suby = suby;
7767  resize->output_subw = subw;
7768  resize->output_subh = subh;
7769  resize->needs_rebuild = 1;
7770
7771  // are we inbounds?
7772  if ( ( subx >= resize->output_w ) || ( ( subx + subw ) <= 0 ) || ( suby >= resize->output_h ) || ( ( suby + subh ) <= 0 ) || ( subw == 0 ) || ( subh == 0 ) )
7773    return 0;
7774
7775  return 1;
7776}
7777
7778static int stbir__perform_build( STBIR_RESIZE * resize, int splits )
7779{
7780  stbir__contributors conservative = { 0, 0 };
7781  stbir__sampler horizontal, vertical;
7782  int new_output_subx, new_output_suby;
7783  stbir__info * out_info;
7784  #ifdef STBIR_PROFILE
7785  stbir__info profile_infod;  // used to contain building profile info before everything is allocated
7786  stbir__info * profile_info = &profile_infod;
7787  #endif
7788
7789  // have we already built the samplers?
7790  if ( resize->samplers )
7791    return 0;
7792
7793  #define STBIR_RETURN_ERROR_AND_ASSERT( exp )  STBIR_ASSERT( !(exp) ); if (exp) return 0;
7794  STBIR_RETURN_ERROR_AND_ASSERT( (unsigned)resize->horizontal_filter >= STBIR_FILTER_OTHER)
7795  STBIR_RETURN_ERROR_AND_ASSERT( (unsigned)resize->vertical_filter >= STBIR_FILTER_OTHER)
7796  #undef STBIR_RETURN_ERROR_AND_ASSERT
7797
7798  if ( splits <= 0 )
7799    return 0;
7800
7801  STBIR_PROFILE_BUILD_FIRST_START( build );
7802
7803  new_output_subx = resize->output_subx;
7804  new_output_suby = resize->output_suby;
7805
7806  // do horizontal clip and scale calcs
7807  if ( !stbir__calculate_region_transform( &horizontal.scale_info, resize->output_w, &new_output_subx, resize->output_subw, resize->input_w, resize->input_s0, resize->input_s1 ) )
7808    return 0;
7809
7810  // do vertical clip and scale calcs
7811  if ( !stbir__calculate_region_transform( &vertical.scale_info, resize->output_h, &new_output_suby, resize->output_subh, resize->input_h, resize->input_t0, resize->input_t1 ) )
7812    return 0;
7813
7814  // if nothing to do, just return
7815  if ( ( horizontal.scale_info.output_sub_size == 0 ) || ( vertical.scale_info.output_sub_size == 0 ) )
7816    return 0;
7817
7818  stbir__set_sampler(&horizontal, resize->horizontal_filter, resize->horizontal_filter_kernel, resize->horizontal_filter_support, resize->horizontal_edge, &horizontal.scale_info, 1, resize->user_data );
7819  stbir__get_conservative_extents( &horizontal, &conservative, resize->user_data );
7820  stbir__set_sampler(&vertical, resize->vertical_filter, resize->vertical_filter_kernel, resize->vertical_filter_support, resize->vertical_edge, &vertical.scale_info, 0, resize->user_data );
7821
7822  if ( ( vertical.scale_info.output_sub_size / splits ) < STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS ) // each split should be a minimum of 4 scanlines (handwavey choice)
7823  {
7824    splits = vertical.scale_info.output_sub_size / STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS;
7825    if ( splits == 0 ) splits = 1;
7826  }
7827
7828  STBIR_PROFILE_BUILD_START( alloc );
7829  out_info = stbir__alloc_internal_mem_and_build_samplers( &horizontal, &vertical, &conservative, resize->input_pixel_layout_public, resize->output_pixel_layout_public, splits, new_output_subx, new_output_suby, resize->fast_alpha, resize->user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO );
7830  STBIR_PROFILE_BUILD_END( alloc );
7831  STBIR_PROFILE_BUILD_END( build );
7832
7833  if ( out_info )
7834  {
7835    resize->splits = splits;
7836    resize->samplers = out_info;
7837    resize->needs_rebuild = 0;
7838    #ifdef STBIR_PROFILE
7839      STBIR_MEMCPY( &out_info->profile, &profile_infod.profile, sizeof( out_info->profile ) );
7840    #endif
7841
7842    // update anything that can be changed without recalcing samplers
7843    stbir__update_info_from_resize( out_info, resize );
7844
7845    return splits;
7846  }
7847
7848  return 0;
7849}
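/*
  Illustration only (not part of the library): a small standalone sketch of how the
  requested split count is reduced in stbir__perform_build so that every split keeps
  a minimum number of output scanlines. The sketch_* names are hypothetical, and the
  minimum of 4 is an assumption taken from the comment next to
  STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS.

```c
#include <assert.h>

// Assumed value: the source comment mentions a minimum of 4 scanlines per split.
#define SKETCH_MIN_SCANLINES_PER_SPLIT 4

// Returns the split count that would actually be used, or 0 for an invalid request.
static int sketch_clamp_splits( int output_rows, int requested_splits )
{
  if ( requested_splits <= 0 )
    return 0;

  // too few rows per split? then use as many splits as the minimum allows
  if ( ( output_rows / requested_splits ) < SKETCH_MIN_SCANLINES_PER_SPLIT )
  {
    requested_splits = output_rows / SKETCH_MIN_SCANLINES_PER_SPLIT;
    if ( requested_splits == 0 )
      requested_splits = 1; // tiny outputs still get one split
  }

  return requested_splits;
}
```

  This is why callers of stbir_build_samplers_with_splits should use the returned
  count rather than the count they asked for: asking for 8 splits on a 10 row
  output yields 2 under the assumption above.
*/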
7850
7851STBIRDEF void stbir_free_samplers( STBIR_RESIZE * resize )
7852{
7853  if ( resize->samplers )
7854  {
7855    stbir__free_internal_mem( resize->samplers );
7856    resize->samplers = 0;
7857    resize->called_alloc = 0;
7858  }
7859}
7860
7861STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int splits )
7862{
7863  if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) )
7864  {
7865    if ( resize->samplers )
7866      stbir_free_samplers( resize );
7867
7868    resize->called_alloc = 1;
7869    return stbir__perform_build( resize, splits );
7870  }
7871
7872  STBIR_PROFILE_BUILD_CLEAR( resize->samplers );
7873
7874  return 1;
7875}
7876
7877STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize )
7878{
7879  return stbir_build_samplers_with_splits( resize, 1 );
7880}
7881
7882STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize )
7883{
7884  int result;
7885
7886  if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) )
7887  {
7888    int alloc_state = resize->called_alloc;  // remember allocated state
7889
7890    if ( resize->samplers )
7891    {
7892      stbir__free_internal_mem( resize->samplers );
7893      resize->samplers = 0;
7894    }
7895
7896    if ( !stbir_build_samplers( resize ) )
7897      return 0;
7898
7899    resize->called_alloc = alloc_state;
7900
7901    // if build_samplers succeeded (above), but there are no samplers set, then
7902    //   the area to stretch into was zero pixels, so don't do anything and return
7903    //   success
7904    if ( resize->samplers == 0 )
7905      return 1;
7906  }
7907  else
7908  {
7909    // didn't build anything - clear it
7910    STBIR_PROFILE_BUILD_CLEAR( resize->samplers );
7911  }
7912
7913  // do resize
7914  result = stbir__perform_resize( resize->samplers, 0, resize->splits );
7915
7916  // free the samplers unless the caller built them with stbir_build_samplers
7917  if ( !resize->called_alloc )
7918  {
7919    stbir_free_samplers( resize );
7920    resize->samplers = 0;
7921  }
7922
7923  return result;
7924}
7925
7926STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start, int split_count )
7927{
7928  STBIR_ASSERT( resize->samplers );
7929
7930  // if we're just doing the whole thing, call full
7931  if ( ( split_start == -1 ) || ( ( split_start == 0 ) && ( split_count == resize->splits ) ) )
7932    return stbir_resize_extended( resize );
7933
7934  // you **must** build samplers first when using split resize
7935  if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) )
7936    return 0;
7937
7938  if ( ( split_start >= resize->splits ) || ( split_start < 0 ) || ( ( split_start + split_count ) > resize->splits ) || ( split_count <= 0 ) )
7939    return 0;
7940
7941  // do resize
7942  return stbir__perform_resize( resize->samplers, split_start, split_count );
7943}
7944
7945static int stbir__check_output_stuff( void ** ret_ptr, int * ret_pitch, void * output_pixels, int type_size, int output_w, int output_h, int output_stride_in_bytes, stbir_internal_pixel_layout pixel_layout )
7946{
7947  size_t size;
7948  int pitch;
7949  void * ptr;
7950
7951  pitch = output_w * type_size * stbir__pixel_channels[ pixel_layout ];
7952  if ( pitch == 0 )
7953    return 0;
7954
7955  if ( output_stride_in_bytes == 0 )
7956    output_stride_in_bytes = pitch;
7957
7958  if ( output_stride_in_bytes < pitch )
7959    return 0;
7960
7961  size = (size_t)output_stride_in_bytes * (size_t)output_h;
7962  if ( size == 0 )
7963    return 0;
7964
7965  *ret_ptr = 0;
7966  *ret_pitch = output_stride_in_bytes;
7967
7968  if ( output_pixels == 0 )
7969  {
7970    ptr = STBIR_MALLOC( size, 0 );
7971    if ( ptr == 0 )
7972      return 0;
7973
7974    *ret_ptr = ptr;
7975    *ret_pitch = pitch;
7976  }
7977
7978  return 1;
7979}
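/*
  Illustration only (not part of the library): a standalone sketch of the stride
  checks made by stbir__check_output_stuff before an output buffer is used or
  allocated. The sketch_* name and the flattened parameter list are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

// Returns the number of bytes the output image needs, or 0 if the request is
// invalid. A stride of 0 on entry means "tightly packed" and is replaced by the
// packed row size.
static size_t sketch_output_size( int width, int height, int channels,
                                  int bytes_per_channel, int * stride_in_bytes )
{
  int pitch = width * bytes_per_channel * channels; // bytes actually used per row

  if ( pitch == 0 )
    return 0;

  if ( *stride_in_bytes == 0 )
    *stride_in_bytes = pitch;

  // rows would overlap if the stride were smaller than one packed row
  if ( *stride_in_bytes < pitch )
    return 0;

  return (size_t)*stride_in_bytes * (size_t)height;
}
```

  A 64x32 four-channel uint8 image with stride 0 needs 256 bytes per row and 8192
  bytes in total, while a stride of 100 is rejected because it is smaller than one
  packed row.
*/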
7980
7981
7982STBIRDEF unsigned char * stbir_resize_uint8_linear( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
7983                                                          unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
7984                                                          stbir_pixel_layout pixel_layout )
7985{
7986  STBIR_RESIZE resize;
7987  unsigned char * optr;
7988  int opitch;
7989
7990  if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( unsigned char ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
7991    return 0;
7992
7993  stbir_resize_init( &resize,
7994                     input_pixels,  input_w,  input_h,  input_stride_in_bytes,
7995                     (optr) ? optr : output_pixels, output_w, output_h, opitch,
7996                     pixel_layout, STBIR_TYPE_UINT8 );
7997
7998  if ( !stbir_resize_extended( &resize ) )
7999  {
8000    if ( optr )
8001      STBIR_FREE( optr, 0 );
8002    return 0;
8003  }
8004
8005  return (optr) ? optr : output_pixels;
8006}
8007
8008STBIRDEF unsigned char * stbir_resize_uint8_srgb( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
8009                                                        unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
8010                                                        stbir_pixel_layout pixel_layout )
8011{
8012  STBIR_RESIZE resize;
8013  unsigned char * optr;
8014  int opitch;
8015
8016  if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( unsigned char ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
8017    return 0;
8018
8019  stbir_resize_init( &resize,
8020                     input_pixels,  input_w,  input_h,  input_stride_in_bytes,
8021                     (optr) ? optr : output_pixels, output_w, output_h, opitch,
8022                     pixel_layout, STBIR_TYPE_UINT8_SRGB );
8023
8024  if ( !stbir_resize_extended( &resize ) )
8025  {
8026    if ( optr )
8027      STBIR_FREE( optr, 0 );
8028    return 0;
8029  }
8030
8031  return (optr) ? optr : output_pixels;
8032}
8033
8034
8035STBIRDEF float * stbir_resize_float_linear( const float *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
8036                                                  float *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
8037                                                  stbir_pixel_layout pixel_layout )
8038{
8039  STBIR_RESIZE resize;
8040  float * optr;
8041  int opitch;
8042
8043  if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( float ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
8044    return 0;
8045
8046  stbir_resize_init( &resize,
8047                     input_pixels,  input_w,  input_h,  input_stride_in_bytes,
8048                     (optr) ? optr : output_pixels, output_w, output_h, opitch,
8049                     pixel_layout, STBIR_TYPE_FLOAT );
8050
8051  if ( !stbir_resize_extended( &resize ) )
8052  {
8053    if ( optr )
8054      STBIR_FREE( optr, 0 );
8055    return 0;
8056  }
8057
8058  return (optr) ? optr : output_pixels;
8059}
8060
8061
8062STBIRDEF void * stbir_resize( const void *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
8063                                    void *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
8064                              stbir_pixel_layout pixel_layout, stbir_datatype data_type,
8065                              stbir_edge edge, stbir_filter filter )
8066{
8067  STBIR_RESIZE resize;
8068  float * optr;
8069  int opitch;
8070
8071  if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, stbir__type_size[data_type], output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
8072    return 0;
8073
8074  stbir_resize_init( &resize,
8075                     input_pixels,  input_w,  input_h,  input_stride_in_bytes,
8076                     (optr) ? optr : output_pixels, output_w, output_h, output_stride_in_bytes,
8077                     pixel_layout, data_type );
8078
8079  resize.horizontal_edge = edge;
8080  resize.vertical_edge = edge;
8081  resize.horizontal_filter = filter;
8082  resize.vertical_filter = filter;
8083
8084  if ( !stbir_resize_extended( &resize ) )
8085  {
8086    if ( optr )
8087      STBIR_FREE( optr, 0 );
8088    return 0;
8089  }
8090
8091  return (optr) ? optr : output_pixels;
8092}
8093
8094#ifdef STBIR_PROFILE
8095
8096STBIRDEF void stbir_resize_build_profile_info( STBIR_PROFILE_INFO * info, STBIR_RESIZE const * resize )
8097{
8098  static char const * bdescriptions[6] = { "Building", "Allocating", "Horizontal sampler", "Vertical sampler", "Coefficient cleanup", "Coefficient pivot" } ;
8099  stbir__info* samp = resize->samplers;
8100  int i;
8101
8102  typedef int testa[ (STBIR__ARRAY_SIZE( bdescriptions ) == (STBIR__ARRAY_SIZE( samp->profile.array )-1) )?1:-1];
8103  typedef int testb[ (sizeof( samp->profile.array ) == (sizeof(samp->profile.named)) )?1:-1];
8104  typedef int testc[ (sizeof( info->clocks ) >= (sizeof(samp->profile.named)) )?1:-1];
8105
8106  for( i = 0 ; i < STBIR__ARRAY_SIZE( bdescriptions ) ; i++)
8107    info->clocks[i] = samp->profile.array[i+1];
8108
8109  info->total_clocks = samp->profile.named.total;
8110  info->descriptions = bdescriptions;
8111  info->count = STBIR__ARRAY_SIZE( bdescriptions );
8112}
8113
8114STBIRDEF void stbir_resize_split_profile_info( STBIR_PROFILE_INFO * info, STBIR_RESIZE const * resize, int split_start, int split_count )
8115{
8116  static char const * descriptions[7] = { "Looping", "Vertical sampling", "Horizontal sampling", "Scanline input", "Scanline output", "Alpha weighting", "Alpha unweighting" };
8117  stbir__per_split_info * split_info;
8118  int s, i;
8119
8120  typedef int testa[ (STBIR__ARRAY_SIZE( descriptions ) == (STBIR__ARRAY_SIZE( split_info->profile.array )-1) )?1:-1];
8121  typedef int testb[ (sizeof( split_info->profile.array ) == (sizeof(split_info->profile.named)) )?1:-1];
8122  typedef int testc[ (sizeof( info->clocks ) >= (sizeof(split_info->profile.named)) )?1:-1];
8123
8124  if ( split_start == -1 )
8125  {
8126    split_start = 0;
8127    split_count = resize->samplers->splits;
8128  }
8129
8130  if ( ( split_start >= resize->splits ) || ( split_start < 0 ) || ( ( split_start + split_count ) > resize->splits ) || ( split_count <= 0 ) )
8131  {
8132    info->total_clocks = 0;
8133    info->descriptions = 0;
8134    info->count = 0;
8135    return;
8136  }
8137
8138  split_info = resize->samplers->split_info + split_start;
8139
8140  // sum up the profile from all the splits
8141  for( i = 0 ; i < STBIR__ARRAY_SIZE( descriptions ) ; i++ )
8142  {
8143    stbir_uint64 sum = 0;
8144    for( s = 0 ; s < split_count ; s++ )
8145      sum += split_info[s].profile.array[i+1];
8146    info->clocks[i] = sum;
8147  }
8148
8149  info->total_clocks = split_info->profile.named.total;
8150  info->descriptions = descriptions;
8151  info->count = STBIR__ARRAY_SIZE( descriptions );
8152}
8153
8154STBIRDEF void stbir_resize_extended_profile_info( STBIR_PROFILE_INFO * info, STBIR_RESIZE const * resize )
8155{
8156  stbir_resize_split_profile_info( info, resize, -1, 0 );
8157}
8158
8159#endif // STBIR_PROFILE
8160
8161#undef STBIR_BGR
8162#undef STBIR_1CHANNEL
8163#undef STBIR_2CHANNEL
8164#undef STBIR_RGB
8165#undef STBIR_RGBA
8166#undef STBIR_4CHANNEL
8167#undef STBIR_BGRA
8168#undef STBIR_ARGB
8169#undef STBIR_ABGR
8170#undef STBIR_RA
8171#undef STBIR_AR
8172#undef STBIR_RGBA_PM
8173#undef STBIR_BGRA_PM
8174#undef STBIR_ARGB_PM
8175#undef STBIR_ABGR_PM
8176#undef STBIR_RA_PM
8177#undef STBIR_AR_PM
8178
8179#endif // STB_IMAGE_RESIZE_IMPLEMENTATION
8180
8181#else  // STB_IMAGE_RESIZE_HORIZONTALS&STB_IMAGE_RESIZE_DO_VERTICALS
8182
8183// we reinclude the header file to define all the horizontal functions
8184//   specializing each function for the number of coeffs is 20-40% faster *OVERALL*
8185
8186// by including the header file again this way, we can still debug the functions
8187
8188#define STBIR_strs_join2( start, mid, end ) start##mid##end
8189#define STBIR_strs_join1( start, mid, end ) STBIR_strs_join2( start, mid, end )
8190
8191#define STBIR_strs_join24( start, mid1, mid2, end ) start##mid1##mid2##end
8192#define STBIR_strs_join14( start, mid1, mid2, end ) STBIR_strs_join24( start, mid1, mid2, end )
8193
8194#ifdef STB_IMAGE_RESIZE_DO_CODERS
8195
8196#ifdef stbir__decode_suffix
8197#define STBIR__CODER_NAME( name ) STBIR_strs_join1( name, _, stbir__decode_suffix )
8198#else
8199#define STBIR__CODER_NAME( name ) name
8200#endif
8201
8202#ifdef stbir__decode_swizzle
8203#define stbir__decode_simdf8_flip(reg) STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( stbir__simdf8_0123to,stbir__decode_order0,stbir__decode_order1),stbir__decode_order2,stbir__decode_order3),stbir__decode_order0,stbir__decode_order1),stbir__decode_order2,stbir__decode_order3)(reg, reg)
8204#define stbir__decode_simdf4_flip(reg) STBIR_strs_join1( STBIR_strs_join1( stbir__simdf_0123to,stbir__decode_order0,stbir__decode_order1),stbir__decode_order2,stbir__decode_order3)(reg, reg)
8205#define stbir__encode_simdf8_unflip(reg) STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( stbir__simdf8_0123to,stbir__encode_order0,stbir__encode_order1),stbir__encode_order2,stbir__encode_order3),stbir__encode_order0,stbir__encode_order1),stbir__encode_order2,stbir__encode_order3)(reg, reg)
8206#define stbir__encode_simdf4_unflip(reg) STBIR_strs_join1( STBIR_strs_join1( stbir__simdf_0123to,stbir__encode_order0,stbir__encode_order1),stbir__encode_order2,stbir__encode_order3)(reg, reg)
8207#else
8208#define stbir__decode_order0 0
8209#define stbir__decode_order1 1
8210#define stbir__decode_order2 2
8211#define stbir__decode_order3 3
8212#define stbir__encode_order0 0
8213#define stbir__encode_order1 1
8214#define stbir__encode_order2 2
8215#define stbir__encode_order3 3
8216#define stbir__decode_simdf8_flip(reg)
8217#define stbir__decode_simdf4_flip(reg)
8218#define stbir__encode_simdf8_unflip(reg)
8219#define stbir__encode_simdf4_unflip(reg)
8220#endif
8221
8222#ifdef STBIR_SIMD8
8223#define stbir__encode_simdfX_unflip  stbir__encode_simdf8_unflip
8224#else
8225#define stbir__encode_simdfX_unflip  stbir__encode_simdf4_unflip
8226#endif
8227
8228static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * decodep, int width_times_channels, void const * inputp )
8229{
8230  float STBIR_STREAMOUT_PTR( * ) decode = decodep;
8231  float * decode_end = (float*) decode + width_times_channels;
8232  unsigned char const * input = (unsigned char const*)inputp;
8233
8234  #ifdef STBIR_SIMD
8235  unsigned char const * end_input_m16 = input + width_times_channels - 16;
8236  if ( width_times_channels >= 16 )
8237  {
8238    decode_end -= 16;
8239    STBIR_NO_UNROLL_LOOP_START_INF_FOR
8240    for(;;)
8241    {
8242      #ifdef STBIR_SIMD8
8243      stbir__simdi i; stbir__simdi8 o0,o1;
8244      stbir__simdf8 of0, of1;
8245      STBIR_NO_UNROLL(decode);
8246      stbir__simdi_load( i, input );
8247      stbir__simdi8_expand_u8_to_u32( o0, o1, i );
8248      stbir__simdi8_convert_i32_to_float( of0, o0 );
8249      stbir__simdi8_convert_i32_to_float( of1, o1 );
8250      stbir__simdf8_mult( of0, of0, STBIR_max_uint8_as_float_inverted8);
8251      stbir__simdf8_mult( of1, of1, STBIR_max_uint8_as_float_inverted8);
8252      stbir__decode_simdf8_flip( of0 );
8253      stbir__decode_simdf8_flip( of1 );
8254      stbir__simdf8_store( decode + 0, of0 );
8255      stbir__simdf8_store( decode + 8, of1 );
8256      #else
8257      stbir__simdi i, o0, o1, o2, o3;
8258      stbir__simdf of0, of1, of2, of3;
8259      STBIR_NO_UNROLL(decode);
8260      stbir__simdi_load( i, input );
8261      stbir__simdi_expand_u8_to_u32( o0,o1,o2,o3,i);
8262      stbir__simdi_convert_i32_to_float( of0, o0 );
8263      stbir__simdi_convert_i32_to_float( of1, o1 );
8264      stbir__simdi_convert_i32_to_float( of2, o2 );
8265      stbir__simdi_convert_i32_to_float( of3, o3 );
8266      stbir__simdf_mult( of0, of0, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) );
8267      stbir__simdf_mult( of1, of1, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) );
8268      stbir__simdf_mult( of2, of2, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) );
8269      stbir__simdf_mult( of3, of3, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) );
8270      stbir__decode_simdf4_flip( of0 );
8271      stbir__decode_simdf4_flip( of1 );
8272      stbir__decode_simdf4_flip( of2 );
8273      stbir__decode_simdf4_flip( of3 );
8274      stbir__simdf_store( decode + 0,  of0 );
8275      stbir__simdf_store( decode + 4,  of1 );
8276      stbir__simdf_store( decode + 8,  of2 );
8277      stbir__simdf_store( decode + 12, of3 );
8278      #endif
8279      decode += 16;
8280      input += 16;
8281      if ( decode <= decode_end )
8282        continue;
8283      if ( decode == ( decode_end + 16 ) )
8284        break;
8285      decode = decode_end; // backup and do last couple
8286      input = end_input_m16;
8287    }
8288    return;
8289  }
8290  #endif
8291
8292  // try to do blocks of 4 when you can
8293  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8294  decode += 4;
8295  STBIR_SIMD_NO_UNROLL_LOOP_START
8296  while( decode <= decode_end )
8297  {
8298    STBIR_SIMD_NO_UNROLL(decode);
8299    decode[0-4] = ((float)(input[stbir__decode_order0])) * stbir__max_uint8_as_float_inverted;
8300    decode[1-4] = ((float)(input[stbir__decode_order1])) * stbir__max_uint8_as_float_inverted;
8301    decode[2-4] = ((float)(input[stbir__decode_order2])) * stbir__max_uint8_as_float_inverted;
8302    decode[3-4] = ((float)(input[stbir__decode_order3])) * stbir__max_uint8_as_float_inverted;
8303    decode += 4;
8304    input += 4;
8305  }
8306  decode -= 4;
8307  #endif
8308
8309  // do the remnants
8310  #if stbir__coder_min_num < 4
8311  STBIR_NO_UNROLL_LOOP_START
8312  while( decode < decode_end )
8313  {
8314    STBIR_NO_UNROLL(decode);
8315    decode[0] = ((float)(input[stbir__decode_order0])) * stbir__max_uint8_as_float_inverted;
8316    #if stbir__coder_min_num >= 2
8317    decode[1] = ((float)(input[stbir__decode_order1])) * stbir__max_uint8_as_float_inverted;
8318    #endif
8319    #if stbir__coder_min_num >= 3
8320    decode[2] = ((float)(input[stbir__decode_order2])) * stbir__max_uint8_as_float_inverted;
8321    #endif
8322    decode += stbir__coder_min_num;
8323    input += stbir__coder_min_num;
8324  }
8325  #endif
8326}
8327
8328static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outputp, int width_times_channels, float const * encode )
8329{
8330  unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char *) outputp;
8331  unsigned char * end_output = ( (unsigned char *) output ) + width_times_channels;
8332
8333  #ifdef STBIR_SIMD
8334  if ( width_times_channels >= stbir__simdfX_float_count*2 )
8335  {
8336    float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
8337    end_output -= stbir__simdfX_float_count*2;
8338    STBIR_NO_UNROLL_LOOP_START_INF_FOR
8339    for(;;)
8340    {
8341      stbir__simdfX e0, e1;
8342      stbir__simdi i;
8343      STBIR_SIMD_NO_UNROLL(encode);
8344      stbir__simdfX_madd_mem( e0, STBIR_simd_point5X, STBIR_max_uint8_as_floatX, encode );
8345      stbir__simdfX_madd_mem( e1, STBIR_simd_point5X, STBIR_max_uint8_as_floatX, encode+stbir__simdfX_float_count );
8346      stbir__encode_simdfX_unflip( e0 );
8347      stbir__encode_simdfX_unflip( e1 );
8348      #ifdef STBIR_SIMD8
8349      stbir__simdf8_pack_to_16bytes( i, e0, e1 );
8350      stbir__simdi_store( output, i );
8351      #else
8352      stbir__simdf_pack_to_8bytes( i, e0, e1 );
8353      stbir__simdi_store2( output, i );
8354      #endif
8355      encode += stbir__simdfX_float_count*2;
8356      output += stbir__simdfX_float_count*2;
8357      if ( output <= end_output )
8358        continue;
8359      if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
8360        break;
8361      output = end_output; // backup and do last couple
8362      encode = end_encode_m8;
8363    }
8364    return;
8365  }
8366
8367  // try to do blocks of 4 when you can
8368  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8369  output += 4;
8370  STBIR_NO_UNROLL_LOOP_START
8371  while( output <= end_output )
8372  {
8373    stbir__simdf e0;
8374    stbir__simdi i0;
8375    STBIR_NO_UNROLL(encode);
8376    stbir__simdf_load( e0, encode );
8377    stbir__simdf_madd( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), e0 );
8378    stbir__encode_simdf4_unflip( e0 );
8379    stbir__simdf_pack_to_8bytes( i0, e0, e0 );  // only use first 4
8380    *(int*)(output-4) = stbir__simdi_to_int( i0 );
8381    output += 4;
8382    encode += 4;
8383  }
8384  output -= 4;
8385  #endif
8386
8387  // do the remnants
8388  #if stbir__coder_min_num < 4
8389  STBIR_NO_UNROLL_LOOP_START
8390  while( output < end_output )
8391  {
8392    stbir__simdf e0;
8393    STBIR_NO_UNROLL(encode);
8394    stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order0 ); output[0] = stbir__simdf_convert_float_to_uint8( e0 );
8395    #if stbir__coder_min_num >= 2
8396    stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order1 ); output[1] = stbir__simdf_convert_float_to_uint8( e0 );
8397    #endif
8398    #if stbir__coder_min_num >= 3
8399    stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order2 ); output[2] = stbir__simdf_convert_float_to_uint8( e0 );
8400    #endif
8401    output += stbir__coder_min_num;
8402    encode += stbir__coder_min_num;
8403  }
8404  #endif
8405
8406  #else
8407
8408  // try to do blocks of 4 when you can
8409  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8410  output += 4;
8411  while( output <= end_output )
8412  {
8413    float f;
8414    f = encode[stbir__encode_order0] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[0-4] = (unsigned char)f;
8415    f = encode[stbir__encode_order1] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[1-4] = (unsigned char)f;
8416    f = encode[stbir__encode_order2] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[2-4] = (unsigned char)f;
8417    f = encode[stbir__encode_order3] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[3-4] = (unsigned char)f;
8418    output += 4;
8419    encode += 4;
8420  }
8421  output -= 4;
8422  #endif
8423
8424  // do the remnants
8425  #if stbir__coder_min_num < 4
8426  STBIR_NO_UNROLL_LOOP_START
8427  while( output < end_output )
8428  {
8429    float f;
8430    STBIR_NO_UNROLL(encode);
8431    f = encode[stbir__encode_order0] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[0] = (unsigned char)f;
8432    #if stbir__coder_min_num >= 2
8433    f = encode[stbir__encode_order1] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[1] = (unsigned char)f;
8434    #endif
8435    #if stbir__coder_min_num >= 3
8436    f = encode[stbir__encode_order2] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[2] = (unsigned char)f;
8437    #endif
8438    output += stbir__coder_min_num;
8439    encode += stbir__coder_min_num;
8440  }
8441  #endif
8442  #endif
8443}
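/*
  The SIMD loop in the function above never handles a partial block. When fewer
  than a full block remains, it moves both pointers back so that the final block
  ends exactly at the end of the row, and re-encodes a few elements. Below is a
  minimal scalar sketch of that control flow, not the library's code: BLOCK,
  encode_block and encode_overlapped are hypothetical names, and a plain loop
  stands in for the SIMD multiply-add, pack and store.

```c
#include <assert.h>

enum { BLOCK = 8 };  // hypothetical block width, standing in for two SIMD registers

// Stand-in for one SIMD step: scale to 0..255, round, clamp, store BLOCK bytes.
static void encode_block( unsigned char * out, float const * in )
{
  int k;
  for ( k = 0; k < BLOCK; k++ )
  {
    float f = in[k] * 255.0f + 0.5f;
    if ( f < 0.0f ) f = 0.0f;
    if ( f > 255.0f ) f = 255.0f;
    out[k] = (unsigned char) f;
  }
}

// Encodes n floats (n >= BLOCK) using whole blocks only.
// Returns the number of blocks that were processed.
static int encode_overlapped( unsigned char * out, int n, float const * in )
{
  unsigned char * last_out = out + n - BLOCK;  // where the final block must start
  float const * last_in = in + n - BLOCK;
  int blocks = 0;

  assert( n >= BLOCK );
  for(;;)
  {
    encode_block( out, in );
    blocks++;
    out += BLOCK;
    in += BLOCK;
    if ( out <= last_out )
      continue;                    // another whole block still fits
    if ( out == last_out + BLOCK )
      break;                       // landed exactly on the end of the row
    out = last_out;                // back up and redo the overlapping tail
    in = last_in;
  }
  return blocks;
}
```

  With n = 11 the sketch runs two blocks, covering elements 0..7 and then 3..10.
  Rewriting elements 3..7 is harmless because input and output are separate
  buffers, so the second pass stores the same bytes again.
*/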
8444
8445static void STBIR__CODER_NAME(stbir__decode_uint8_linear)( float * decodep, int width_times_channels, void const * inputp )
8446{
8447  float STBIR_STREAMOUT_PTR( * ) decode = decodep;
8448  float * decode_end = (float*) decode + width_times_channels;
8449  unsigned char const * input = (unsigned char const*)inputp;
8450
8451  #ifdef STBIR_SIMD
8452  unsigned char const * end_input_m16 = input + width_times_channels - 16;
8453  if ( width_times_channels >= 16 )
8454  {
8455    decode_end -= 16;
8456    STBIR_NO_UNROLL_LOOP_START_INF_FOR
8457    for(;;)
8458    {
8459      #ifdef STBIR_SIMD8
8460      stbir__simdi i; stbir__simdi8 o0,o1;
8461      stbir__simdf8 of0, of1;
8462      STBIR_NO_UNROLL(decode);
8463      stbir__simdi_load( i, input );
8464      stbir__simdi8_expand_u8_to_u32( o0, o1, i );
8465      stbir__simdi8_convert_i32_to_float( of0, o0 );
8466      stbir__simdi8_convert_i32_to_float( of1, o1 );
8467      stbir__decode_simdf8_flip( of0 );
8468      stbir__decode_simdf8_flip( of1 );
8469      stbir__simdf8_store( decode + 0, of0 );
8470      stbir__simdf8_store( decode + 8, of1 );
8471      #else
8472      stbir__simdi i, o0, o1, o2, o3;
8473      stbir__simdf of0, of1, of2, of3;
8474      STBIR_NO_UNROLL(decode);
8475      stbir__simdi_load( i, input );
8476      stbir__simdi_expand_u8_to_u32( o0,o1,o2,o3,i);
8477      stbir__simdi_convert_i32_to_float( of0, o0 );
8478      stbir__simdi_convert_i32_to_float( of1, o1 );
8479      stbir__simdi_convert_i32_to_float( of2, o2 );
8480      stbir__simdi_convert_i32_to_float( of3, o3 );
8481      stbir__decode_simdf4_flip( of0 );
8482      stbir__decode_simdf4_flip( of1 );
8483      stbir__decode_simdf4_flip( of2 );
8484      stbir__decode_simdf4_flip( of3 );
8485      stbir__simdf_store( decode + 0,  of0 );
8486      stbir__simdf_store( decode + 4,  of1 );
8487      stbir__simdf_store( decode + 8,  of2 );
8488      stbir__simdf_store( decode + 12, of3 );
      #endif
8490      decode += 16;
8491      input += 16;
8492      if ( decode <= decode_end )
8493        continue;
8494      if ( decode == ( decode_end + 16 ) )
8495        break;
8496      decode = decode_end; // backup and do last couple
8497      input = end_input_m16;
8498    }
8499    return;
8500  }
8501  #endif
8502
8503  // try to do blocks of 4 when you can
8504  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8505  decode += 4;
8506  STBIR_SIMD_NO_UNROLL_LOOP_START
8507  while( decode <= decode_end )
8508  {
8509    STBIR_SIMD_NO_UNROLL(decode);
8510    decode[0-4] = ((float)(input[stbir__decode_order0]));
8511    decode[1-4] = ((float)(input[stbir__decode_order1]));
8512    decode[2-4] = ((float)(input[stbir__decode_order2]));
8513    decode[3-4] = ((float)(input[stbir__decode_order3]));
8514    decode += 4;
8515    input += 4;
8516  }
8517  decode -= 4;
8518  #endif
8519
8520  // do the remnants
8521  #if stbir__coder_min_num < 4
8522  STBIR_NO_UNROLL_LOOP_START
8523  while( decode < decode_end )
8524  {
8525    STBIR_NO_UNROLL(decode);
8526    decode[0] = ((float)(input[stbir__decode_order0]));
8527    #if stbir__coder_min_num >= 2
8528    decode[1] = ((float)(input[stbir__decode_order1]));
8529    #endif
8530    #if stbir__coder_min_num >= 3
8531    decode[2] = ((float)(input[stbir__decode_order2]));
8532    #endif
8533    decode += stbir__coder_min_num;
8534    input += stbir__coder_min_num;
8535  }
8536  #endif
8537}
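/*
  The scalar loops in this section share one idiom: the write pointer is advanced
  by four before the loop so that "decode <= decode_end" means "a whole block of
  four still fits", the stores use negative offsets to undo that bias, and the
  bias is removed before the leftover loop. The sketch below shows the idiom for
  a single channel. bytes_to_floats is a hypothetical helper, and it assumes the
  destination allocation has a few elements of slack, because the biased pointer
  can step beyond the logical end of the row.

```c
#include <assert.h>

// Hypothetical helper: widens n bytes to floats, four per iteration, then singly.
// Returns how many four-wide iterations ran.
static int bytes_to_floats( float * decode, int n, unsigned char const * input )
{
  float * decode_end = decode + n;
  int blocks = 0;

  decode += 4;                       // bias: the test below now asks "does a block fit?"
  while ( decode <= decode_end )
  {
    decode[0-4] = (float) input[0];  // negative offsets undo the bias
    decode[1-4] = (float) input[1];
    decode[2-4] = (float) input[2];
    decode[3-4] = (float) input[3];
    decode += 4;
    input += 4;
    blocks++;
  }
  decode -= 4;                       // remove the bias before the tail loop

  while ( decode < decode_end )      // zero to three leftover elements
    *decode++ = (float) *input++;

  return blocks;
}
```

  For n = 6 this runs one block of four and then two single elements.
*/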
8538
8539static void STBIR__CODER_NAME( stbir__encode_uint8_linear )( void * outputp, int width_times_channels, float const * encode )
8540{
8541  unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char *) outputp;
8542  unsigned char * end_output = ( (unsigned char *) output ) + width_times_channels;
8543
8544  #ifdef STBIR_SIMD
8545  if ( width_times_channels >= stbir__simdfX_float_count*2 )
8546  {
8547    float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
8548    end_output -= stbir__simdfX_float_count*2;
8549    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
8550    for(;;)
8551    {
8552      stbir__simdfX e0, e1;
8553      stbir__simdi i;
8554      STBIR_SIMD_NO_UNROLL(encode);
8555      stbir__simdfX_add_mem( e0, STBIR_simd_point5X, encode );
8556      stbir__simdfX_add_mem( e1, STBIR_simd_point5X, encode+stbir__simdfX_float_count );
8557      stbir__encode_simdfX_unflip( e0 );
8558      stbir__encode_simdfX_unflip( e1 );
8559      #ifdef STBIR_SIMD8
8560      stbir__simdf8_pack_to_16bytes( i, e0, e1 );
8561      stbir__simdi_store( output, i );
8562      #else
8563      stbir__simdf_pack_to_8bytes( i, e0, e1 );
8564      stbir__simdi_store2( output, i );
8565      #endif
8566      encode += stbir__simdfX_float_count*2;
8567      output += stbir__simdfX_float_count*2;
8568      if ( output <= end_output )
8569        continue;
8570      if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
8571        break;
8572      output = end_output; // backup and do last couple
8573      encode = end_encode_m8;
8574    }
8575    return;
8576  }
8577
8578  // try to do blocks of 4 when you can
8579  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8580  output += 4;
8581  STBIR_NO_UNROLL_LOOP_START
8582  while( output <= end_output )
8583  {
8584    stbir__simdf e0;
8585    stbir__simdi i0;
8586    STBIR_NO_UNROLL(encode);
8587    stbir__simdf_load( e0, encode );
8588    stbir__simdf_add( e0, STBIR__CONSTF(STBIR_simd_point5), e0 );
8589    stbir__encode_simdf4_unflip( e0 );
8590    stbir__simdf_pack_to_8bytes( i0, e0, e0 );  // only use first 4
8591    *(int*)(output-4) = stbir__simdi_to_int( i0 );
8592    output += 4;
8593    encode += 4;
8594  }
8595  output -= 4;
8596  #endif
8597
8598  #else
8599
8600  // try to do blocks of 4 when you can
8601  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8602  output += 4;
8603  while( output <= end_output )
8604  {
8605    float f;
8606    f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 255); output[0-4] = (unsigned char)f;
8607    f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 255); output[1-4] = (unsigned char)f;
8608    f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 255); output[2-4] = (unsigned char)f;
8609    f = encode[stbir__encode_order3] + 0.5f; STBIR_CLAMP(f, 0, 255); output[3-4] = (unsigned char)f;
8610    output += 4;
8611    encode += 4;
8612  }
8613  output -= 4;
8614  #endif
8615
8616  #endif
8617
8618  // do the remnants
8619  #if stbir__coder_min_num < 4
8620  STBIR_NO_UNROLL_LOOP_START
8621  while( output < end_output )
8622  {
8623    float f;
8624    STBIR_NO_UNROLL(encode);
8625    f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 255); output[0] = (unsigned char)f;
8626    #if stbir__coder_min_num >= 2
8627    f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 255); output[1] = (unsigned char)f;
8628    #endif
8629    #if stbir__coder_min_num >= 3
8630    f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 255); output[2] = (unsigned char)f;
8631    #endif
8632    output += stbir__coder_min_num;
8633    encode += stbir__coder_min_num;
8634  }
8635  #endif
8636}
8637
8638static void STBIR__CODER_NAME(stbir__decode_uint8_srgb)( float * decodep, int width_times_channels, void const * inputp )
8639{
8640  float STBIR_STREAMOUT_PTR( * ) decode = decodep;
8641  float const * decode_end = (float*) decode + width_times_channels;
8642  unsigned char const * input = (unsigned char const *)inputp;
8643
8644  // try to do blocks of 4 when you can
8645  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8646  decode += 4;
8647  while( decode <= decode_end )
8648  {
8649    decode[0-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order0 ] ];
8650    decode[1-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order1 ] ];
8651    decode[2-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order2 ] ];
8652    decode[3-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order3 ] ];
8653    decode += 4;
8654    input += 4;
8655  }
8656  decode -= 4;
8657  #endif
8658
8659  // do the remnants
8660  #if stbir__coder_min_num < 4
8661  STBIR_NO_UNROLL_LOOP_START
8662  while( decode < decode_end )
8663  {
8664    STBIR_NO_UNROLL(decode);
8665    decode[0] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order0 ] ];
8666    #if stbir__coder_min_num >= 2
8667    decode[1] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order1 ] ];
8668    #endif
8669    #if stbir__coder_min_num >= 3
8670    decode[2] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order2 ] ];
8671    #endif
8672    decode += stbir__coder_min_num;
8673    input += stbir__coder_min_num;
8674  }
8675  #endif
8676}
8677
// Clamp f to [STBIR_almost_zero, STBIR_almost_one], then shift its bit pattern right by 20
// to form the index used for the fp32_to_srgb8_tab4 lookup.
#define stbir__min_max_shift20( i, f ) \
    stbir__simdf_max( f, f, stbir_simdf_casti(STBIR__CONSTI( STBIR_almost_zero )) ); \
    stbir__simdf_min( f, f, stbir_simdf_casti(STBIR__CONSTI( STBIR_almost_one  )) ); \
    stbir__simdi_32shr( i, stbir_simdi_castf( f ), 20 );

// Linear (alpha) path: scale by STBIR_max_uint8_as_float, add 0.5, clamp to
// [0, STBIR_max_uint8_as_float], then convert to 32-bit integers.
#define stbir__scale_and_convert( i, f ) \
    stbir__simdf_madd( f, STBIR__CONSTF( STBIR_simd_point5 ), STBIR__CONSTF( STBIR_max_uint8_as_float ), f ); \
    stbir__simdf_max( f, f, stbir__simdf_zeroP() ); \
    stbir__simdf_min( f, f, STBIR__CONSTF( STBIR_max_uint8_as_float ) ); \
    stbir__simdf_convert_float_to_i32( i, f );

// Combine the table entry in i with the bits of f (shifted right by 12 and masked) using a
// 16-bit multiply-add, then shift each 32-bit lane right by 16 to get the final value.
#define stbir__linear_to_srgb_finish( i, f ) \
{ \
    stbir__simdi temp;  \
    stbir__simdi_32shr( temp, stbir_simdi_castf( f ), 12 ) ; \
    stbir__simdi_and( temp, temp, STBIR__CONSTI(STBIR_mastissa_mask) ); \
    stbir__simdi_or( temp, temp, STBIR__CONSTI(STBIR_topscale) ); \
    stbir__simdi_16madd( i, i, temp ); \
    stbir__simdi_32shr( i, i, 16 ); \
}
8698
8699#define stbir__simdi_table_lookup2( v0,v1, table ) \
8700{ \
8701  stbir__simdi_u32 temp0,temp1; \
8702  temp0.m128i_i128 = v0; \
8703  temp1.m128i_i128 = v1; \
8704  temp0.m128i_u32[0] = table[temp0.m128i_i32[0]]; temp0.m128i_u32[1] = table[temp0.m128i_i32[1]]; temp0.m128i_u32[2] = table[temp0.m128i_i32[2]]; temp0.m128i_u32[3] = table[temp0.m128i_i32[3]]; \
8705  temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \
8706  v0 = temp0.m128i_i128; \
8707  v1 = temp1.m128i_i128; \
8708}
8709
8710#define stbir__simdi_table_lookup3( v0,v1,v2, table ) \
8711{ \
8712  stbir__simdi_u32 temp0,temp1,temp2; \
8713  temp0.m128i_i128 = v0; \
8714  temp1.m128i_i128 = v1; \
8715  temp2.m128i_i128 = v2; \
8716  temp0.m128i_u32[0] = table[temp0.m128i_i32[0]]; temp0.m128i_u32[1] = table[temp0.m128i_i32[1]]; temp0.m128i_u32[2] = table[temp0.m128i_i32[2]]; temp0.m128i_u32[3] = table[temp0.m128i_i32[3]]; \
8717  temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \
8718  temp2.m128i_u32[0] = table[temp2.m128i_i32[0]]; temp2.m128i_u32[1] = table[temp2.m128i_i32[1]]; temp2.m128i_u32[2] = table[temp2.m128i_i32[2]]; temp2.m128i_u32[3] = table[temp2.m128i_i32[3]]; \
8719  v0 = temp0.m128i_i128; \
8720  v1 = temp1.m128i_i128; \
8721  v2 = temp2.m128i_i128; \
8722}
8723
8724#define stbir__simdi_table_lookup4( v0,v1,v2,v3, table ) \
8725{ \
8726  stbir__simdi_u32 temp0,temp1,temp2,temp3; \
8727  temp0.m128i_i128 = v0; \
8728  temp1.m128i_i128 = v1; \
8729  temp2.m128i_i128 = v2; \
8730  temp3.m128i_i128 = v3; \
8731  temp0.m128i_u32[0] = table[temp0.m128i_i32[0]]; temp0.m128i_u32[1] = table[temp0.m128i_i32[1]]; temp0.m128i_u32[2] = table[temp0.m128i_i32[2]]; temp0.m128i_u32[3] = table[temp0.m128i_i32[3]]; \
8732  temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \
8733  temp2.m128i_u32[0] = table[temp2.m128i_i32[0]]; temp2.m128i_u32[1] = table[temp2.m128i_i32[1]]; temp2.m128i_u32[2] = table[temp2.m128i_i32[2]]; temp2.m128i_u32[3] = table[temp2.m128i_i32[3]]; \
8734  temp3.m128i_u32[0] = table[temp3.m128i_i32[0]]; temp3.m128i_u32[1] = table[temp3.m128i_i32[1]]; temp3.m128i_u32[2] = table[temp3.m128i_i32[2]]; temp3.m128i_u32[3] = table[temp3.m128i_i32[3]]; \
8735  v0 = temp0.m128i_i128; \
8736  v1 = temp1.m128i_i128; \
8737  v2 = temp2.m128i_i128; \
8738  v3 = temp3.m128i_i128; \
8739}
8740
8741static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int width_times_channels, float const * encode )
8742{
8743  unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char*) outputp;
8744  unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels;
8745
8746  #ifdef STBIR_SIMD
8747
8748  if ( width_times_channels >= 16 )
8749  {
8750    float const * end_encode_m16 = encode + width_times_channels - 16;
8751    end_output -= 16;
8752    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
8753    for(;;)
8754    {
8755      stbir__simdf f0, f1, f2, f3;
8756      stbir__simdi i0, i1, i2, i3;
8757      STBIR_SIMD_NO_UNROLL(encode);
8758
8759      stbir__simdf_load4_transposed( f0, f1, f2, f3, encode );
8760
8761      stbir__min_max_shift20( i0, f0 );
8762      stbir__min_max_shift20( i1, f1 );
8763      stbir__min_max_shift20( i2, f2 );
8764      stbir__min_max_shift20( i3, f3 );
8765
8766      stbir__simdi_table_lookup4( i0, i1, i2, i3, ( fp32_to_srgb8_tab4 - (127-13)*8 ) );
8767
8768      stbir__linear_to_srgb_finish( i0, f0 );
8769      stbir__linear_to_srgb_finish( i1, f1 );
8770      stbir__linear_to_srgb_finish( i2, f2 );
8771      stbir__linear_to_srgb_finish( i3, f3 );
8772
8773      stbir__interleave_pack_and_store_16_u8( output,  STBIR_strs_join1(i, ,stbir__encode_order0), STBIR_strs_join1(i, ,stbir__encode_order1), STBIR_strs_join1(i, ,stbir__encode_order2), STBIR_strs_join1(i, ,stbir__encode_order3) );
8774
8775      encode += 16;
8776      output += 16;
8777      if ( output <= end_output )
8778        continue;
8779      if ( output == ( end_output + 16 ) )
8780        break;
8781      output = end_output; // backup and do last couple
8782      encode = end_encode_m16;
8783    }
8784    return;
8785  }
8786  #endif
8787
8788  // try to do blocks of 4 when you can
8789  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8790  output += 4;
8791  STBIR_SIMD_NO_UNROLL_LOOP_START
8792  while ( output <= end_output )
8793  {
8794    STBIR_SIMD_NO_UNROLL(encode);
8795
8796    output[0-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order0] );
8797    output[1-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order1] );
8798    output[2-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order2] );
8799    output[3-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order3] );
8800
8801    output += 4;
8802    encode += 4;
8803  }
8804  output -= 4;
8805  #endif
8806
8807  // do the remnants
8808  #if stbir__coder_min_num < 4
8809  STBIR_NO_UNROLL_LOOP_START
8810  while( output < end_output )
8811  {
8812    STBIR_NO_UNROLL(encode);
8813    output[0] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order0] );
8814    #if stbir__coder_min_num >= 2
8815    output[1] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order1] );
8816    #endif
8817    #if stbir__coder_min_num >= 3
8818    output[2] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order2] );
8819    #endif
8820    output += stbir__coder_min_num;
8821    encode += stbir__coder_min_num;
8822  }
8823  #endif
8824}
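/*
  The stbir__decode_orderN and stbir__encode_orderN names used by these coders
  are compile-time channel indices, so reordering channels adds no run-time
  branching. The sketch below illustrates the idea with made-up values: ORDER0
  to ORDER3 describe a BGRA byte layout and are assumptions for this example,
  since the real values are defined outside this section. decode_pixel,
  encode_pixel and round_to_byte are hypothetical names.

```c
#include <assert.h>

// Hypothetical channel order: bytes are stored B,G,R,A, floats are kept as R,G,B,A.
#define ORDER0 2
#define ORDER1 1
#define ORDER2 0
#define ORDER3 3

// Reads one pixel, picking each source byte through the order constants.
static void decode_pixel( float * decode, unsigned char const * input )
{
  decode[0] = (float) input[ORDER0];
  decode[1] = (float) input[ORDER1];
  decode[2] = (float) input[ORDER2];
  decode[3] = (float) input[ORDER3];
}

// Rounds to nearest and clamps to the 0..255 byte range.
static unsigned char round_to_byte( float f )
{
  f += 0.5f;
  if ( f < 0.0f ) f = 0.0f;
  if ( f > 255.0f ) f = 255.0f;
  return (unsigned char) f;
}

// Writes one pixel. Swapping channels 0 and 2 is its own inverse, so for this
// particular layout the same constants undo the reordering done by decode_pixel.
static void encode_pixel( unsigned char * output, float const * encode )
{
  output[0] = round_to_byte( encode[ORDER0] );
  output[1] = round_to_byte( encode[ORDER1] );
  output[2] = round_to_byte( encode[ORDER2] );
  output[3] = round_to_byte( encode[ORDER3] );
}
```

  Decoding the bytes {10, 20, 30, 40} gives the floats {30, 20, 10, 40}, and
  encoding those floats restores the original byte order.
*/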
8825
8826#if ( stbir__coder_min_num == 4 ) || ( ( stbir__coder_min_num == 1 ) && ( !defined(stbir__decode_swizzle) ) )
8827
// Decodes four-channel 8-bit pixels: the three color channels go through the
// sRGB-to-linear table and alpha is scaled linearly. The do/while body runs at
// least once, so the caller must pass at least one pixel.
static void STBIR__CODER_NAME(stbir__decode_uint8_srgb4_linearalpha)( float * decodep, int width_times_channels, void const * inputp )
{
  float STBIR_STREAMOUT_PTR( * ) decode = decodep;
  float const * decode_end = (float*) decode + width_times_channels;
  unsigned char const * input = (unsigned char const *)inputp;
  do {
    decode[0] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0] ];
    decode[1] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order1] ];
    decode[2] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order2] ];
    decode[3] = ( (float) input[stbir__decode_order3] ) * stbir__max_uint8_as_float_inverted;
    input += 4;
    decode += 4;
  } while( decode < decode_end );
}
8842
8843
8844static void STBIR__CODER_NAME( stbir__encode_uint8_srgb4_linearalpha )( void * outputp, int width_times_channels, float const * encode )
8845{
8846  unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char*) outputp;
8847  unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels;
8848
8849  #ifdef STBIR_SIMD
8850
8851  if ( width_times_channels >= 16 )
8852  {
8853    float const * end_encode_m16 = encode + width_times_channels - 16;
8854    end_output -= 16;
8855    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
8856    for(;;)
8857    {
8858      stbir__simdf f0, f1, f2, f3;
8859      stbir__simdi i0, i1, i2, i3;
8860
8861      STBIR_SIMD_NO_UNROLL(encode);
8862      stbir__simdf_load4_transposed( f0, f1, f2, f3, encode );
8863
8864      stbir__min_max_shift20( i0, f0 );
8865      stbir__min_max_shift20( i1, f1 );
8866      stbir__min_max_shift20( i2, f2 );
8867      stbir__scale_and_convert( i3, f3 );
8868
8869      stbir__simdi_table_lookup3( i0, i1, i2, ( fp32_to_srgb8_tab4 - (127-13)*8 ) );
8870
8871      stbir__linear_to_srgb_finish( i0, f0 );
8872      stbir__linear_to_srgb_finish( i1, f1 );
8873      stbir__linear_to_srgb_finish( i2, f2 );
8874
8875      stbir__interleave_pack_and_store_16_u8( output,  STBIR_strs_join1(i, ,stbir__encode_order0), STBIR_strs_join1(i, ,stbir__encode_order1), STBIR_strs_join1(i, ,stbir__encode_order2), STBIR_strs_join1(i, ,stbir__encode_order3) );
8876
8877      output += 16;
8878      encode += 16;
8879
8880      if ( output <= end_output )
8881        continue;
8882      if ( output == ( end_output + 16 ) )
8883        break;
8884      output = end_output; // backup and do last couple
8885      encode = end_encode_m16;
8886    }
8887    return;
8888  }
8889  #endif
8890
8891  STBIR_SIMD_NO_UNROLL_LOOP_START
8892  do {
8893    float f;
8894    STBIR_SIMD_NO_UNROLL(encode);
8895
8896    output[stbir__decode_order0] = stbir__linear_to_srgb_uchar( encode[0] );
8897    output[stbir__decode_order1] = stbir__linear_to_srgb_uchar( encode[1] );
8898    output[stbir__decode_order2] = stbir__linear_to_srgb_uchar( encode[2] );
8899
8900    f = encode[3] * stbir__max_uint8_as_float + 0.5f;
8901    STBIR_CLAMP(f, 0, 255);
8902    output[stbir__decode_order3] = (unsigned char) f;
8903
8904    output += 4;
8905    encode += 4;
8906  } while( output < end_output );
8907}
8908
8909#endif
8910
8911#if ( stbir__coder_min_num == 2 ) || ( ( stbir__coder_min_num == 1 ) && ( !defined(stbir__decode_swizzle) ) )
8912
// Decodes two-channel 8-bit pixels: the color channel goes through the
// sRGB-to-linear table and the alpha channel is scaled linearly.
static void STBIR__CODER_NAME(stbir__decode_uint8_srgb2_linearalpha)( float * decodep, int width_times_channels, void const * inputp )
{
  float STBIR_STREAMOUT_PTR( * ) decode = decodep;
  float const * decode_end = (float*) decode + width_times_channels;
  unsigned char const * input = (unsigned char const *)inputp;
  decode += 4;
  while( decode <= decode_end )
  {
    // two pixels per iteration
    decode[0-4] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0] ];
    decode[1-4] = ( (float) input[stbir__decode_order1] ) * stbir__max_uint8_as_float_inverted;
    decode[2-4] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0+2] ];
    decode[3-4] = ( (float) input[stbir__decode_order1+2] ) * stbir__max_uint8_as_float_inverted;
    input += 4;
    decode += 4;
  }
  decode -= 4;
  if( decode < decode_end )
  {
    // one pixel is left over when the pixel count is odd
    decode[0] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0] ];
    decode[1] = ( (float) input[stbir__decode_order1] ) * stbir__max_uint8_as_float_inverted;
  }
}
8935
8936static void STBIR__CODER_NAME( stbir__encode_uint8_srgb2_linearalpha )( void * outputp, int width_times_channels, float const * encode )
8937{
8938  unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char*) outputp;
8939  unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels;
8940
8941  #ifdef STBIR_SIMD
8942
8943  if ( width_times_channels >= 16 )
8944  {
8945    float const * end_encode_m16 = encode + width_times_channels - 16;
8946    end_output -= 16;
8947    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
8948    for(;;)
8949    {
8950      stbir__simdf f0, f1, f2, f3;
8951      stbir__simdi i0, i1, i2, i3;
8952
8953      STBIR_SIMD_NO_UNROLL(encode);
8954      stbir__simdf_load4_transposed( f0, f1, f2, f3, encode );
8955
8956      stbir__min_max_shift20( i0, f0 );
8957      stbir__scale_and_convert( i1, f1 );
8958      stbir__min_max_shift20( i2, f2 );
8959      stbir__scale_and_convert( i3, f3 );
8960
8961      stbir__simdi_table_lookup2( i0, i2, ( fp32_to_srgb8_tab4 - (127-13)*8 ) );
8962
8963      stbir__linear_to_srgb_finish( i0, f0 );
8964      stbir__linear_to_srgb_finish( i2, f2 );
8965
8966      stbir__interleave_pack_and_store_16_u8( output,  STBIR_strs_join1(i, ,stbir__encode_order0), STBIR_strs_join1(i, ,stbir__encode_order1), STBIR_strs_join1(i, ,stbir__encode_order2), STBIR_strs_join1(i, ,stbir__encode_order3) );
8967
8968      output += 16;
8969      encode += 16;
8970      if ( output <= end_output )
8971        continue;
8972      if ( output == ( end_output + 16 ) )
8973        break;
8974      output = end_output; // backup and do last couple
8975      encode = end_encode_m16;
8976    }
8977    return;
8978  }
8979  #endif
8980
8981  STBIR_SIMD_NO_UNROLL_LOOP_START
8982  do {
8983    float f;
8984    STBIR_SIMD_NO_UNROLL(encode);
8985
8986    output[stbir__decode_order0] = stbir__linear_to_srgb_uchar( encode[0] );
8987
8988    f = encode[1] * stbir__max_uint8_as_float + 0.5f;
8989    STBIR_CLAMP(f, 0, 255);
8990    output[stbir__decode_order1] = (unsigned char) f;
8991
8992    output += 2;
8993    encode += 2;
8994  } while( output < end_output );
8995}
8996
8997#endif
8998
8999static void STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)( float * decodep, int width_times_channels, void const * inputp )
9000{
9001  float STBIR_STREAMOUT_PTR( * ) decode = decodep;
9002  float * decode_end = (float*) decode + width_times_channels;
9003  unsigned short const * input = (unsigned short const *)inputp;
9004
9005  #ifdef STBIR_SIMD
9006  unsigned short const * end_input_m8 = input + width_times_channels - 8;
9007  if ( width_times_channels >= 8 )
9008  {
9009    decode_end -= 8;
9010    STBIR_NO_UNROLL_LOOP_START_INF_FOR
9011    for(;;)
9012    {
9013      #ifdef STBIR_SIMD8
9014      stbir__simdi i; stbir__simdi8 o;
9015      stbir__simdf8 of;
9016      STBIR_NO_UNROLL(decode);
9017      stbir__simdi_load( i, input );
9018      stbir__simdi8_expand_u16_to_u32( o, i );
9019      stbir__simdi8_convert_i32_to_float( of, o );
9020      stbir__simdf8_mult( of, of, STBIR_max_uint16_as_float_inverted8);
9021      stbir__decode_simdf8_flip( of );
9022      stbir__simdf8_store( decode + 0, of );
9023      #else
9024      stbir__simdi i, o0, o1;
9025      stbir__simdf of0, of1;
9026      STBIR_NO_UNROLL(decode);
9027      stbir__simdi_load( i, input );
9028      stbir__simdi_expand_u16_to_u32( o0,o1,i );
9029      stbir__simdi_convert_i32_to_float( of0, o0 );
9030      stbir__simdi_convert_i32_to_float( of1, o1 );
9031      stbir__simdf_mult( of0, of0, STBIR__CONSTF(STBIR_max_uint16_as_float_inverted) );
9032      stbir__simdf_mult( of1, of1, STBIR__CONSTF(STBIR_max_uint16_as_float_inverted));
9033      stbir__decode_simdf4_flip( of0 );
9034      stbir__decode_simdf4_flip( of1 );
9035      stbir__simdf_store( decode + 0,  of0 );
9036      stbir__simdf_store( decode + 4,  of1 );
9037      #endif
9038      decode += 8;
9039      input += 8;
9040      if ( decode <= decode_end )
9041        continue;
9042      if ( decode == ( decode_end + 8 ) )
9043        break;
9044      decode = decode_end; // backup and do last couple
9045      input = end_input_m8;
9046    }
9047    return;
9048  }
9049  #endif
9050
9051  // try to do blocks of 4 when you can
9052  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
9053  decode += 4;
9054  STBIR_SIMD_NO_UNROLL_LOOP_START
9055  while( decode <= decode_end )
9056  {
9057    STBIR_SIMD_NO_UNROLL(decode);
9058    decode[0-4] = ((float)(input[stbir__decode_order0])) * stbir__max_uint16_as_float_inverted;
9059    decode[1-4] = ((float)(input[stbir__decode_order1])) * stbir__max_uint16_as_float_inverted;
9060    decode[2-4] = ((float)(input[stbir__decode_order2])) * stbir__max_uint16_as_float_inverted;
9061    decode[3-4] = ((float)(input[stbir__decode_order3])) * stbir__max_uint16_as_float_inverted;
9062    decode += 4;
9063    input += 4;
9064  }
9065  decode -= 4;
9066  #endif
9067
9068  // do the remnants
9069  #if stbir__coder_min_num < 4
9070  STBIR_NO_UNROLL_LOOP_START
9071  while( decode < decode_end )
9072  {
9073    STBIR_NO_UNROLL(decode);
9074    decode[0] = ((float)(input[stbir__decode_order0])) * stbir__max_uint16_as_float_inverted;
9075    #if stbir__coder_min_num >= 2
9076    decode[1] = ((float)(input[stbir__decode_order1])) * stbir__max_uint16_as_float_inverted;
9077    #endif
9078    #if stbir__coder_min_num >= 3
9079    decode[2] = ((float)(input[stbir__decode_order2])) * stbir__max_uint16_as_float_inverted;
9080    #endif
9081    decode += stbir__coder_min_num;
9082    input += stbir__coder_min_num;
9083  }
9084  #endif
9085}
9086
9087
9088static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * outputp, int width_times_channels, float const * encode )
9089{
9090  unsigned short STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned short*) outputp;
9091  unsigned short * end_output = ( (unsigned short*) output ) + width_times_channels;
9092
9093  #ifdef STBIR_SIMD
9094  {
9095    if ( width_times_channels >= stbir__simdfX_float_count*2 )
9096    {
9097      float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
9098      end_output -= stbir__simdfX_float_count*2;
9099      STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
9100      for(;;)
9101      {
9102        stbir__simdfX e0, e1;
9103        stbir__simdiX i;
9104        STBIR_SIMD_NO_UNROLL(encode);
9105        stbir__simdfX_madd_mem( e0, STBIR_simd_point5X, STBIR_max_uint16_as_floatX, encode );
        stbir__simdfX_madd_mem( e1, STBIR_simd_point5X, STBIR_max_uint16_as_floatX, encode+stbir__simdfX_float_count );
        stbir__encode_simdfX_unflip( e0 );
        stbir__encode_simdfX_unflip( e1 );
        stbir__simdfX_pack_to_words( i, e0, e1 );
        stbir__simdiX_store( output, i );
        encode += stbir__simdfX_float_count*2;
        output += stbir__simdfX_float_count*2;
        if ( output <= end_output )
          continue;
        if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
          break;
        output = end_output;     // backup and do last couple
        encode = end_encode_m8;
      }
      return;
    }
  }

  // try to do blocks of 4 when you can
  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
  output += 4;
  STBIR_NO_UNROLL_LOOP_START
  while( output <= end_output )
  {
    stbir__simdf e;
    stbir__simdi i;
    STBIR_NO_UNROLL(encode);
    stbir__simdf_load( e, encode );
    stbir__simdf_madd( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), e );
    stbir__encode_simdf4_unflip( e );
    stbir__simdf_pack_to_8words( i, e, e );  // only use first 4
    stbir__simdi_store2( output-4, i );
    output += 4;
    encode += 4;
  }
  output -= 4;
  #endif

  // do the remnants
  #if stbir__coder_min_num < 4
  STBIR_NO_UNROLL_LOOP_START
  while( output < end_output )
  {
    stbir__simdf e;
    STBIR_NO_UNROLL(encode);
    stbir__simdf_madd1_mem( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), encode+stbir__encode_order0 ); output[0] = stbir__simdf_convert_float_to_short( e );
    #if stbir__coder_min_num >= 2
    stbir__simdf_madd1_mem( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), encode+stbir__encode_order1 ); output[1] = stbir__simdf_convert_float_to_short( e );
    #endif
    #if stbir__coder_min_num >= 3
    stbir__simdf_madd1_mem( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), encode+stbir__encode_order2 ); output[2] = stbir__simdf_convert_float_to_short( e );
    #endif
    output += stbir__coder_min_num;
    encode += stbir__coder_min_num;
  }
  #endif

  #else

  // try to do blocks of 4 when you can
  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
  output += 4;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  while( output <= end_output )
  {
    float f;
    STBIR_SIMD_NO_UNROLL(encode);
    f = encode[stbir__encode_order0] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0-4] = (unsigned short)f;
    f = encode[stbir__encode_order1] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1-4] = (unsigned short)f;
    f = encode[stbir__encode_order2] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2-4] = (unsigned short)f;
    f = encode[stbir__encode_order3] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[3-4] = (unsigned short)f;
    output += 4;
    encode += 4;
  }
  output -= 4;
  #endif

  // do the remnants
  #if stbir__coder_min_num < 4
  STBIR_NO_UNROLL_LOOP_START
  while( output < end_output )
  {
    float f;
    STBIR_NO_UNROLL(encode);
    f = encode[stbir__encode_order0] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0] = (unsigned short)f;
    #if stbir__coder_min_num >= 2
    f = encode[stbir__encode_order1] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1] = (unsigned short)f;
    #endif
    #if stbir__coder_min_num >= 3
    f = encode[stbir__encode_order2] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2] = (unsigned short)f;
    #endif
    output += stbir__coder_min_num;
    encode += stbir__coder_min_num;
  }
  #endif
  #endif
}
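
/*
   EXAMPLE (illustrative only, not part of the library): the scalar fallback above
   turns each float into a 16-bit value by scaling, adding 0.5 so that truncation
   rounds to nearest, clamping to [0,65535] and casting. The sketch below isolates
   that arithmetic. The name example_encode_unorm16 is hypothetical, and 65535.0f
   is assumed to be the value of stbir__max_uint16_as_float.

```c
// Hypothetical helper mirroring "f = v * 65535 + 0.5; clamp; truncate".
static unsigned short example_encode_unorm16( float v )
{
  float f = v * 65535.0f + 0.5f;      // scale, then bias so truncation rounds to nearest
  if ( f < 0.0f ) f = 0.0f;           // clamp low
  if ( f > 65535.0f ) f = 65535.0f;   // clamp high
  return (unsigned short) f;          // truncate toward zero
}
```

   The clamp comes after the +0.5 bias, so an input of exactly 1.0 (65535.5 after
   the bias) still lands on 65535 instead of wrapping.
*/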

static void STBIR__CODER_NAME(stbir__decode_uint16_linear)( float * decodep, int width_times_channels, void const * inputp )
{
  float STBIR_STREAMOUT_PTR( * ) decode = decodep;
  float * decode_end = (float*) decode + width_times_channels;
  unsigned short const * input = (unsigned short const *)inputp;

  #ifdef STBIR_SIMD
  unsigned short const * end_input_m8 = input + width_times_channels - 8;
  if ( width_times_channels >= 8 )
  {
    decode_end -= 8;
    STBIR_NO_UNROLL_LOOP_START_INF_FOR
    for(;;)
    {
      #ifdef STBIR_SIMD8
      stbir__simdi i; stbir__simdi8 o;
      stbir__simdf8 of;
      STBIR_NO_UNROLL(decode);
      stbir__simdi_load( i, input );
      stbir__simdi8_expand_u16_to_u32( o, i );
      stbir__simdi8_convert_i32_to_float( of, o );
      stbir__decode_simdf8_flip( of );
      stbir__simdf8_store( decode + 0, of );
      #else
      stbir__simdi i, o0, o1;
      stbir__simdf of0, of1;
      STBIR_NO_UNROLL(decode);
      stbir__simdi_load( i, input );
      stbir__simdi_expand_u16_to_u32( o0, o1, i );
      stbir__simdi_convert_i32_to_float( of0, o0 );
      stbir__simdi_convert_i32_to_float( of1, o1 );
      stbir__decode_simdf4_flip( of0 );
      stbir__decode_simdf4_flip( of1 );
      stbir__simdf_store( decode + 0,  of0 );
      stbir__simdf_store( decode + 4,  of1 );
      #endif
      decode += 8;
      input += 8;
      if ( decode <= decode_end )
        continue;
      if ( decode == ( decode_end + 8 ) )
        break;
      decode = decode_end; // backup and do last couple
      input = end_input_m8;
    }
    return;
  }
  #endif

  // try to do blocks of 4 when you can
  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
  decode += 4;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  while( decode <= decode_end )
  {
    STBIR_SIMD_NO_UNROLL(decode);
    decode[0-4] = ((float)(input[stbir__decode_order0]));
    decode[1-4] = ((float)(input[stbir__decode_order1]));
    decode[2-4] = ((float)(input[stbir__decode_order2]));
    decode[3-4] = ((float)(input[stbir__decode_order3]));
    decode += 4;
    input += 4;
  }
  decode -= 4;
  #endif

  // do the remnants
  #if stbir__coder_min_num < 4
  STBIR_NO_UNROLL_LOOP_START
  while( decode < decode_end )
  {
    STBIR_NO_UNROLL(decode);
    decode[0] = ((float)(input[stbir__decode_order0]));
    #if stbir__coder_min_num >= 2
    decode[1] = ((float)(input[stbir__decode_order1]));
    #endif
    #if stbir__coder_min_num >= 3
    decode[2] = ((float)(input[stbir__decode_order2]));
    #endif
    decode += stbir__coder_min_num;
    input += stbir__coder_min_num;
  }
  #endif
}
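
/*
   EXAMPLE (illustrative only, not part of the library): the SIMD loops in these
   coders handle a row in fixed-size blocks and, when the length is not a multiple
   of the block size, "back up" so that the final block ends exactly at the end of
   the row, recomputing a few elements. The sketch below shows the same idea in
   plain C with a block of 8. All names are hypothetical. It assumes separate input
   and output buffers, so recomputing the overlapped elements gives the same values.

```c
#define EXAMPLE_BLOCK 8

// Convert exactly EXAMPLE_BLOCK values (stands in for one SIMD iteration).
static void example_convert_block( float * out, unsigned short const * in )
{
  int i;
  for ( i = 0 ; i < EXAMPLE_BLOCK ; i++ )
    out[i] = (float) in[i];
}

// Convert n values using whole blocks only, overlapping the last block if needed.
static void example_u16_to_float( float * out, unsigned short const * in, int n )
{
  int i;
  if ( n < EXAMPLE_BLOCK )
  {
    // too short for even one block: plain scalar loop
    for ( i = 0 ; i < n ; i++ )
      out[i] = (float) in[i];
    return;
  }
  for ( i = 0 ; i + EXAMPLE_BLOCK <= n ; i += EXAMPLE_BLOCK )
    example_convert_block( out + i, in + i );
  if ( i != n )
    // back up so the last block ends exactly at n (overlaps the previous block)
    example_convert_block( out + n - EXAMPLE_BLOCK, in + n - EXAMPLE_BLOCK );
}
```

   The benefit is that no scalar tail loop is needed once the row is at least one
   block long; the cost is a handful of redundant conversions.
*/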

static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int width_times_channels, float const * encode )
{
  unsigned short STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned short*) outputp;
  unsigned short * end_output = ( (unsigned short*) output ) + width_times_channels;

  #ifdef STBIR_SIMD
  {
    if ( width_times_channels >= stbir__simdfX_float_count*2 )
    {
      float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
      end_output -= stbir__simdfX_float_count*2;
      STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
      for(;;)
      {
        stbir__simdfX e0, e1;
        stbir__simdiX i;
        STBIR_SIMD_NO_UNROLL(encode);
        stbir__simdfX_add_mem( e0, STBIR_simd_point5X, encode );
        stbir__simdfX_add_mem( e1, STBIR_simd_point5X, encode+stbir__simdfX_float_count );
        stbir__encode_simdfX_unflip( e0 );
        stbir__encode_simdfX_unflip( e1 );
        stbir__simdfX_pack_to_words( i, e0, e1 );
        stbir__simdiX_store( output, i );
        encode += stbir__simdfX_float_count*2;
        output += stbir__simdfX_float_count*2;
        if ( output <= end_output )
          continue;
        if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
          break;
        output = end_output; // backup and do last couple
        encode = end_encode_m8;
      }
      return;
    }
  }

  // try to do blocks of 4 when you can
  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
  output += 4;
  STBIR_NO_UNROLL_LOOP_START
  while( output <= end_output )
  {
    stbir__simdf e;
    stbir__simdi i;
    STBIR_NO_UNROLL(encode);
    stbir__simdf_load( e, encode );
    stbir__simdf_add( e, STBIR__CONSTF(STBIR_simd_point5), e );
    stbir__encode_simdf4_unflip( e );
    stbir__simdf_pack_to_8words( i, e, e );  // only use first 4
    stbir__simdi_store2( output-4, i );
    output += 4;
    encode += 4;
  }
  output -= 4;
  #endif

  #else

  // try to do blocks of 4 when you can
  #if  stbir__coder_min_num != 3 // doesn't divide cleanly by four
  output += 4;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  while( output <= end_output )
  {
    float f;
    STBIR_SIMD_NO_UNROLL(encode);
    f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0-4] = (unsigned short)f;
    f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1-4] = (unsigned short)f;
    f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2-4] = (unsigned short)f;
    f = encode[stbir__encode_order3] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[3-4] = (unsigned short)f;
    output += 4;
    encode += 4;
  }
  output -= 4;
  #endif

  #endif

  // do the remnants
  #if stbir__coder_min_num < 4
  STBIR_NO_UNROLL_LOOP_START
  while( output < end_output )
  {
    float f;
    STBIR_NO_UNROLL(encode);
    f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0] = (unsigned short)f;
    #if stbir__coder_min_num >= 2
    f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1] = (unsigned short)f;
    #endif
    #if stbir__coder_min_num >= 3
    f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2] = (unsigned short)f;
    #endif
    output += stbir__coder_min_num;
    encode += stbir__coder_min_num;
  }
  #endif
}

static void STBIR__CODER_NAME(stbir__decode_half_float_linear)( float * decodep, int width_times_channels, void const * inputp )
{
  float STBIR_STREAMOUT_PTR( * ) decode = decodep;
  float * decode_end = (float*) decode + width_times_channels;
  stbir__FP16 const * input = (stbir__FP16 const *)inputp;

  #ifdef STBIR_SIMD
  if ( width_times_channels >= 8 )
  {
    stbir__FP16 const * end_input_m8 = input + width_times_channels - 8;
    decode_end -= 8;
    STBIR_NO_UNROLL_LOOP_START_INF_FOR
    for(;;)
    {
      STBIR_NO_UNROLL(decode);

      stbir__half_to_float_SIMD( decode, input );
      #ifdef stbir__decode_swizzle
      #ifdef STBIR_SIMD8
      {
        stbir__simdf8 of;
        stbir__simdf8_load( of, decode );
        stbir__decode_simdf8_flip( of );
        stbir__simdf8_store( decode, of );
      }
      #else
      {
        stbir__simdf of0,of1;
        stbir__simdf_load( of0, decode );
        stbir__simdf_load( of1, decode+4 );
        stbir__decode_simdf4_flip( of0 );
        stbir__decode_simdf4_flip( of1 );
        stbir__simdf_store( decode, of0 );
        stbir__simdf_store( decode+4, of1 );
      }
      #endif
      #endif
      decode += 8;
      input += 8;
      if ( decode <= decode_end )
        continue;
      if ( decode == ( decode_end + 8 ) )
        break;
      decode = decode_end; // backup and do last couple
      input = end_input_m8;
    }
    return;
  }
  #endif

  // try to do blocks of 4 when you can
  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
  decode += 4;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  while( decode <= decode_end )
  {
    STBIR_SIMD_NO_UNROLL(decode);
    decode[0-4] = stbir__half_to_float(input[stbir__decode_order0]);
    decode[1-4] = stbir__half_to_float(input[stbir__decode_order1]);
    decode[2-4] = stbir__half_to_float(input[stbir__decode_order2]);
    decode[3-4] = stbir__half_to_float(input[stbir__decode_order3]);
    decode += 4;
    input += 4;
  }
  decode -= 4;
  #endif

  // do the remnants
  #if stbir__coder_min_num < 4
  STBIR_NO_UNROLL_LOOP_START
  while( decode < decode_end )
  {
    STBIR_NO_UNROLL(decode);
    decode[0] = stbir__half_to_float(input[stbir__decode_order0]);
    #if stbir__coder_min_num >= 2
    decode[1] = stbir__half_to_float(input[stbir__decode_order1]);
    #endif
    #if stbir__coder_min_num >= 3
    decode[2] = stbir__half_to_float(input[stbir__decode_order2]);
    #endif
    decode += stbir__coder_min_num;
    input += stbir__coder_min_num;
  }
  #endif
}
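
/*
   EXAMPLE (illustrative only, not part of the library): the decoder above relies on
   stbir__half_to_float, which is defined elsewhere in this file. For readers who
   want to see what such a conversion involves, here is a minimal portable sketch
   of IEEE 754 binary16 to binary32 conversion. It is NOT the library's
   implementation; the name example_half_to_float is hypothetical, and the input is
   assumed to be the raw 16 bits of the half value.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper: expand raw binary16 bits into a float.
static float example_half_to_float( uint16_t h )
{
  uint32_t sign = (uint32_t)( h & 0x8000u ) << 16;  // sign moves to bit 31
  uint32_t exp  = ( h >> 10 ) & 0x1fu;              // 5-bit exponent, bias 15
  uint32_t man  = h & 0x3ffu;                       // 10-bit mantissa
  uint32_t bits;
  float f;

  if ( exp == 0 )
  {
    if ( man == 0 )
      bits = sign;                                  // signed zero
    else
    {
      // subnormal half: shift until the implicit leading 1 appears
      int e = -1;
      do { e++; man <<= 1; } while ( ( man & 0x400u ) == 0 );
      bits = sign | ( (uint32_t)( 127 - 15 - e ) << 23 ) | ( ( man & 0x3ffu ) << 13 );
    }
  }
  else if ( exp == 31 )
    bits = sign | 0x7f800000u | ( man << 13 );      // infinity or NaN
  else
    bits = sign | ( ( exp + 112u ) << 23 ) | ( man << 13 );  // rebias 15 -> 127

  memcpy( &f, &bits, sizeof( f ) );                 // type-pun without aliasing issues
  return f;
}
```

   Every half value is exactly representable as a float, so this direction needs no
   rounding; the float-to-half direction does.
*/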

static void STBIR__CODER_NAME( stbir__encode_half_float_linear )( void * outputp, int width_times_channels, float const * encode )
{
  stbir__FP16 STBIR_SIMD_STREAMOUT_PTR( * ) output = (stbir__FP16*) outputp;
  stbir__FP16 * end_output = ( (stbir__FP16*) output ) + width_times_channels;

  #ifdef STBIR_SIMD
  if ( width_times_channels >= 8 )
  {
    float const * end_encode_m8 = encode + width_times_channels - 8;
    end_output -= 8;
    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
    for(;;)
    {
      STBIR_SIMD_NO_UNROLL(encode);
      #ifdef stbir__decode_swizzle
      #ifdef STBIR_SIMD8
      {
        stbir__simdf8 of;
        stbir__simdf8_load( of, encode );
        stbir__encode_simdf8_unflip( of );
        stbir__float_to_half_SIMD( output, (float*)&of );
      }
      #else
      {
        stbir__simdf of[2];
        stbir__simdf_load( of[0], encode );
        stbir__simdf_load( of[1], encode+4 );
        stbir__encode_simdf4_unflip( of[0] );
        stbir__encode_simdf4_unflip( of[1] );
        stbir__float_to_half_SIMD( output, (float*)of );
      }
      #endif
      #else
      stbir__float_to_half_SIMD( output, encode );
      #endif
      encode += 8;
      output += 8;
      if ( output <= end_output )
        continue;
      if ( output == ( end_output + 8 ) )
        break;
      output = end_output; // backup and do last couple
      encode = end_encode_m8;
    }
    return;
  }
  #endif

  // try to do blocks of 4 when you can
  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
  output += 4;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  while( output <= end_output )
  {
    STBIR_SIMD_NO_UNROLL(output);
    output[0-4] = stbir__float_to_half(encode[stbir__encode_order0]);
    output[1-4] = stbir__float_to_half(encode[stbir__encode_order1]);
    output[2-4] = stbir__float_to_half(encode[stbir__encode_order2]);
    output[3-4] = stbir__float_to_half(encode[stbir__encode_order3]);
    output += 4;
    encode += 4;
  }
  output -= 4;
  #endif

  // do the remnants
  #if stbir__coder_min_num < 4
  STBIR_NO_UNROLL_LOOP_START
  while( output < end_output )
  {
    STBIR_NO_UNROLL(output);
    output[0] = stbir__float_to_half(encode[stbir__encode_order0]);
    #if stbir__coder_min_num >= 2
    output[1] = stbir__float_to_half(encode[stbir__encode_order1]);
    #endif
    #if stbir__coder_min_num >= 3
    output[2] = stbir__float_to_half(encode[stbir__encode_order2]);
    #endif
    output += stbir__coder_min_num;
    encode += stbir__coder_min_num;
  }
  #endif
}

static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int width_times_channels, void const * inputp )
{
  #ifdef stbir__decode_swizzle
  float STBIR_STREAMOUT_PTR( * ) decode = decodep;
  float * decode_end = (float*) decode + width_times_channels;
  float const * input = (float const *)inputp;

  #ifdef STBIR_SIMD
  if ( width_times_channels >= 16 )
  {
    float const * end_input_m16 = input + width_times_channels - 16;
    decode_end -= 16;
    STBIR_NO_UNROLL_LOOP_START_INF_FOR
    for(;;)
    {
      STBIR_NO_UNROLL(decode);
      #ifdef stbir__decode_swizzle
      #ifdef STBIR_SIMD8
      {
        stbir__simdf8 of0,of1;
        stbir__simdf8_load( of0, input );
        stbir__simdf8_load( of1, input+8 );
        stbir__decode_simdf8_flip( of0 );
        stbir__decode_simdf8_flip( of1 );
        stbir__simdf8_store( decode, of0 );
        stbir__simdf8_store( decode+8, of1 );
      }
      #else
      {
        stbir__simdf of0,of1,of2,of3;
        stbir__simdf_load( of0, input );
        stbir__simdf_load( of1, input+4 );
        stbir__simdf_load( of2, input+8 );
        stbir__simdf_load( of3, input+12 );
        stbir__decode_simdf4_flip( of0 );
        stbir__decode_simdf4_flip( of1 );
        stbir__decode_simdf4_flip( of2 );
        stbir__decode_simdf4_flip( of3 );
        stbir__simdf_store( decode, of0 );
        stbir__simdf_store( decode+4, of1 );
        stbir__simdf_store( decode+8, of2 );
        stbir__simdf_store( decode+12, of3 );
      }
      #endif
      #endif
      decode += 16;
      input += 16;
      if ( decode <= decode_end )
        continue;
      if ( decode == ( decode_end + 16 ) )
        break;
      decode = decode_end; // backup and do last couple
      input = end_input_m16;
    }
    return;
  }
  #endif

  // try to do blocks of 4 when you can
  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
  decode += 4;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  while( decode <= decode_end )
  {
    STBIR_SIMD_NO_UNROLL(decode);
    decode[0-4] = input[stbir__decode_order0];
    decode[1-4] = input[stbir__decode_order1];
    decode[2-4] = input[stbir__decode_order2];
    decode[3-4] = input[stbir__decode_order3];
    decode += 4;
    input += 4;
  }
  decode -= 4;
  #endif

  // do the remnants
  #if stbir__coder_min_num < 4
  STBIR_NO_UNROLL_LOOP_START
  while( decode < decode_end )
  {
    STBIR_NO_UNROLL(decode);
    decode[0] = input[stbir__decode_order0];
    #if stbir__coder_min_num >= 2
    decode[1] = input[stbir__decode_order1];
    #endif
    #if stbir__coder_min_num >= 3
    decode[2] = input[stbir__decode_order2];
    #endif
    decode += stbir__coder_min_num;
    input += stbir__coder_min_num;
  }
  #endif

  #else

  if ( (void*)decodep != inputp )
    STBIR_MEMCPY( decodep, inputp, width_times_channels * sizeof( float ) );

  #endif
}
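
/*
   EXAMPLE (illustrative only, not part of the library): when a swizzle is active,
   the float decoder above cannot just memcpy; it reorders channels while copying,
   using the stbir__decode_orderN indices. The sketch below shows the scalar form of
   that reorder for a 4-channel pixel. The index values (2,1,0,3, i.e. BGRA input to
   RGBA working order) and all names are assumptions made for the example; the real
   values come from macros defined before this section is included.

```c
// Hypothetical order table: output channel N is read from input channel EX_ORDERN.
enum { EX_ORDER0 = 2, EX_ORDER1 = 1, EX_ORDER2 = 0, EX_ORDER3 = 3 };

// Copy floats while reordering channels; assumes width_times_channels is a multiple of 4.
static void example_decode_swizzle4( float * decode, float const * input, int width_times_channels )
{
  float const * input_end = input + width_times_channels;
  while ( input < input_end )
  {
    decode[0] = input[EX_ORDER0];
    decode[1] = input[EX_ORDER1];
    decode[2] = input[EX_ORDER2];
    decode[3] = input[EX_ORDER3];
    decode += 4;
    input += 4;
  }
}
```

   With the identity order (0,1,2,3) the loop degenerates to a plain copy, which is
   why the unswizzled build can use a single memcpy instead.
*/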

static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int width_times_channels, float const * encode )
{
  #if !defined( STBIR_FLOAT_HIGH_CLAMP ) && !defined(STBIR_FLOAT_LOW_CLAMP) && !defined(stbir__decode_swizzle)

  if ( (void*)outputp != (void*) encode )
    STBIR_MEMCPY( outputp, encode, width_times_channels * sizeof( float ) );

  #else

  float STBIR_SIMD_STREAMOUT_PTR( * ) output = (float*) outputp;
  float * end_output = ( (float*) output ) + width_times_channels;

  #ifdef STBIR_FLOAT_HIGH_CLAMP
  #define stbir_scalar_hi_clamp( v ) if ( v > STBIR_FLOAT_HIGH_CLAMP ) v = STBIR_FLOAT_HIGH_CLAMP;
  #else
  #define stbir_scalar_hi_clamp( v )
  #endif
  #ifdef STBIR_FLOAT_LOW_CLAMP
  #define stbir_scalar_lo_clamp( v ) if ( v < STBIR_FLOAT_LOW_CLAMP ) v = STBIR_FLOAT_LOW_CLAMP;
  #else
  #define stbir_scalar_lo_clamp( v )
  #endif

  #ifdef STBIR_SIMD

  #ifdef STBIR_FLOAT_HIGH_CLAMP
  const stbir__simdfX high_clamp = stbir__simdf_frepX(STBIR_FLOAT_HIGH_CLAMP);
  #endif
  #ifdef STBIR_FLOAT_LOW_CLAMP
  const stbir__simdfX low_clamp = stbir__simdf_frepX(STBIR_FLOAT_LOW_CLAMP);
  #endif

  if ( width_times_channels >= ( stbir__simdfX_float_count * 2 ) )
  {
    float const * end_encode_m8 = encode + width_times_channels - ( stbir__simdfX_float_count * 2 );
    end_output -= ( stbir__simdfX_float_count * 2 );
    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
    for(;;)
    {
      stbir__simdfX e0, e1;
      STBIR_SIMD_NO_UNROLL(encode);
      stbir__simdfX_load( e0, encode );
      stbir__simdfX_load( e1, encode+stbir__simdfX_float_count );
#ifdef STBIR_FLOAT_HIGH_CLAMP
      stbir__simdfX_min( e0, e0, high_clamp );
      stbir__simdfX_min( e1, e1, high_clamp );
#endif
#ifdef STBIR_FLOAT_LOW_CLAMP
      stbir__simdfX_max( e0, e0, low_clamp );
      stbir__simdfX_max( e1, e1, low_clamp );
#endif
      stbir__encode_simdfX_unflip( e0 );
      stbir__encode_simdfX_unflip( e1 );
      stbir__simdfX_store( output, e0 );
      stbir__simdfX_store( output+stbir__simdfX_float_count, e1 );
      encode += stbir__simdfX_float_count * 2;
      output += stbir__simdfX_float_count * 2;
      if ( output < end_output )
        continue;
      if ( output == ( end_output + ( stbir__simdfX_float_count * 2 ) ) )
        break;
      output = end_output; // backup and do last couple
      encode = end_encode_m8;
    }
    return;
  }

  // try to do blocks of 4 when you can
  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
  output += 4;
  STBIR_NO_UNROLL_LOOP_START
  while( output <= end_output )
  {
    stbir__simdf e0;
    STBIR_NO_UNROLL(encode);
    stbir__simdf_load( e0, encode );
#ifdef STBIR_FLOAT_HIGH_CLAMP
    stbir__simdf_min( e0, e0, high_clamp );
#endif
#ifdef STBIR_FLOAT_LOW_CLAMP
    stbir__simdf_max( e0, e0, low_clamp );
#endif
    stbir__encode_simdf4_unflip( e0 );
    stbir__simdf_store( output-4, e0 );
    output += 4;
    encode += 4;
  }
  output -= 4;
  #endif

  #else

  // try to do blocks of 4 when you can
  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
  output += 4;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  while( output <= end_output )
  {
    float e;
    STBIR_SIMD_NO_UNROLL(encode);
    e = encode[ stbir__encode_order0 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[0-4] = e;
    e = encode[ stbir__encode_order1 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[1-4] = e;
    e = encode[ stbir__encode_order2 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[2-4] = e;
    e = encode[ stbir__encode_order3 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[3-4] = e;
    output += 4;
    encode += 4;
  }
  output -= 4;

  #endif

  #endif

  // do the remnants
  #if stbir__coder_min_num < 4
  STBIR_NO_UNROLL_LOOP_START
  while( output < end_output )
  {
    float e;
    STBIR_NO_UNROLL(encode);
    e = encode[ stbir__encode_order0 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[0] = e;
    #if stbir__coder_min_num >= 2
    e = encode[ stbir__encode_order1 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[1] = e;
    #endif
    #if stbir__coder_min_num >= 3
    e = encode[ stbir__encode_order2 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[2] = e;
    #endif
    output += stbir__coder_min_num;
    encode += stbir__coder_min_num;
  }
  #endif

  #endif
}

#undef stbir__decode_suffix
#undef stbir__decode_simdf8_flip
#undef stbir__decode_simdf4_flip
#undef stbir__decode_order0
#undef stbir__decode_order1
#undef stbir__decode_order2
#undef stbir__decode_order3
#undef stbir__encode_order0
#undef stbir__encode_order1
#undef stbir__encode_order2
#undef stbir__encode_order3
#undef stbir__encode_simdf8_unflip
#undef stbir__encode_simdf4_unflip
#undef stbir__encode_simdfX_unflip
#undef STBIR__CODER_NAME
#undef stbir__coder_min_num
#undef stbir__decode_swizzle
#undef stbir_scalar_hi_clamp
#undef stbir_scalar_lo_clamp
#undef STB_IMAGE_RESIZE_DO_CODERS

#elif defined( STB_IMAGE_RESIZE_DO_VERTICALS)

#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#define STBIR_chans( start, end ) STBIR_strs_join14(start,STBIR__vertical_channels,end,_cont)
#else
#define STBIR_chans( start, end ) STBIR_strs_join1(start,STBIR__vertical_channels,end)
#endif

#if STBIR__vertical_channels >= 1
#define stbIF0( code ) code
#else
#define stbIF0( code )
#endif
#if STBIR__vertical_channels >= 2
#define stbIF1( code ) code
#else
#define stbIF1( code )
#endif
#if STBIR__vertical_channels >= 3
#define stbIF2( code ) code
#else
#define stbIF2( code )
#endif
#if STBIR__vertical_channels >= 4
#define stbIF3( code ) code
#else
#define stbIF3( code )
#endif
#if STBIR__vertical_channels >= 5
#define stbIF4( code ) code
#else
#define stbIF4( code )
#endif
#if STBIR__vertical_channels >= 6
#define stbIF5( code ) code
#else
#define stbIF5( code )
#endif
#if STBIR__vertical_channels >= 7
#define stbIF6( code ) code
#else
#define stbIF6( code )
#endif
#if STBIR__vertical_channels >= 8
#define stbIF7( code ) code
#else
#define stbIF7( code )
#endif

static void STBIR_chans( stbir__vertical_scatter_with_,_coeffs)( float ** outputs, float const * vertical_coefficients, float const * input, float const * input_end )
{
  stbIF0( float STBIR_SIMD_STREAMOUT_PTR( * ) output0 = outputs[0]; float c0s = vertical_coefficients[0]; )
  stbIF1( float STBIR_SIMD_STREAMOUT_PTR( * ) output1 = outputs[1]; float c1s = vertical_coefficients[1]; )
  stbIF2( float STBIR_SIMD_STREAMOUT_PTR( * ) output2 = outputs[2]; float c2s = vertical_coefficients[2]; )
  stbIF3( float STBIR_SIMD_STREAMOUT_PTR( * ) output3 = outputs[3]; float c3s = vertical_coefficients[3]; )
  stbIF4( float STBIR_SIMD_STREAMOUT_PTR( * ) output4 = outputs[4]; float c4s = vertical_coefficients[4]; )
  stbIF5( float STBIR_SIMD_STREAMOUT_PTR( * ) output5 = outputs[5]; float c5s = vertical_coefficients[5]; )
  stbIF6( float STBIR_SIMD_STREAMOUT_PTR( * ) output6 = outputs[6]; float c6s = vertical_coefficients[6]; )
  stbIF7( float STBIR_SIMD_STREAMOUT_PTR( * ) output7 = outputs[7]; float c7s = vertical_coefficients[7]; )

  #ifdef STBIR_SIMD
  {
    stbIF0(stbir__simdfX c0 = stbir__simdf_frepX( c0s ); )
    stbIF1(stbir__simdfX c1 = stbir__simdf_frepX( c1s ); )
    stbIF2(stbir__simdfX c2 = stbir__simdf_frepX( c2s ); )
    stbIF3(stbir__simdfX c3 = stbir__simdf_frepX( c3s ); )
    stbIF4(stbir__simdfX c4 = stbir__simdf_frepX( c4s ); )
    stbIF5(stbir__simdfX c5 = stbir__simdf_frepX( c5s ); )
    stbIF6(stbir__simdfX c6 = stbir__simdf_frepX( c6s ); )
    stbIF7(stbir__simdfX c7 = stbir__simdf_frepX( c7s ); )
    STBIR_SIMD_NO_UNROLL_LOOP_START
    while ( ( (char*)input_end - (char*) input ) >= (16*stbir__simdfX_float_count) )
    {
      stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3;
      STBIR_SIMD_NO_UNROLL(output0);

      stbir__simdfX_load( r0, input );               stbir__simdfX_load( r1, input+stbir__simdfX_float_count );     stbir__simdfX_load( r2, input+(2*stbir__simdfX_float_count) );      stbir__simdfX_load( r3, input+(3*stbir__simdfX_float_count) );

      #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
      stbIF0( stbir__simdfX_load( o0, output0 );     stbir__simdfX_load( o1, output0+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output0+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output0+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c0 );  stbir__simdfX_madd( o1, o1, r1, c0 );  stbir__simdfX_madd( o2, o2, r2, c0 );   stbir__simdfX_madd( o3, o3, r3, c0 );
              stbir__simdfX_store( output0, o0 );    stbir__simdfX_store( output0+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output0+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output0+(3*stbir__simdfX_float_count), o3 ); )
      stbIF1( stbir__simdfX_load( o0, output1 );     stbir__simdfX_load( o1, output1+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output1+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output1+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c1 );  stbir__simdfX_madd( o1, o1, r1, c1 );  stbir__simdfX_madd( o2, o2, r2, c1 );   stbir__simdfX_madd( o3, o3, r3, c1 );
              stbir__simdfX_store( output1, o0 );    stbir__simdfX_store( output1+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output1+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output1+(3*stbir__simdfX_float_count), o3 ); )
      stbIF2( stbir__simdfX_load( o0, output2 );     stbir__simdfX_load( o1, output2+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output2+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output2+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c2 );  stbir__simdfX_madd( o1, o1, r1, c2 );  stbir__simdfX_madd( o2, o2, r2, c2 );   stbir__simdfX_madd( o3, o3, r3, c2 );
              stbir__simdfX_store( output2, o0 );    stbir__simdfX_store( output2+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output2+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output2+(3*stbir__simdfX_float_count), o3 ); )
      stbIF3( stbir__simdfX_load( o0, output3 );     stbir__simdfX_load( o1, output3+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output3+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output3+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c3 );  stbir__simdfX_madd( o1, o1, r1, c3 );  stbir__simdfX_madd( o2, o2, r2, c3 );   stbir__simdfX_madd( o3, o3, r3, c3 );
              stbir__simdfX_store( output3, o0 );    stbir__simdfX_store( output3+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output3+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output3+(3*stbir__simdfX_float_count), o3 ); )
      stbIF4( stbir__simdfX_load( o0, output4 );     stbir__simdfX_load( o1, output4+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output4+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output4+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c4 );  stbir__simdfX_madd( o1, o1, r1, c4 );  stbir__simdfX_madd( o2, o2, r2, c4 );   stbir__simdfX_madd( o3, o3, r3, c4 );
              stbir__simdfX_store( output4, o0 );    stbir__simdfX_store( output4+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output4+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output4+(3*stbir__simdfX_float_count), o3 ); )
      stbIF5( stbir__simdfX_load( o0, output5 );     stbir__simdfX_load( o1, output5+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output5+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output5+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c5 );  stbir__simdfX_madd( o1, o1, r1, c5 );  stbir__simdfX_madd( o2, o2, r2, c5 );   stbir__simdfX_madd( o3, o3, r3, c5 );
              stbir__simdfX_store( output5, o0 );    stbir__simdfX_store( output5+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output5+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output5+(3*stbir__simdfX_float_count), o3 ); )
      stbIF6( stbir__simdfX_load( o0, output6 );     stbir__simdfX_load( o1, output6+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output6+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output6+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c6 );  stbir__simdfX_madd( o1, o1, r1, c6 );  stbir__simdfX_madd( o2, o2, r2, c6 );   stbir__simdfX_madd( o3, o3, r3, c6 );
              stbir__simdfX_store( output6, o0 );    stbir__simdfX_store( output6+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output6+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output6+(3*stbir__simdfX_float_count), o3 ); )
      stbIF7( stbir__simdfX_load( o0, output7 );     stbir__simdfX_load( o1, output7+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output7+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output7+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c7 );  stbir__simdfX_madd( o1, o1, r1, c7 );  stbir__simdfX_madd( o2, o2, r2, c7 );   stbir__simdfX_madd( o3, o3, r3, c7 );
              stbir__simdfX_store( output7, o0 );    stbir__simdfX_store( output7+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output7+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output7+(3*stbir__simdfX_float_count), o3 ); )
      #else
      stbIF0( stbir__simdfX_mult( o0, r0, c0 );      stbir__simdfX_mult( o1, r1, c0 );      stbir__simdfX_mult( o2, r2, c0 );       stbir__simdfX_mult( o3, r3, c0 );
              stbir__simdfX_store( output0, o0 );    stbir__simdfX_store( output0+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output0+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output0+(3*stbir__simdfX_float_count), o3 ); )
      stbIF1( stbir__simdfX_mult( o0, r0, c1 );      stbir__simdfX_mult( o1, r1, c1 );      stbir__simdfX_mult( o2, r2, c1 );       stbir__simdfX_mult( o3, r3, c1 );
              stbir__simdfX_store( output1, o0 );    stbir__simdfX_store( output1+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output1+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output1+(3*stbir__simdfX_float_count), o3 ); )
      stbIF2( stbir__simdfX_mult( o0, r0, c2 );      stbir__simdfX_mult( o1, r1, c2 );      stbir__simdfX_mult( o2, r2, c2 );       stbir__simdfX_mult( o3, r3, c2 );
              stbir__simdfX_store( output2, o0 );    stbir__simdfX_store( output2+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output2+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output2+(3*stbir__simdfX_float_count), o3 ); )
      stbIF3( stbir__simdfX_mult( o0, r0, c3 );      stbir__simdfX_mult( o1, r1, c3 );      stbir__simdfX_mult( o2, r2, c3 );       stbir__simdfX_mult( o3, r3, c3 );
              stbir__simdfX_store( output3, o0 );    stbir__simdfX_store( output3+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output3+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output3+(3*stbir__simdfX_float_count), o3 ); )
      stbIF4( stbir__simdfX_mult( o0, r0, c4 );      stbir__simdfX_mult( o1, r1, c4 );      stbir__simdfX_mult( o2, r2, c4 );       stbir__simdfX_mult( o3, r3, c4 );
              stbir__simdfX_store( output4, o0 );    stbir__simdfX_store( output4+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output4+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output4+(3*stbir__simdfX_float_count), o3 ); )
9928      stbIF5( stbir__simdfX_mult( o0, r0, c5 );      stbir__simdfX_mult( o1, r1, c5 );      stbir__simdfX_mult( o2, r2, c5 );       stbir__simdfX_mult( o3, r3, c5 );
9929              stbir__simdfX_store( output5, o0 );    stbir__simdfX_store( output5+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output5+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output5+(3*stbir__simdfX_float_count), o3 ); )
9930      stbIF6( stbir__simdfX_mult( o0, r0, c6 );      stbir__simdfX_mult( o1, r1, c6 );      stbir__simdfX_mult( o2, r2, c6 );       stbir__simdfX_mult( o3, r3, c6 );
9931              stbir__simdfX_store( output6, o0 );    stbir__simdfX_store( output6+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output6+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output6+(3*stbir__simdfX_float_count), o3 ); )
9932      stbIF7( stbir__simdfX_mult( o0, r0, c7 );      stbir__simdfX_mult( o1, r1, c7 );      stbir__simdfX_mult( o2, r2, c7 );       stbir__simdfX_mult( o3, r3, c7 );
9933              stbir__simdfX_store( output7, o0 );    stbir__simdfX_store( output7+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output7+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output7+(3*stbir__simdfX_float_count), o3 ); )
9934      #endif
9935
9936      input += (4*stbir__simdfX_float_count);
9937      stbIF0( output0 += (4*stbir__simdfX_float_count); ) stbIF1( output1 += (4*stbir__simdfX_float_count); ) stbIF2( output2 += (4*stbir__simdfX_float_count); ) stbIF3( output3 += (4*stbir__simdfX_float_count); ) stbIF4( output4 += (4*stbir__simdfX_float_count); ) stbIF5( output5 += (4*stbir__simdfX_float_count); ) stbIF6( output6 += (4*stbir__simdfX_float_count); ) stbIF7( output7 += (4*stbir__simdfX_float_count); )
9938    }
    // then one 4-float (16-byte) vector at a time
    STBIR_SIMD_NO_UNROLL_LOOP_START
    while ( ( (char*)input_end - (char*) input ) >= 16 )
    {
      stbir__simdf o0, r0;
      STBIR_SIMD_NO_UNROLL(output0);

      stbir__simdf_load( r0, input );

      #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
      stbIF0( stbir__simdf_load( o0, output0 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) );  stbir__simdf_store( output0, o0 ); )
      stbIF1( stbir__simdf_load( o0, output1 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c1 ) );  stbir__simdf_store( output1, o0 ); )
      stbIF2( stbir__simdf_load( o0, output2 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c2 ) );  stbir__simdf_store( output2, o0 ); )
      stbIF3( stbir__simdf_load( o0, output3 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c3 ) );  stbir__simdf_store( output3, o0 ); )
      stbIF4( stbir__simdf_load( o0, output4 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c4 ) );  stbir__simdf_store( output4, o0 ); )
      stbIF5( stbir__simdf_load( o0, output5 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c5 ) );  stbir__simdf_store( output5, o0 ); )
      stbIF6( stbir__simdf_load( o0, output6 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) );  stbir__simdf_store( output6, o0 ); )
      stbIF7( stbir__simdf_load( o0, output7 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) );  stbir__simdf_store( output7, o0 ); )
      #else
      stbIF0( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) );   stbir__simdf_store( output0, o0 ); )
      stbIF1( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c1 ) );   stbir__simdf_store( output1, o0 ); )
      stbIF2( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c2 ) );   stbir__simdf_store( output2, o0 ); )
      stbIF3( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c3 ) );   stbir__simdf_store( output3, o0 ); )
      stbIF4( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c4 ) );   stbir__simdf_store( output4, o0 ); )
      stbIF5( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c5 ) );   stbir__simdf_store( output5, o0 ); )
      stbIF6( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) );   stbir__simdf_store( output6, o0 ); )
      stbIF7( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) );   stbir__simdf_store( output7, o0 ); )
      #endif

      input += 4;
      stbIF0( output0 += 4; ) stbIF1( output1 += 4; ) stbIF2( output2 += 4; ) stbIF3( output3 += 4; ) stbIF4( output4 += 4; ) stbIF5( output5 += 4; ) stbIF6( output6 += 4; ) stbIF7( output7 += 4; )
    }
  }
  #else
  // no SIMD: four floats per iteration
  STBIR_NO_UNROLL_LOOP_START
  while ( ( (char*)input_end - (char*) input ) >= 16 )
  {
    float r0, r1, r2, r3;
    STBIR_NO_UNROLL(input);

    r0 = input[0], r1 = input[1], r2 = input[2], r3 = input[3];

    #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
    stbIF0( output0[0] += ( r0 * c0s ); output0[1] += ( r1 * c0s ); output0[2] += ( r2 * c0s ); output0[3] += ( r3 * c0s ); )
    stbIF1( output1[0] += ( r0 * c1s ); output1[1] += ( r1 * c1s ); output1[2] += ( r2 * c1s ); output1[3] += ( r3 * c1s ); )
    stbIF2( output2[0] += ( r0 * c2s ); output2[1] += ( r1 * c2s ); output2[2] += ( r2 * c2s ); output2[3] += ( r3 * c2s ); )
    stbIF3( output3[0] += ( r0 * c3s ); output3[1] += ( r1 * c3s ); output3[2] += ( r2 * c3s ); output3[3] += ( r3 * c3s ); )
    stbIF4( output4[0] += ( r0 * c4s ); output4[1] += ( r1 * c4s ); output4[2] += ( r2 * c4s ); output4[3] += ( r3 * c4s ); )
    stbIF5( output5[0] += ( r0 * c5s ); output5[1] += ( r1 * c5s ); output5[2] += ( r2 * c5s ); output5[3] += ( r3 * c5s ); )
    stbIF6( output6[0] += ( r0 * c6s ); output6[1] += ( r1 * c6s ); output6[2] += ( r2 * c6s ); output6[3] += ( r3 * c6s ); )
    stbIF7( output7[0] += ( r0 * c7s ); output7[1] += ( r1 * c7s ); output7[2] += ( r2 * c7s ); output7[3] += ( r3 * c7s ); )
    #else
    stbIF0( output0[0]  = ( r0 * c0s ); output0[1]  = ( r1 * c0s ); output0[2]  = ( r2 * c0s ); output0[3]  = ( r3 * c0s ); )
    stbIF1( output1[0]  = ( r0 * c1s ); output1[1]  = ( r1 * c1s ); output1[2]  = ( r2 * c1s ); output1[3]  = ( r3 * c1s ); )
    stbIF2( output2[0]  = ( r0 * c2s ); output2[1]  = ( r1 * c2s ); output2[2]  = ( r2 * c2s ); output2[3]  = ( r3 * c2s ); )
    stbIF3( output3[0]  = ( r0 * c3s ); output3[1]  = ( r1 * c3s ); output3[2]  = ( r2 * c3s ); output3[3]  = ( r3 * c3s ); )
    stbIF4( output4[0]  = ( r0 * c4s ); output4[1]  = ( r1 * c4s ); output4[2]  = ( r2 * c4s ); output4[3]  = ( r3 * c4s ); )
    stbIF5( output5[0]  = ( r0 * c5s ); output5[1]  = ( r1 * c5s ); output5[2]  = ( r2 * c5s ); output5[3]  = ( r3 * c5s ); )
    stbIF6( output6[0]  = ( r0 * c6s ); output6[1]  = ( r1 * c6s ); output6[2]  = ( r2 * c6s ); output6[3]  = ( r3 * c6s ); )
    stbIF7( output7[0]  = ( r0 * c7s ); output7[1]  = ( r1 * c7s ); output7[2]  = ( r2 * c7s ); output7[3]  = ( r3 * c7s ); )
    #endif

    input += 4;
    stbIF0( output0 += 4; ) stbIF1( output1 += 4; ) stbIF2( output2 += 4; ) stbIF3( output3 += 4; ) stbIF4( output4 += 4; ) stbIF5( output5 += 4; ) stbIF6( output6 += 4; ) stbIF7( output7 += 4; )
  }
  #endif
  // scalar tail: whatever floats remain, one at a time
  STBIR_NO_UNROLL_LOOP_START
  while ( input < input_end )
  {
    float r = input[0];
    STBIR_NO_UNROLL(output0);

    #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
    stbIF0( output0[0] += ( r * c0s ); )
    stbIF1( output1[0] += ( r * c1s ); )
    stbIF2( output2[0] += ( r * c2s ); )
    stbIF3( output3[0] += ( r * c3s ); )
    stbIF4( output4[0] += ( r * c4s ); )
    stbIF5( output5[0] += ( r * c5s ); )
    stbIF6( output6[0] += ( r * c6s ); )
    stbIF7( output7[0] += ( r * c7s ); )
    #else
    stbIF0( output0[0]  = ( r * c0s ); )
    stbIF1( output1[0]  = ( r * c1s ); )
    stbIF2( output2[0]  = ( r * c2s ); )
    stbIF3( output3[0]  = ( r * c3s ); )
    stbIF4( output4[0]  = ( r * c4s ); )
    stbIF5( output5[0]  = ( r * c5s ); )
    stbIF6( output6[0]  = ( r * c6s ); )
    stbIF7( output7[0]  = ( r * c7s ); )
    #endif

    ++input;
    stbIF0( ++output0; ) stbIF1( ++output1; ) stbIF2( ++output2; ) stbIF3( ++output3; ) stbIF4( ++output4; ) stbIF5( ++output5; ) stbIF6( ++output6; ) stbIF7( ++output7; )
  }
}

// Vertical gather: blends the input rows enabled for this instantiation (the stbIFn blocks,
//   at most eight rows) into a single output row, weighting row k by vertical_coefficients[k]:
//     output[i] = inputs[0][i]*c0s + inputs[1][i]*c1s + ...
//   With STB_IMAGE_RESIZE_VERTICAL_CONTINUE defined, the sum is added to what output already holds.
//   input0_end marks the end of inputs[0]; every row is walked for the same number of floats.
static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp, float const * vertical_coefficients, float const ** inputs, float const * input0_end )
{
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = outputp;

  stbIF0( float const * input0 = inputs[0]; float c0s = vertical_coefficients[0]; )
  stbIF1( float const * input1 = inputs[1]; float c1s = vertical_coefficients[1]; )
  stbIF2( float const * input2 = inputs[2]; float c2s = vertical_coefficients[2]; )
  stbIF3( float const * input3 = inputs[3]; float c3s = vertical_coefficients[3]; )
  stbIF4( float const * input4 = inputs[4]; float c4s = vertical_coefficients[4]; )
  stbIF5( float const * input5 = inputs[5]; float c5s = vertical_coefficients[5]; )
  stbIF6( float const * input6 = inputs[6]; float c6s = vertical_coefficients[6]; )
  stbIF7( float const * input7 = inputs[7]; float c7s = vertical_coefficients[7]; )

#if ( STBIR__vertical_channels == 1 ) && !defined(STB_IMAGE_RESIZE_VERTICAL_CONTINUE)
  // one input row with a weight of 1.0 (within 1e-6): the gather is just a copy
  if ( ( c0s >= (1.0f-0.000001f) ) && ( c0s <= (1.0f+0.000001f) ) )
  {
    STBIR_MEMCPY( output, input0, (char*)input0_end - (char*)input0 );
    return;
  }
#endif

  #ifdef STBIR_SIMD
  {
    stbIF0(stbir__simdfX c0 = stbir__simdf_frepX( c0s ); )
    stbIF1(stbir__simdfX c1 = stbir__simdf_frepX( c1s ); )
    stbIF2(stbir__simdfX c2 = stbir__simdf_frepX( c2s ); )
    stbIF3(stbir__simdfX c3 = stbir__simdf_frepX( c3s ); )
    stbIF4(stbir__simdfX c4 = stbir__simdf_frepX( c4s ); )
    stbIF5(stbir__simdfX c5 = stbir__simdf_frepX( c5s ); )
    stbIF6(stbir__simdfX c6 = stbir__simdf_frepX( c6s ); )
    stbIF7(stbir__simdfX c7 = stbir__simdf_frepX( c7s ); )

    STBIR_SIMD_NO_UNROLL_LOOP_START
    while ( ( (char*)input0_end - (char*) input0 ) >= (16*stbir__simdfX_float_count) )
    {
      stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3;
      STBIR_SIMD_NO_UNROLL(output);

      // prefetch four loop iterations ahead (doesn't affect much for small resizes, but helps with big ones)
      stbIF0( stbir__prefetch( input0 + (16*stbir__simdfX_float_count) ); )
      stbIF1( stbir__prefetch( input1 + (16*stbir__simdfX_float_count) ); )
      stbIF2( stbir__prefetch( input2 + (16*stbir__simdfX_float_count) ); )
      stbIF3( stbir__prefetch( input3 + (16*stbir__simdfX_float_count) ); )
      stbIF4( stbir__prefetch( input4 + (16*stbir__simdfX_float_count) ); )
      stbIF5( stbir__prefetch( input5 + (16*stbir__simdfX_float_count) ); )
      stbIF6( stbir__prefetch( input6 + (16*stbir__simdfX_float_count) ); )
      stbIF7( stbir__prefetch( input7 + (16*stbir__simdfX_float_count) ); )

      #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
      stbIF0( stbir__simdfX_load( o0, output );      stbir__simdfX_load( o1, output+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( o3, output+(3*stbir__simdfX_float_count) );
              stbir__simdfX_load( r0, input0 );      stbir__simdfX_load( r1, input0+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input0+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input0+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c0 );  stbir__simdfX_madd( o1, o1, r1, c0 );                         stbir__simdfX_madd( o2, o2, r2, c0 );                             stbir__simdfX_madd( o3, o3, r3, c0 ); )
      #else
      stbIF0( stbir__simdfX_load( r0, input0 );      stbir__simdfX_load( r1, input0+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input0+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input0+(3*stbir__simdfX_float_count) );
              stbir__simdfX_mult( o0, r0, c0 );      stbir__simdfX_mult( o1, r1, c0 );                             stbir__simdfX_mult( o2, r2, c0 );                                 stbir__simdfX_mult( o3, r3, c0 );  )
      #endif

      stbIF1( stbir__simdfX_load( r0, input1 );      stbir__simdfX_load( r1, input1+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input1+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input1+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c1 );  stbir__simdfX_madd( o1, o1, r1, c1 );                         stbir__simdfX_madd( o2, o2, r2, c1 );                             stbir__simdfX_madd( o3, o3, r3, c1 ); )
      stbIF2( stbir__simdfX_load( r0, input2 );      stbir__simdfX_load( r1, input2+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input2+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input2+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c2 );  stbir__simdfX_madd( o1, o1, r1, c2 );                         stbir__simdfX_madd( o2, o2, r2, c2 );                             stbir__simdfX_madd( o3, o3, r3, c2 ); )
      stbIF3( stbir__simdfX_load( r0, input3 );      stbir__simdfX_load( r1, input3+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input3+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input3+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c3 );  stbir__simdfX_madd( o1, o1, r1, c3 );                         stbir__simdfX_madd( o2, o2, r2, c3 );                             stbir__simdfX_madd( o3, o3, r3, c3 ); )
      stbIF4( stbir__simdfX_load( r0, input4 );      stbir__simdfX_load( r1, input4+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input4+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input4+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c4 );  stbir__simdfX_madd( o1, o1, r1, c4 );                         stbir__simdfX_madd( o2, o2, r2, c4 );                             stbir__simdfX_madd( o3, o3, r3, c4 ); )
      stbIF5( stbir__simdfX_load( r0, input5 );      stbir__simdfX_load( r1, input5+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input5+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input5+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c5 );  stbir__simdfX_madd( o1, o1, r1, c5 );                         stbir__simdfX_madd( o2, o2, r2, c5 );                             stbir__simdfX_madd( o3, o3, r3, c5 ); )
      stbIF6( stbir__simdfX_load( r0, input6 );      stbir__simdfX_load( r1, input6+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input6+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input6+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c6 );  stbir__simdfX_madd( o1, o1, r1, c6 );                         stbir__simdfX_madd( o2, o2, r2, c6 );                             stbir__simdfX_madd( o3, o3, r3, c6 ); )
      stbIF7( stbir__simdfX_load( r0, input7 );      stbir__simdfX_load( r1, input7+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input7+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input7+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c7 );  stbir__simdfX_madd( o1, o1, r1, c7 );                         stbir__simdfX_madd( o2, o2, r2, c7 );                             stbir__simdfX_madd( o3, o3, r3, c7 ); )

      stbir__simdfX_store( output, o0 );             stbir__simdfX_store( output+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output+(2*stbir__simdfX_float_count), o2 );  stbir__simdfX_store( output+(3*stbir__simdfX_float_count), o3 );
      output += (4*stbir__simdfX_float_count);
      stbIF0( input0 += (4*stbir__simdfX_float_count); ) stbIF1( input1 += (4*stbir__simdfX_float_count); ) stbIF2( input2 += (4*stbir__simdfX_float_count); ) stbIF3( input3 += (4*stbir__simdfX_float_count); ) stbIF4( input4 += (4*stbir__simdfX_float_count); ) stbIF5( input5 += (4*stbir__simdfX_float_count); ) stbIF6( input6 += (4*stbir__simdfX_float_count); ) stbIF7( input7 += (4*stbir__simdfX_float_count); )
    }

    // then one 4-float (16-byte) vector at a time
    STBIR_SIMD_NO_UNROLL_LOOP_START
    while ( ( (char*)input0_end - (char*) input0 ) >= 16 )
    {
      stbir__simdf o0, r0;
      STBIR_SIMD_NO_UNROLL(output);

      #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
      stbIF0( stbir__simdf_load( o0, output );   stbir__simdf_load( r0, input0 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) ); )
      #else
      stbIF0( stbir__simdf_load( r0, input0 );  stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) ); )
      #endif
      stbIF1( stbir__simdf_load( r0, input1 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c1 ) ); )
      stbIF2( stbir__simdf_load( r0, input2 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c2 ) ); )
      stbIF3( stbir__simdf_load( r0, input3 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c3 ) ); )
      stbIF4( stbir__simdf_load( r0, input4 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c4 ) ); )
      stbIF5( stbir__simdf_load( r0, input5 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c5 ) ); )
      stbIF6( stbir__simdf_load( r0, input6 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) ); )
      stbIF7( stbir__simdf_load( r0, input7 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) ); )

      stbir__simdf_store( output, o0 );
      output += 4;
      stbIF0( input0 += 4; ) stbIF1( input1 += 4; ) stbIF2( input2 += 4; ) stbIF3( input3 += 4; ) stbIF4( input4 += 4; ) stbIF5( input5 += 4; ) stbIF6( input6 += 4; ) stbIF7( input7 += 4; )
    }
  }
  #else
  STBIR_NO_UNROLL_LOOP_START
  while ( ( (char*)input0_end - (char*) input0 ) >= 16 )
  {
    float o0, o1, o2, o3;
    STBIR_NO_UNROLL(output);
    #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
    stbIF0( o0 = output[0] + input0[0] * c0s; o1 = output[1] + input0[1] * c0s; o2 = output[2] + input0[2] * c0s; o3 = output[3] + input0[3] * c0s; )
    #else
    stbIF0( o0  = input0[0] * c0s; o1  = input0[1] * c0s; o2  = input0[2] * c0s; o3  = input0[3] * c0s; )
    #endif
    stbIF1( o0 += input1[0] * c1s; o1 += input1[1] * c1s; o2 += input1[2] * c1s; o3 += input1[3] * c1s; )
    stbIF2( o0 += input2[0] * c2s; o1 += input2[1] * c2s; o2 += input2[2] * c2s; o3 += input2[3] * c2s; )
    stbIF3( o0 += input3[0] * c3s; o1 += input3[1] * c3s; o2 += input3[2] * c3s; o3 += input3[3] * c3s; )
    stbIF4( o0 += input4[0] * c4s; o1 += input4[1] * c4s; o2 += input4[2] * c4s; o3 += input4[3] * c4s; )
    stbIF5( o0 += input5[0] * c5s; o1 += input5[1] * c5s; o2 += input5[2] * c5s; o3 += input5[3] * c5s; )
    stbIF6( o0 += input6[0] * c6s; o1 += input6[1] * c6s; o2 += input6[2] * c6s; o3 += input6[3] * c6s; )
    stbIF7( o0 += input7[0] * c7s; o1 += input7[1] * c7s; o2 += input7[2] * c7s; o3 += input7[3] * c7s; )
    output[0] = o0; output[1] = o1; output[2] = o2; output[3] = o3;
    output += 4;
    stbIF0( input0 += 4; ) stbIF1( input1 += 4; ) stbIF2( input2 += 4; ) stbIF3( input3 += 4; ) stbIF4( input4 += 4; ) stbIF5( input5 += 4; ) stbIF6( input6 += 4; ) stbIF7( input7 += 4; )
  }
  #endif
  // scalar tail: whatever floats remain, one at a time
  STBIR_NO_UNROLL_LOOP_START
  while ( input0 < input0_end )
  {
    float o0;
    STBIR_NO_UNROLL(output);
    #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
    stbIF0( o0 = output[0] + input0[0] * c0s; )
    #else
    stbIF0( o0  = input0[0] * c0s; )
    #endif
    stbIF1( o0 += input1[0] * c1s; )
    stbIF2( o0 += input2[0] * c2s; )
    stbIF3( o0 += input3[0] * c3s; )
    stbIF4( o0 += input4[0] * c4s; )
    stbIF5( o0 += input5[0] * c5s; )
    stbIF6( o0 += input6[0] * c6s; )
    stbIF7( o0 += input7[0] * c7s; )
    output[0] = o0;
    ++output;
    stbIF0( ++input0; ) stbIF1( ++input1; ) stbIF2( ++input2; ) stbIF3( ++input3; ) stbIF4( ++input4; ) stbIF5( ++input5; ) stbIF6( ++input6; ) stbIF7( ++input7; )
  }
}

#undef stbIF0
#undef stbIF1
#undef stbIF2
#undef stbIF3
#undef stbIF4
#undef stbIF5
#undef stbIF6
#undef stbIF7
#undef STB_IMAGE_RESIZE_DO_VERTICALS
#undef STBIR__vertical_channels
#undef STB_IMAGE_RESIZE_DO_HORIZONTALS
#undef STBIR_strs_join24
#undef STBIR_strs_join14
#undef STBIR_chans
#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#undef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#endif

#else // !STB_IMAGE_RESIZE_DO_VERTICALS

#define STBIR_chans( start, end ) STBIR_strs_join1(start,STBIR__horizontal_channels,end)

// Default building blocks: each one is guarded by #ifndef, so a version defined earlier
//   takes precedence; otherwise an N-coefficient step is composed from the smaller steps.
#ifndef stbir__2_coeff_only
#define stbir__2_coeff_only()             \
    stbir__1_coeff_only();                \
    stbir__1_coeff_remnant(1);
#endif

#ifndef stbir__2_coeff_remnant
#define stbir__2_coeff_remnant( ofs )     \
    stbir__1_coeff_remnant(ofs);          \
    stbir__1_coeff_remnant((ofs)+1);
#endif

#ifndef stbir__3_coeff_only
#define stbir__3_coeff_only()             \
    stbir__2_coeff_only();                \
    stbir__1_coeff_remnant(2);
#endif

#ifndef stbir__3_coeff_remnant
#define stbir__3_coeff_remnant( ofs )     \
    stbir__2_coeff_remnant(ofs);          \
    stbir__1_coeff_remnant((ofs)+2);
#endif

// no setup is needed by default for the 3-coefficient remnant
#ifndef stbir__3_coeff_setup
#define stbir__3_coeff_setup()
#endif

#ifndef stbir__4_coeff_start
#define stbir__4_coeff_start()            \
    stbir__2_coeff_only();                \
    stbir__2_coeff_remnant(2);
#endif

#ifndef stbir__4_coeff_continue_from_4
#define stbir__4_coeff_continue_from_4( ofs )     \
    stbir__2_coeff_remnant(ofs);                  \
    stbir__2_coeff_remnant((ofs)+2);
#endif

// the "tiny" store (used by the 1 to 3 coefficient gathers) defaults to the regular store
#ifndef stbir__store_output_tiny
#define stbir__store_output_tiny stbir__store_output
#endif

// Horizontal gathers, one function per fixed coefficient count. For every output pixel, decode
//   points at the first contributing input pixel (horizontal_contributors->n0) and hc at the
//   coefficients; the stbir__N_coeff_* steps accumulate the weighted sum and a
//   stbir__store_output* step follows. Nothing else in the loop bodies moves output or the
//   contributor and coefficient pointers, so the loops rely on those macros to advance them.
static void STBIR_chans( stbir__horizontal_gather_,_channels_with_1_coeff)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__1_coeff_only();
    stbir__store_output_tiny();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_2_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__2_coeff_only();
    stbir__store_output_tiny();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_3_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__3_coeff_only();
    stbir__store_output_tiny();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_4_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_5_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__1_coeff_remnant(4);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_6_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__2_coeff_remnant(4);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_7_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  stbir__3_coeff_setup();
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;

    stbir__4_coeff_start();
    stbir__3_coeff_remnant(4);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_8_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__4_coeff_continue_from_4(4);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_9_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__4_coeff_continue_from_4(4);
    stbir__1_coeff_remnant(8);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_10_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__4_coeff_continue_from_4(4);
    stbir__2_coeff_remnant(8);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_11_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  stbir__3_coeff_setup();
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__4_coeff_continue_from_4(4);
    stbir__3_coeff_remnant(8);
10401    stbir__store_output();
10402  } while ( output < output_end );
10403}
10404
10405static void STBIR_chans( stbir__horizontal_gather_,_channels_with_12_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
10406{
10407  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
10408  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
10409  STBIR_SIMD_NO_UNROLL_LOOP_START
10410  do {
10411    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
10412    float const * hc = horizontal_coefficients;
10413    stbir__4_coeff_start();
10414    stbir__4_coeff_continue_from_4(4);
10415    stbir__4_coeff_continue_from_4(8);
10416    stbir__store_output();
10417  } while ( output < output_end );
10418}
10419
static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod0 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 4 + 3 ) >> 2;
    float const * hc = horizontal_coefficients;

    stbir__4_coeff_start();
    STBIR_SIMD_NO_UNROLL_LOOP_START
    do {
      hc += 4;
      decode += STBIR__horizontal_channels * 4;
      stbir__4_coeff_continue_from_4( 0 );
      --n;
    } while ( n > 0 );
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod1 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 5 + 3 ) >> 2;
    float const * hc = horizontal_coefficients;

    stbir__4_coeff_start();
    STBIR_SIMD_NO_UNROLL_LOOP_START
    do {
      hc += 4;
      decode += STBIR__horizontal_channels * 4;
      stbir__4_coeff_continue_from_4( 0 );
      --n;
    } while ( n > 0 );
    stbir__1_coeff_remnant( 4 );
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod2 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 6 + 3 ) >> 2;
    float const * hc = horizontal_coefficients;

    stbir__4_coeff_start();
    STBIR_SIMD_NO_UNROLL_LOOP_START
    do {
      hc += 4;
      decode += STBIR__horizontal_channels * 4;
      stbir__4_coeff_continue_from_4( 0 );
      --n;
    } while ( n > 0 );
    stbir__2_coeff_remnant( 4 );

    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod3 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  stbir__3_coeff_setup();
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 7 + 3 ) >> 2;
    float const * hc = horizontal_coefficients;

    stbir__4_coeff_start();
    STBIR_SIMD_NO_UNROLL_LOOP_START
    do {
      hc += 4;
      decode += STBIR__horizontal_channels * 4;
      stbir__4_coeff_continue_from_4( 0 );
      --n;
    } while ( n > 0 );
    stbir__3_coeff_remnant( 4 );

    stbir__store_output();
  } while ( output < output_end );
}

static stbir__horizontal_gather_channels_func * STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_funcs)[4]=
{
  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod0),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod1),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod2),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod3),
};

static stbir__horizontal_gather_channels_func * STBIR_chans(stbir__horizontal_gather_,_channels_funcs)[12]=
{
  STBIR_chans(stbir__horizontal_gather_,_channels_with_1_coeff),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_2_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_3_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_4_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_5_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_6_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_7_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_8_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_9_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_10_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_11_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_12_coeffs),
};

#undef STBIR__horizontal_channels
#undef STB_IMAGE_RESIZE_DO_HORIZONTALS
#undef stbir__1_coeff_only
#undef stbir__1_coeff_remnant
#undef stbir__2_coeff_only
#undef stbir__2_coeff_remnant
#undef stbir__3_coeff_only
#undef stbir__3_coeff_remnant
#undef stbir__3_coeff_setup
#undef stbir__4_coeff_start
#undef stbir__4_coeff_continue_from_4
#undef stbir__store_output
#undef stbir__store_output_tiny
#undef STBIR_chans

#endif  // HORIZONTALS

#undef STBIR_strs_join2
#undef STBIR_strs_join1

#endif // STB_IMAGE_RESIZE_DO_HORIZONTALS/VERTICALS/CODERS

/*
------------------------------------------------------------------------------
This software is available under 2 licenses -- choose whichever you prefer.
------------------------------------------------------------------------------
ALTERNATIVE A - MIT License
Copyright (c) 2017 Sean Barrett
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
------------------------------------------------------------------------------
ALTERNATIVE B - Public Domain (www.unlicense.org)
This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or distribute this
software, either in source code form or as a compiled binary, for any purpose,
commercial or non-commercial, and by any means.
In jurisdictions that recognize copyright laws, the author or authors of this
software dedicate any and all copyright interest in the software to the public
domain. We make this dedication for the benefit of the public at large and to
the detriment of our heirs and successors. We intend this dedication to be an
overt act of relinquishment in perpetuity of all present and future rights to
this software under copyright law.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
------------------------------------------------------------------------------
*/