Node.js Add-ons for High Performance Numeric Computing
Overview
Intro
Toolchain
Numeric Computing
Basic Example
BLAS
Performance
Challenges
N-API
Conclusions
This talk will be technical and contains many slides displaying source code. I won't spend much time on those slides, only pausing long enough to highlight key points. If I move too quickly, this talk is online with notes, so you can revisit it during and after this conference.
The talk overview is as follows...
First, I will provide an overview of Node.js native add-ons.
Next, I will introduce the current toolchain for authoring add-ons.
Then, I will discuss why native add-ons are important for numeric computing.
I'll follow by showing a basic native add-on example.
After the basic example, I'll show a more complex example where we need to write an add-on which links a BLAS library written in Fortran to the JavaScript runtime.
Next, I will show performance comparisons.
Then, I'll discuss some of the challenges we have faced writing native add-ons for numeric computing and how we have addressed them.
Before concluding, I will mention N-API, an application binary interface, or ABI, which will provide a stable abstraction layer over JavaScript engines.
And finally, I will offer some conclusions and additional resources you can use to get started using Node.js native add-ons for high-performance numeric computing.
Interface between JS running in Node.js and C/C++ libraries
A Node.js native add-on provides an interface between JavaScript running in Node.js and, primarily, C/C++ libraries.
From the perspective of Node.js applications, an add-on is just another module which an application can require.
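For example, a minimal sketch (the add-on path and exported method name here are purely illustrative):
/* app.js */
// Load the compiled add-on just like any other module:
var addon = require( './path/to/build/Release/addon.node' );

// Invoke an exported method:
var h = addon.hypot( 3.0, 4.0 );
// returns 5.0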
APIs
V8
libuv
Internal Libraries
Dependencies
Why?
Why would you choose a native add-on over plain JavaScript?
Leverage existing codebases
Access lower-level APIs
Non-JavaScript features
Performance
Four primary reasons...
One reason is that you want to link to existing C/C++ libraries. This allows you to avoid having to port and re-implement functionality in JavaScript, which, for larger libraries, can require significant time and investment.
Next, you may want to access lower-level APIs, such as worker threads.
Third, you may need language features not available in JavaScript, such as 64-bit integers or SIMD.
Last, you may need a performance boost, including the ability to leverage hardware optimization.
Now that we have motivated why we may want to use add-ons, how do we go about doing so?
Toolchain
This brings us to the native add-on toolchain.
node-gyp
We begin with node-gyp, which is a cross-platform command-line tool written primarily in JavaScript and is used for compiling native add-ons for Node.js.
node-gyp bundles GYP and automatically downloads necessary development files and headers for target Node.js versions.
GYP
GYP, which stands for "Generate Your Projects", is a meta-build system: a build system which generates other build systems, depending on the target platform.
The aim of GYP is to replicate, as closely as possible, the native build setup of a target platform IDE. For example, on macOS, that means generating Xcode projects. On Windows, Visual Studio projects.
And once GYP generates the build system, we can compile our add-on.
Historically, developing native add-ons has been a difficult process.
Challenges
Here are some of the challenges.
The foremost challenge has been handling breaking changes in V8.
Each major Node.js release has entailed a new version of V8. In the past, the V8 team was not concerned about backward compatibility and would often introduce significant changes, removing, replacing, and adding interfaces and functionality. These changes forced add-on authors to rewrite their packages and publish new major versions, and they made providing backward compatibility extremely difficult.
To alleviate some of the "pain" of native add-ons, members of the Node.js community created a package called NAN, which stands for Native Abstractions for Node.js.
NAN attempts to provide a stable abstraction layer that native add-on authors can target. Internally, NAN handles the complex logic required to maintain functionality from one V8 version to the next.
And while NAN has been beneficial, even it has had breaking changes in its API.
Another issue is GYP. GYP was designed with a particular use case in mind: Chrome. It was not explicitly designed for Node.js add-ons.
Further, GYP documentation is either poor or incomplete, presenting significant challenges whenever you want to do something beyond simple "hello world" type examples.
Because of poor documentation, you spend considerable time searching the Internet and looking at other projects using GYP to see how those projects handle special configurations. And in particular, anytime you want to use GYP to compile languages other than C/C++, e.g., Fortran, CUDA, Rust, and Go, good luck.
Resources are few and far between.
A more forward-looking concern is that node-gyp is biased toward V8, meaning the toolchain is not engine neutral. Accordingly, compiling Node.js and Node.js native add-ons with alternative engines, such as Chakra, is less straightforward, requiring shims like Chakrashim.
Despite these challenges, native add-ons are highly important for numeric computing.
Numeric Computing
Native add-ons are important for numeric computing because they allow us to interface with high-performance numeric computing libraries written in Fortran/C/C++.
What you find when reading the source code of Julia, R, and Python libraries like NumPy and SciPy is that a substantial amount of the functionality they expose relies on providing wrappers for existing numeric computing code bases written in C/C++ and Fortran.
For example, for high-performance linear algebra, these platforms wrap BLAS and LAPACK. For fast Fourier transforms, they wrap FFTW. For BigInt, Julia wraps GMP. For BigFloat, Julia wraps MPFR.
Node.js native add-ons allow us to do something similar; namely, we can expose high-performance numeric computing functionality to Node.js and to JavaScript.
This means we can leverage highly optimized libraries which have been used with great success for decades and not spend time rewriting implementations.
In summary, native add-ons allow us to do in Node.js what other environments used for numeric computing can do.
At this point, we have discussed, at a high-level, what native add-ons are, their toolchain, some challenges, and motivated why they are important for numeric computing. Let's now discuss a basic example...
/* hypot.h */
#ifndef C_HYPOT_H
#define C_HYPOT_H
#ifdef __cplusplus
extern "C" {
#endif
double c_hypot( const double x, const double y );
#ifdef __cplusplus
}
#endif
#endif
The example I am going to use is a simple function to compute the hypotenuse, avoiding underflow and overflow.
We first define a basic header file declaring the interface of the function exported to the JavaScript runtime, taking care to guard against C++ name mangling and thus ensure the same behavior as would be observed when using a standard C compiler.
/* hypot.c */
#include <math.h>
#include "hypot.h"
double c_hypot( const double x, const double y ) {
    double tmp;
    double a;
    double b;
    // If either argument is NaN, return NaN:
    if ( isnan( x ) || isnan( y ) ) {
        return NAN;
    }
    // If either argument is infinite, return positive infinity:
    if ( isinf( x ) || isinf( y ) ) {
        return INFINITY;
    }
    a = x;
    b = y;
    // Compute absolute values:
    if ( a < 0.0 ) {
        a = -a;
    }
    if ( b < 0.0 ) {
        b = -b;
    }
    // Ensure `a` has the larger magnitude:
    if ( a < b ) {
        tmp = b;
        b = a;
        a = tmp;
    }
    // If both arguments are zero, avoid dividing by zero:
    if ( a == 0.0 ) {
        return 0.0;
    }
    // Scale by the larger magnitude to avoid underflow and overflow:
    b /= a;
    return a * sqrt( 1.0 + (b*b) );
}
Next, we write our implementation. This is a standard C implementation which includes the standard math header and defines a function which accepts two arguments, x and y, and returns a numeric result.
/* addon.cpp */
#include <nan.h>
#include "hypot.h"
namespace addon_hypot {
using Nan::FunctionCallbackInfo;
using Nan::ThrowTypeError;
using Nan::ThrowError;
using v8::Number;
using v8::Local;
using v8::Value;
void node_hypot( const FunctionCallbackInfo<Value>& info ) {
if ( info.Length() != 2 ) {
ThrowError( "invalid invocation. Must provide 2 arguments." );
return;
}
if ( !info[ 0 ]->IsNumber() ) {
ThrowTypeError( "invalid input argument. First argument must be a number." );
return;
}
if ( !info[ 1 ]->IsNumber() ) {
ThrowTypeError( "invalid input argument. Second argument must be a number." );
return;
}
const double x = info[ 0 ]->NumberValue();
const double y = info[ 1 ]->NumberValue();
Local<Number> h = Nan::New( c_hypot( x, y ) );
info.GetReturnValue().Set( h );
}
NAN_MODULE_INIT( Init ) {
Nan::Export( target, "hypot", node_hypot );
}
NODE_MODULE( addon, Init )
}
Once our implementation is finished, we create a wrapper written in C++ which calls the C function.
Note the inclusion of NAN and recall that NAN provides a stable API across V8 versions.
Most of the C++ is unwrapping and wrapping object values. The function wrapper takes a single argument: the arguments object. We perform basic input value sanity checks and then proceed to unwrap the individual arguments x and y. Once we have x and y, we call our C function and set the return value.
And finally, we end by exporting an initialization function, which is required of all Node.js native add-ons.
I should note that we did not have to write the implementation in a separate C file. We could have written it directly in our add-on file; however, using a separate file a) facilitates re-usability of source files in non-add-on contexts and b) is a more common scenario when working with existing codebases.
# binding.gyp
{
'targets': [
{
'target_name': 'addon',
'sources': [
'addon.cpp',
'hypot.c'
],
'include_dirs': [
'<!(node -e "require(\'nan\')")',
'./'
]
}
]
}
We then create a minimal binding.gyp file, which GYP uses to generate the build project for the target platform.
In GYP terminology, we define a target--here, the target name is "addon".
The sources field indicates each source file which should be compiled.
And finally, we list the directories which contain header files. Here, we use GYP command expansion to instruct Node to evaluate the string containing the require statement. When NAN is required, it prints the location of its header files to stdout, thus allowing us to dynamically resolve the NAN header file location.
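For reference, the mechanism behind this trick is simple: when loaded, NAN's entry point writes its own directory to stdout, which GYP then captures. A rough sketch of an index.js implementing this behavior (a simplification for illustration; consult the nan package for the exact source):
/* index.js */
// Print the directory containing nan.h so that build tools can capture it:
console.log( require( 'path' ).relative( '.', __dirname ) );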
# Navigate to add-on directory:
$ cd path/to/hypot
# Generate build files:
$ node-gyp configure
# On Windows:
# node-gyp configure --msvs_version=2015
# Build add-on:
$ node-gyp build
Once we have created the four files, we can now build our add-on.
To do so, we navigate to the add-on directory containing the binding.gyp file.
Then we run the configure subcommand, which generates the build files. On Unix platforms, this will be a Makefile. On Windows, this will be a vcxproj file.
Finally, we run build to actually compile the add-on.
/* hypot.js */
var hypot = require( './path/to/build/Release/addon.node' ).hypot;
var h = hypot( 5.0, 12.0 );
// returns 13.0
After compiling an add-on, we can use the exported function in JavaScript.
As may be observed in the require statement, the add-on exports an object with a method whose name matches the export we defined when we wrote our addon.cpp file.
When invoked, this function behaves just like a comparable function implemented in JavaScript, accepting numeric arguments and returning a number.
Implementation   ops/sec     perf
Builtin          3,954,799   1x
Native           4,732,108   1.2x
JavaScript       7,337,790   1.85x
Okay, so we have implemented our add-on and we're ready to use it, but we should ask ourselves: how does the performance compare to an implementation written purely in JavaScript?
Here are benchmark results run on my laptop running Node.js version 8, which uses one of the latest versions of V8.
In the first row, I am showing the results for the builtin hypot function (Math.hypot) provided by the JavaScript standard math library. We see that, on my machine, we can compute the hypotenuse around 4 million times per second.
On the next row, I am showing the results of our native add-on. We see that we get a slight performance boost of around 800,000 operations per second.
Lastly, on the third row, I am showing the results of an equivalent implementation written purely in JavaScript. We can see that, when compared to add-on performance, we can perform roughly 2.6 million more operations per second, which is a significant performance boost.
Two comments. First, simply because we can write code in C, this does not mean we will achieve better performance by doing so, due to overhead when calling into an add-on. Second, simply because something is a standard, this does not mean the function is fast, in an absolute sense. You can often achieve better performance in JavaScript via userland implementations by restricting the domain of input argument types and choosing your algorithms wisely.
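To make this concrete, a userland JavaScript implementation mirroring the C implementation shown earlier might look like the following sketch (not necessarily the exact benchmarked code):
/* hypot.js */
var sqrt = Math.sqrt;

// Computes the hypotenuse, avoiding underflow and overflow:
function hypot( x, y ) {
    var tmp;
    if ( x !== x || y !== y ) {
        // One or both arguments are NaN:
        return NaN;
    }
    if ( x === Infinity || x === -Infinity || y === Infinity || y === -Infinity ) {
        return Infinity;
    }
    // Compute absolute values:
    if ( x < 0.0 ) {
        x = -x;
    }
    if ( y < 0.0 ) {
        y = -y;
    }
    // Ensure `x` has the larger magnitude:
    if ( x < y ) {
        tmp = y;
        y = x;
        x = tmp;
    }
    if ( x === 0.0 ) {
        return 0.0;
    }
    // Scale by the larger magnitude to avoid underflow and overflow:
    y /= x;
    return x * sqrt( 1.0 + (y*y) );
}

module.exports = hypot;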
BLAS, or Basic Linear Algebra Subprograms, are routines that provide standard implementations for performing basic vector and matrix operations.
BLAS routines are split into three levels. Level 1 BLAS routines perform scalar, vector and vector-vector operations. Level 2 BLAS routines perform matrix-vector operations. Level 3 BLAS routines perform matrix-matrix operations.
BLAS routines are commonly used in the development of high quality linear algebra software (e.g., LAPACK) due to their efficiency, portability, and wide availability. And they are foundational for most modern numeric computing environments.
! dasum.f
! Computes the sum of absolute values.
double precision function dasum( N, dx, stride )
implicit none
integer :: stride, N
double precision :: dx(*)
double precision :: dtemp
integer :: nstride, mp1, m, i
intrinsic dabs, mod
! ..
dasum = 0.0d0
dtemp = 0.0d0
! ..
if ( N <= 0 .OR. stride <= 0 ) then
return
end if
! ..
if ( stride == 1 ) then
m = mod( N, 6 )
if ( m /= 0 ) then
do i = 1, m
dtemp = dtemp + dabs( dx( i ) )
end do
if ( N < 6 ) then
dasum = dtemp
return
end if
end if
mp1 = m + 1
do i = mp1, N, 6
dtemp = dtemp + &
dabs( dx( i ) ) + dabs( dx( i+1 ) ) + &
dabs( dx( i+2 ) ) + dabs( dx( i+3 ) ) + &
dabs( dx( i+4 ) ) + dabs( dx( i+5 ) )
end do
else
nstride = N * stride
do i = 1, nstride, stride
dtemp = dtemp + dabs( dx( i ) )
end do
end if
dasum = dtemp
return
end function dasum
An example BLAS routine is the Level 1 function dasum, which stands for double absolute value sum. As suggested by the name, the function computes the sum of absolute values for a vector whose values are of type double.
The algorithm is not particularly interesting, but should provide an illustrative example as to how we might expose this or a similar function to JavaScript.
! dasumsub.f
! Wraps dasum as a subroutine.
subroutine dasumsub( N, dx, stride, sum )
implicit none
! ..
interface
double precision function dasum( N, dx, stride )
integer :: stride, N
double precision :: dx(*)
end function dasum
end interface
! ..
integer :: stride, N
double precision :: sum
double precision :: dx(*)
! ..
sum = dasum( N, dx, stride )
return
end subroutine dasumsub
Recall that Node.js native add-ons are intended to provide C/C++ bindings. Accordingly, the first thing we need to do is provide a C interface to the Fortran function.
The first obstacle in doing so is that we cannot use dasum directly from C because Fortran expects arguments to be passed by reference rather than by value. Furthermore, while not applicable here, Fortran functions can only return scalar values, not arrays. Thus, the general best practice is to wrap a Fortran function as a subroutine (the equivalent of a C function returning void), to which we can pass a pointer for storing the output return value.
Hence, the first thing we have to do is wrap our Fortran function as a Fortran subroutine.
/* dasum_fortran.h */
#ifndef DASUM_FORTRAN_H
#define DASUM_FORTRAN_H
#ifdef __cplusplus
extern "C" {
#endif
void dasumsub( const int *, const double *, const int *, double * );
#ifdef __cplusplus
}
#endif
#endif
Similar to our hypot example, we create a header file defining the subroutine prototype.
/* dasum.h */
#ifndef DASUM_H
#define DASUM_H
#ifdef __cplusplus
extern "C" {
#endif
double c_dasum( const int N, const double *X, const int stride );
#ifdef __cplusplus
}
#endif
#endif
We can now create the header file containing the prototype for our C wrapper, using the naming convention in which we attach c_ as a prefix to the function name.
/* dasum_f.c */
#include "dasum.h"
#include "dasum_fortran.h"
double c_dasum( const int N, const double *X, const int stride ) {
    double sum;
    // Pass scalar arguments by reference, as expected by Fortran:
    dasumsub( &N, X, &stride, &sum );
    return sum;
}
Our C wrapper is straightforward, passing the scalar arguments by reference to the Fortran subroutine, along with a pointer for storing the output sum.
/* addon.cpp */
#include <nan.h>
#include "dasum.h"
namespace addon_dasum {
using Nan::FunctionCallbackInfo;
using Nan::TypedArrayContents;
using Nan::ThrowTypeError;
using Nan::ThrowError;
using v8::Number;
using v8::Local;
using v8::Value;
void node_dasum( const FunctionCallbackInfo<Value>& info ) {
if ( info.Length() != 3 ) {
ThrowError( "invalid invocation. Must provide 3 arguments." );
return;
}
if ( !info[ 0 ]->IsNumber() ) {
ThrowTypeError( "invalid input argument. First argument must be a number." );
return;
}
if ( !info[ 2 ]->IsNumber() ) {
ThrowTypeError( "invalid input argument. Third argument must be a number." );
return;
}
const int N = info[ 0 ]->Uint32Value();
const int stride = info[ 2 ]->Uint32Value();
TypedArrayContents<double> X( info[ 1 ] );
Local<Number> sum = Nan::New( c_dasum( N, *X, stride ) );
info.GetReturnValue().Set( sum );
}
NAN_MODULE_INIT( Init ) {
Nan::Export( target, "dasum", node_dasum );
}
NODE_MODULE( addon, Init )
}
Now that we have our C interface, we can create our add-on wrapper.
Similar to before, we use NAN.
And similar to before, we perform some basic input argument sanity checks before unwrapping input values. One thing to note is that we need to re-purpose the underlying TypedArray buffer as a C vector. This can be a relatively expensive operation, especially for small vectors.
Once we have unwrapped our input arguments, we pass them to our C function and set the return value.
$ gfortran \
-std=f95 \
-ffree-form \
-O3 \
-Wall \
-Wextra \
-Wimplicit-interface \
-fno-underscoring \
-pedantic \
-fPIC \
-c \
-o dasum.o \
dasum.f
$ gfortran \
-std=f95 \
-ffree-form \
-O3 \
-Wall \
-Wextra \
-Wimplicit-interface \
-fno-underscoring \
-pedantic \
-fPIC \
-c \
-o dasumsub.o \
dasumsub.f
$ gcc \
-std=c99 \
-O3 \
-Wall \
-pedantic \
-fPIC \
-I ../include \
-c \
-o dasum_f.o \
dasum_f.c
$ gcc -shared -o dasum.so dasum.o dasumsub.o dasum_f.o -lgfortran
Compiling our add-on is not as straightforward as before. Recall that I mentioned that GYP is oriented toward C/C++, and, here, we have to compile Fortran. Accordingly, we'll need to teach GYP how to compile Fortran, and our configuration will become considerably more complex.
Forgetting the add-on for a second, if we were going to compile just the C and Fortran, we would do something like the following.
First, we would need to compile our Fortran files, specifying various command-line options.
Next, we would compile our C files, once again specifying various command-line options.
After compiling our source files, we would link them together into a single library, making sure to include the standard Fortran libraries.
To compile our add-on, we will need to translate this sequence, or something similar, to a GYP configuration file.
# binding.gyp
{
'variables': {
'addon_target_name%': 'addon',
'addon_output_dir': './src',
'fortran_compiler%': 'gfortran',
'fflags': [
'-std=f95',
'-ffree-form',
'-O3',
'-Wall',
'-Wextra',
'-Wimplicit-interface',
'-fno-underscoring',
'-pedantic',
'-c',
],
'conditions': [
[
'OS=="win"',
{
'obj': 'obj',
},
{
'obj': 'o',
}
],
],
},
We begin by defining variables.
While GYP automatically sets C/C++ compiler flags, we must explicitly list the Fortran compiler flags and define the Fortran compiler we want to use.
# binding.gyp (cont.)
'targets': [
{
'target_name': '<(addon_target_name)',
'dependencies': [],
'include_dirs': [
'<!(node -e "require(\'nan\')")',
'../include',
],
'sources': [
'dasum.f',
'dasumsub.f',
'dasum_f.c',
'addon.cpp'
],
'link_settings': {
'libraries': [
'-lgfortran',
],
'library_dirs': [],
},
'cflags': [
'-Wall',
'-O3',
],
'cflags_c': [
'-std=c99',
],
'cflags_cc': [
'-std=c++11',
],
'ldflags': [],
'conditions': [
[
'OS=="mac"',
{
'ldflags': [
'-undefined dynamic_lookup',
'-Wl,-no-pie',
'-Wl,-search_paths_first',
],
},
],
[
'OS!="win"',
{
'cflags': [
'-fPIC',
],
},
],
],
After defining variables, we can begin defining targets.
Similar to before, we define the target name, this time using variable expansion, and list the source files to compile.
We then define various command-line flags depending on the host platform.
# binding.gyp (cont.)
'rules': [
{
'extension': 'f',
'inputs': [
'<(RULE_INPUT_PATH)'
],
'outputs': [
'<(INTERMEDIATE_DIR)/<(RULE_INPUT_ROOT).<(obj)'
],
'conditions': [
[
'OS=="win"',
{
'rule_name': 'compile_fortran_windows',
'process_outputs_as_sources': 0,
'action': [
'<(fortran_compiler)',
'<@(fflags)',
'<@(_inputs)',
'-o',
'<@(_outputs)',
],
},
{
'rule_name': 'compile_fortran_linux',
'process_outputs_as_sources': 1,
'action': [
'<(fortran_compiler)',
'<@(fflags)',
'-fPIC',
'<@(_inputs)',
'-o',
'<@(_outputs)',
],
}
],
],
},
],
},
In order to compile the Fortran files, we have to tell GYP how to process them, and we do so by defining a rule which is triggered based on a file's filename extension.
We explicitly specify the input and output arguments which will be used in command execution, using GYP-defined variables.
Next, we define the action to take (i.e., the compile command to invoke) based on the target platform.
Now, when GYP creates the add-on target, it will compile Fortran files using the specified Fortran compiler and flags.
$ cd path/to/dasum
$ node-gyp configure
# node-gyp configure --msvs_version=2015
$ node-gyp build
Similar to hypot, to build the add-on, we navigate to the add-on directory containing the binding.gyp file, generate the build files using the configure subcommand, and run build to compile the add-on.
/* dasum.js */
var dasum = require( './path/to/src/addon.node' ).dasum;
var x = new Float64Array( [ 1.0, -2.0, 3.0, -4.0, 5.0 ] );
var s = dasum( x.length, x, 1 );
// returns 15.0
To use the add-on, we require the add-on and invoke the exported method.
Length     JavaScript   Native      Perf
10         22,438,020   7,435,590   0.33x
100        4,350,384    4,594,292   1.05x
1,000      481,417      827,513     1.71x
10,000     28,186       97,695      3.46x
100,000    1,617        9,471       5.85x
1,000,000  153          873         5.7x
To measure add-on performance, we benchmark against an equivalent implementation written in plain JavaScript. Each row in the table corresponds to an input array length. The two middle columns correspond to operations per second. And the last column is the relative performance of the native add-on to the JavaScript implementation.
As we can see, for small arrays, JavaScript is significantly faster, but that advantage disappears as soon as an input array has 100 elements.
As I mentioned earlier, array unwrapping and reinterpretation as a C vector can have a significant impact on performance for small arrays. However, that cost is largely constant, becoming negligible as array length increases.
For large input arrays, the add-on is significantly more performant, nearly 6 times more performant than the equivalent JavaScript implementation.
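For reference, an equivalent plain JavaScript implementation, such as the one benchmarked here, might look like the following sketch:
/* dasum.js */
// Computes the sum of absolute values:
function dasum( N, dx, stride ) {
    var sum;
    var i;
    sum = 0.0;
    if ( N <= 0 || stride <= 0 ) {
        return sum;
    }
    for ( i = 0; i < N*stride; i += stride ) {
        sum += Math.abs( dx[ i ] );
    }
    return sum;
}

module.exports = dasum;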
/* dasum_cblas.h */
#ifndef DASUM_CBLAS_H
#define DASUM_CBLAS_H
#ifdef __cplusplus
extern "C" {
#endif
double cblas_dasum( const int N, const double *X, const int stride );
#ifdef __cplusplus
}
#endif
#endif
Our BLAS journey is not, however, over. The Fortran reference implementation does not take into account hardware capabilities or chip architecture and, thus, is not the most performant.
For optimal performance, we would rather use hardware optimized BLAS libraries, if available. For instance, on MacOS, we could use the Apple Accelerate Framework. On Intel chips, we could use Intel's Math Kernel Library (MKL). For a cross-platform hardware optimized library, we could use OpenBLAS.
As an example, if we wanted to use the Apple Accelerate Framework, we could proceed as follows.
First, we need to create a header file defining the prototype of the function we want to use. The function signature is the same as before, but now we are using the CBLAS naming convention.
/* dasum_cblas.c */
#include "dasum.h"
#include "dasum_cblas.h"
double c_dasum( const int N, const double *X, const int stride ) {
return cblas_dasum( N, X, stride );
}
Next, to avoid having to create multiple add-on files, we create a wrapper having the same name, c_dasum, as before.
# binding.gyp
{
'variables': {
'addon_target_name%': 'addon',
'addon_output_dir': './src',
},
'targets': [
{
'target_name': '<(addon_target_name)',
'dependencies': [],
'include_dirs': [
'<!(node -e "require(\'nan\')")',
'./../include',
],
'sources': [
'dasum_cblas.c',
'addon.cpp'
],
'link_settings': {
'libraries': [
'-lblas',
],
'library_dirs': [],
},
'cflags': [
'-Wall',
'-O3',
],
'cflags_c': [
'-std=c99',
],
'cflags_cc': [
'-std=c++11',
],
'ldflags': [
'-undefined dynamic_lookup',
'-Wl,-no-pie',
'-Wl,-search_paths_first'
],
},
{
'target_name': 'copy_addon',
'type': 'none',
'dependencies': [
'<(addon_target_name)',
],
'actions': [
{
'action_name': 'copy_addon',
'inputs': [],
'outputs': [
'<(addon_output_dir)/<(addon_target_name).node',
],
'action': [
'cp',
'<(PRODUCT_DIR)/<(addon_target_name).node',
'<(addon_output_dir)/<(addon_target_name).node',
],
},
],
},
],
}
We can modify the binding.gyp file to no longer include configuration settings and rules for compiling Fortran files.
Instead, we specify the library we want to link to and update the source file list.
Building and compiling the add-on follows the same procedure as before.
Length     JavaScript   wasm         Native      Perf
10         22,438,020   18,226,375   7,084,870   0.31x
100        4,350,384    6,428,586    6,428,626   1.47x
1,000      481,417      997,234      3,289,090   6.83x
10,000     28,186       110,540      355,172     12.60x
100,000    1,617        11,157       30,058      18.58x
1,000,000  153          979          1,850       12.09x
When we benchmark the hardware optimized BLAS libraries against equivalent implementations in JavaScript, we get the following results.
As with the reference implementation, the add-on is slower for short array lengths.
However, as we increase the array length, the add-on achieves significantly better performance, beating JavaScript even at an array length of 100, and it outperforms the reference implementation add-on as well.
Note that I have also included WebAssembly benchmarks. For those hoping that WebAssembly will remove the need for native add-ons and provide equivalent performance, you are mistaken.
The main conclusion of these results is to use a hardware optimized library when available. These results are simply not possible otherwise.
Challenges
Bugs
Standards
Proprietary
Windows
Portability
Complexity
At this point, you may be excited seeing a nearly 20x improvement. One small problem, however: detecting and/or installing hardware optimized libraries is hard.
The first problem is that some hardware optimized libraries contain bugs, so you need to provide patches; e.g., Apple Accelerate Framework.
Next, resolving library installation locations in a robust cross-platform way is difficult, as no standard locations or naming conventions exist.
Third, some hardware optimized libraries are proprietary and cannot be guaranteed to exist on a target platform.
Fourth, hardware optimized BLAS on Windows is especially painful. And in fact, in general, Fortran BLAS is painful on Windows, and node-gyp cannot compile Fortran on Windows due to node-gyp's dependency on Microsoft Visual Studio, which does not include a Fortran compiler.
Fifth, while OpenBLAS is close, there is no fully robust and fully cross-platform hardware optimized BLAS library that you can install alongside your add-on.
...which means that you always need to ship a reference implementation fallback, and, for those environments where you cannot compile your native add-on, you also need to ship a pure JavaScript fallback.
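A common pattern for wiring up these fallbacks is to attempt loading the native add-on at require time and, on failure, to fall back to the JavaScript implementation. A minimal sketch (module paths are hypothetical):
/* index.js */
var dasum;
try {
    // Prefer the compiled add-on when available:
    dasum = require( './path/to/src/addon.node' ).dasum;
} catch ( err ) {
    // Otherwise, fall back to a pure JavaScript implementation:
    dasum = require( './path/to/dasum.js' );
}
module.exports = dasum;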
In short, to handle cross-platform complexity, your binding.gyp files become complex very quickly.
A Node API for Node.js native add-ons.
Features
Stability
Compatibility
VM Neutrality
A stable API abstraction, similar in its goals to NAN.
Compatibility across Node versions.
Key differentiator: same API across Node VMs; e.g., V8, Chakra, etc.
In short, N-API promises native add-ons which simply work. :)
/* addon.cpp */
#include <node_api.h>
#include <assert.h>
#include "hypot.h"
namespace addon_hypot {
napi_value node_hypot( napi_env env, napi_callback_info info ) {
napi_status status;
size_t argc = 2;
napi_value args[ 2 ];
status = napi_get_cb_info( env, info, &argc, args, nullptr, nullptr );
assert( status == napi_ok );
if ( argc < 2 ) {
napi_throw_type_error( env, "invalid invocation. Must provide 2 arguments." );
return nullptr;
}
napi_valuetype vtype0;
status = napi_typeof( env, args[ 0 ], &vtype0 );
assert( status == napi_ok );
if ( vtype0 != napi_number ) {
napi_throw_type_error( env, "invalid input argument. First argument must be a number." );
return nullptr;
}
napi_valuetype vtype1;
status = napi_typeof( env, args[ 1 ], &vtype1 );
assert( status == napi_ok );
if ( vtype1 != napi_number ) {
napi_throw_type_error( env, "invalid input argument. Second argument must be a number." );
return nullptr;
}
double x;
status = napi_get_value_double( env, args[ 0 ], &x );
assert( status == napi_ok );
double y;
status = napi_get_value_double( env, args[ 1 ], &y );
assert( status == napi_ok );
napi_value h;
status = napi_create_number( env, c_hypot( x, y ), &h );
assert( status == napi_ok );
return h;
}
#define DECLARE_NAPI_METHOD( name, func ) { name, 0, func, 0, 0, 0, napi_default, 0 }
void Init( napi_env env, napi_value exports, napi_value module, void* priv ) {
napi_status status;
napi_property_descriptor addDescriptor = DECLARE_NAPI_METHOD( "hypot", node_hypot );
status = napi_define_properties( env, exports, 1, &addDescriptor );
assert( status == napi_ok );
}
NAPI_MODULE( addon, Init )
}
As an example of what an N-API add-on might look like, and I say might because the implementation is still experimental, here is the hypot add-on refactored from NAN to N-API.
The first notable difference is that we no longer directly call V8 methods, and, instead, everything goes through N-API.
The second notable difference is the usage of return value references and the returning of status values.
Otherwise, we still need to export an initialization function and an add-on still follows the same general structure.
Conclusions
Parity
Performance
Progress