Faster than NumPy

High-performance numerical computation in JavaScript

About me.

Quansight
Quansight Labs
Data APIs
stdlib

CPU-based library for performing numerical computation on multidimensional arrays.

ndarray

Demos

Multidimensional arrays


arr = [
	[ 1, 2, 3 ],
	[ 4, 5, 6 ],
	[ 7, 8, 9 ]
];

// Get individual item:
v = arr[1][1];
// returns 5

// Get row:
row = arr[1];
// returns [ 4, 5, 6 ]

// Get column (no direct access; must gather elements!):
col = [ arr[0][1], arr[1][1], arr[2][1] ];
// returns [ 2, 5, 8 ]

// Column-major (nest by column instead of by row):
arr = [
	[ 1, 4, 7 ],
	[ 2, 5, 8 ],
	[ 3, 6, 9 ]
];
						

Let \( n \) be the number of elements per dimension and \( d \) be the number of dimensions.

  • \( O(d) \) — data access
  • \( O(n^{d}d) \) — traversal
  • \( O(n^{d}) \) — slicing
  • \( O(n^{d-1}) \) — extra storage requirements
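
A minimal sketch (hypothetical helpers, not a library API) of where these costs come from: element access chases one nested reference per dimension, while extracting a column must gather elements into newly allocated storage.

// Generic element access: one lookup per dimension:
function get( arr, subscripts ) {
	let v = arr;
	for ( let i = 0; i < subscripts.length; i++ ) {
		v = v[ subscripts[ i ] ];
	}
	return v;
}

// Column extraction: gathers (and copies) one element per row:
function column( arr, j ) {
	const out = [];
	for ( let i = 0; i < arr.length; i++ ) {
		out.push( arr[ i ][ j ] );
	}
	return out;
}

v = get( [ [ 1, 2, 3 ], [ 4, 5, 6 ], [ 7, 8, 9 ] ], [ 1, 1 ] );
// returns 5

col = column( [ [ 1, 2, 3 ], [ 4, 5, 6 ], [ 7, 8, 9 ] ], 1 );
// returns [ 2, 5, 8 ]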

Implicit arrays


arr = [
	1, 2, 3,
	4, 5, 6,
	7, 8, 9
];

// Get individual item:
v = arr[ (1*3) + 1 ];
// returns 5

// Get row:
row = [arr[(1*3)+0], arr[(1*3)+1], arr[(1*3)+2]];
// returns [ 4, 5, 6 ]

// Get column:
col = [arr[(0*3)+1], arr[(1*3)+1], arr[(2*3)+1]];
// returns [ 2, 5, 8 ]
						

Let \( n \) be the number of elements per dimension, \( d \) be the number of dimensions, and \( B \) be the cache line size.

  • \( O(1) \) — data access
  • \( O(\frac{n^d}{B}) \) — traversal
  • \( O(\frac{n^d}{B}) \) — slicing
  • \( O(0) \) — extra storage requirements
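
As a sketch (hypothetical helper, not a library API), the `(1*3) + 1` pattern generalizes to any row-major shape: the linear index is pure arithmetic on the subscripts, with no intermediate array objects.

function flatIndex( shape, subscripts ) {
	let idx = 0;
	for ( let i = 0; i < shape.length; i++ ) {
		idx = ( idx * shape[ i ] ) + subscripts[ i ];
	}
	return idx;
}

// Get an individual item from the flat `arr` above:
v = arr[ flatIndex( [ 3, 3 ], [ 1, 1 ] ) ];
// returns 5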

Strided arrays


x = [
	1, 2, 3,
	4, 5, 6,
	7, 8, 9
];

// Define metadata:
shx = [ 3, 3 ]; // shape
ox = 0;         // offset
sx = [ 3, 1 ];  // strides (row-major/lexicographic)

// Get individual item:
v = x[ (1*sx[0]) + (1*sx[1]) + ox ];
// returns 5

// Define row:
shr = [ shx[1] ];     // [ 3 ]
sr = [ sx[1] ];       // [ 1 ]
or = (1*sx[0]) + ox;  // 3

// Define column:
shc = [ shx[0] ];     // [ 3 ]
sc = [ sx[0] ];       // [ 3 ]
oc = (1*sx[1]) + ox;  // 1
						

\[ \textrm{index}(x_{i_0 i_1 \dots i_{d-1}}) = o + \sum^{d-1}_{k = 0} s_k i_k \] where \( d \) is the number of dimensions, \( o \) is the offset, \( s_k \) are the strides, and \( i_k \) are the dimension subscripts.
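
As a sketch, the formula translates directly to code (hypothetical helper, not part of stdlib):

function index( offset, strides, subscripts ) {
	let idx = offset;
	for ( let k = 0; k < strides.length; k++ ) {
		idx += strides[ k ] * subscripts[ k ];
	}
	return idx;
}

// Get an individual item from `x` using the metadata defined above:
v = x[ index( ox, sx, [ 1, 1 ] ) ];
// returns 5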

Let \( n \) be the number of elements per dimension, \( d \) be the number of dimensions, and \( B \) be the cache line size.

  • \( O(1) \) — data access
  • \( O(\frac{n^d}{B}) \) — traversal
  • \( O(d) \) — slicing
  • \( O(d) \) — extra storage requirements

// Reverse an array...
shape = [8]; strides = [1]; offset = 0;

// Negate strides and adjust offset:
strides[0] *= -1;
offset = 7;


// Matrix transpose...
shape = [3, 3]; strides = [3, 1]; offset = 0;

// Swap strides (and dimensions):
strides = [1, 3];


// Reshape an array...
shape = [3, 3]; strides = [3, 1]; offset = 0;

// Update shape and strides:
shape = [3, 1, 3];
strides = [3, 3, 1];


// Access a subarray (slice)...
shape = [3, 3]; strides = [3, 1]; offset = 0;

// Update shape and offset:
shape = [2, 2];
offset = 4;
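
// Reading the 2x2 slice (using the hypothetical `index` helper sketched
// earlier) confirms that only metadata changed; the buffer `x` was not copied:
v = x[ index( offset, strides, [ 0, 0 ] ) ];
// returns 5

v = x[ index( offset, strides, [ 1, 1 ] ) ];
// returns 9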
						

Vectorization

Parallelization

Hardware Optimization

Broadcasting


import array from '@stdlib/ndarray-array';

// Create a 4-dimensional array:
const arr = array({
    'dtype': 'float32',
    'shape': [ 3, 3, 3, 3 ]
});

// Retrieve the array shape:
const shape = arr.shape;
// returns [ 3, 3, 3, 3 ]

// Retrieve the array strides:
const strides = arr.strides;
// returns [ 27, 9, 3, 1 ]

// Retrieve the array offset:
const offset = arr.offset;
// returns 0

// Retrieve the array dtype:
const dtype = arr.dtype;
// returns 'float32'

// Retrieve the underlying array data:
const data = arr.data;
// returns <Float32Array>[ 0, ..., 0 ]

// Retrieve the array byte length:
const byteLength = arr.byteLength;
// returns 324

// Retrieve an array value:
let v = arr.get( 1, 2, 1, 2 );
// returns 0.0

// Set an array value:
arr.set( 1, 2, 1, 2, 10.0 );

// Retrieve the array value:
v = arr.get( 1, 2, 1, 2 );
// returns 10.0

// Serialize as JSON:
const json = arr.toJSON();
// returns {"type":"ndarray","dtype":"float32","flags":{},"order":"row-major","shape":[3,3,3,3],"strides":[27,9,3,1],"data":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}

// Create a 2x2 matrix:
const arr2 = array([
	[ 1.0, 2.0 ],
	[ 3.0, 4.0 ]
]);
// returns <ndarray>[ 1.0, 2.0; 3.0, 4.0 ]
						

ndarray


import array from '@stdlib/ndarray-array';
import abs from '@stdlib/math-special-abs';

// Create a one-dimensional vector:
let x = array([ -1.0, -2.0 ]);
// returns <ndarray>[ -1.0, -2.0 ]

// Retrieve the shape:
let shape = x.shape;
// returns [ 2 ]

// Retrieve the dtype:
let dtype = x.dtype;
// returns 'float64'

// Compute the absolute value for each element:
let y = abs( x );
// returns <ndarray>[ 1.0, 2.0 ]

// Retrieve the shape:
shape = y.shape;
// returns [ 2 ]

// Define an output array:
y = array({
	'shape': [ 4, 2 ]
});
// returns <ndarray>[ 0.0, 0.0; 0.0, 0.0; 0.0, 0.0; 0.0, 0.0 ]

// Broadcast results across rows:
abs.assign( x, y );

// Retrieve values:
let v = y.get(0, 0);
// returns 1.0

v = y.get(0, 1);
// returns 2.0

v = y.get(1, 0);
// returns 1.0

v = y.get(3, 1);
// returns 2.0
						

Vectorization and Broadcasting


import array from '@stdlib/ndarray-array';
import ddot from '@stdlib/blas-ddot';

// Create two one-dimensional vectors:
let x = array([ 1.0, 2.0, 3.0, 4.0 ]);
let y = array([ 5.0, 6.0, 7.0, 8.0 ]);

// Compute the dot product:
let v = ddot(x, y);
// returns 70

// Import a lower-level ndarray constructor:
import ndarray from '@stdlib/ndarray-ctor';

// Create a reverse view atop `y`:
let yr = ndarray(
	y.dtype,
	y.data,
	y.shape,
	[-1],
	y.length-1,
	y.order
);
// returns <ndarray>[ 8.0, 7.0, 6.0, 5.0 ]

// Retrieve the first element:
v = yr.get(0);
// returns 8.0

// Retrieve the underlying data:
let data = yr.data;
// returns <Float64Array>[ 5.0, 6.0, 7.0, 8.0 ]

// Confirm that `yr` is a view:
let bool = ( y.data === yr.data );
// returns true

// Compute the dot product with the reversed vector:
v = ddot(x, yr);
// returns 60

// Import a lower-level BLAS interface:
import daxpy from '@stdlib/blas-base-daxpy';

// Compute `y = a*x + y`:
daxpy.ndarray(
	x.length,
	5.0,
	x.data,
	x.strides[ 0 ],
	x.offset,
	y.data,
	y.strides[ 0 ],
	y.offset
);

// Retrieve the underlying data of `y`:
data = y.data;
// returns <Float64Array>[ 10.0, 16.0, 22.0, 28.0 ]

// Compute the dot product with the reversed vector:
v = ddot(x, yr);
// returns 160
						

Hardware Optimization

Performance results

Not (yet) faster than NumPy


#define BINARY_DEFS\
    char *ip1 = args[0], *ip2 = args[1], *op1 = args[2];\
    npy_intp is1 = steps[0], is2 = steps[1], os1 = steps[2];\
    npy_intp n = dimensions[0];\
    npy_intp i;\

#define BINARY_LOOP_SLIDING\
    for(i = 0; i < n; i++, ip1 += is1, ip2 += is2, op1 += os1)

/** (ip1, ip2) -> (op1) */
#define BINARY_LOOP\
    BINARY_DEFS\
    BINARY_LOOP_SLIDING
						

/**begin repeat
 * Float types
 *  #type = npy_float, npy_double#
 *  #TYPE = FLOAT, DOUBLE#
 *  #c = f, #
 *  #C = F, #
 */
/**begin repeat1
 * Arithmetic
 * # kind = add, subtract, multiply, divide#
 * # OP = +, -, *, /#
 * # PW = 1, 0, 0, 0#
 */
NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(@TYPE@_@kind@)
(char **args, npy_intp const *dimensions, npy_intp const *steps, void *NPY_UNUSED(func))
{
    if (IS_BINARY_REDUCE) {
#if @PW@
        @type@ * iop1 = (@type@ *)args[0];
        npy_intp n = dimensions[0];

        *iop1 @OP@= @TYPE@_pairwise_sum(args[1], n, steps[1]);
#else
        BINARY_REDUCE_LOOP(@type@) {
            io1 @OP@= *(@type@ *)ip2;
        }
        *((@type@ *)iop1) = io1;
#endif
    }
    else if (!run_binary_simd_@kind@_@TYPE@(args, dimensions, steps)) {
        BINARY_LOOP {
            const @type@ in1 = *(@type@ *)ip1;
            const @type@ in2 = *(@type@ *)ip2;
            *((@type@ *)op1) = in1 @OP@ in2;
        }
    }
}
/**end repeat1**/
/**end repeat**/