
Funneling multi-gigabytes of binary data into Mongo via Node.js

The topography of Mars

Digesting the topography of Mars burst the process memory limit on my iMac. It took a lot of refactoring to get the data from the NASA-supplied binary files into Mongo via Node. Even the usual Node conventions of processing data as it streams in and leaning on the event loop failed under the sheer size of the data.

I would emphasize that this is NOT the kind of experience that a standard web server using Node/Mongo delivers! For one thing, they usually run on two separate servers and therefore don't fight each other for RAM. For another, I am shoving gigabytes of data through the pipe during a single script, which, again, isn't that common for a web server.

Here are a few of the strategies I have used to manage the data. 

Process the data in several passes

The data files are 5k x 11k batches of signed integers. I wanted to save a record for each 128 x 128 chunk of ints as a "tile". Each batch of data represented 1/16th of a Cartesian grid of Mars, and each degree of data was mapped to 128 x 128 points of measurement. This meant I was chopping 3,960 tiles of data out of a binary chunk of 64,880,640 measurements; at two bytes each, each data file starts out at about 129.8 MB - and when expanded and annotated into JSON, they get considerably bigger. *
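
As a quick sanity check on those figures, here is a small sketch of the arithmetic. The exact grid dimensions (5632 x 11520) are an assumption: they are not stated above, but they are the dimensions that reproduce both the measurement count and the tile count quoted in the text.

```javascript
// Assumed dimensions: not given in the post, chosen because
// 5632 x 11520 = 64,880,640 samples, matching the figure in the text.
const ROWS = 5632;
const COLS = 11520;
const TILE = 128;            // samples per degree, per the text
const BYTES_PER_SAMPLE = 2;  // two-byte signed integers, per the text

const samples = ROWS * COLS;
const tilesDown = ROWS / TILE;
const tilesAcross = COLS / TILE;
const tileCount = tilesDown * tilesAcross;
const fileBytes = samples * BYTES_PER_SAMPLE;

console.log({ samples, tilesDown, tilesAcross, tileCount, fileBytes });
// → { samples: 64880640, tilesDown: 44, tilesAcross: 90, tileCount: 3960, fileBytes: 129761280 }
```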

For reference, the stored format looks like this:

{
  _id: ObjectID(), 
  image: ObjectID, 
  img_ref: DocRef(...),
  heights: [[1141,1102, 1115...], [1122, 1114...]], 
  tile_i: 2,
  tile_j: 4,
  w: 33,
  e: 32,
  n: 50, 
  s: 49
}

The image collection stores the annotation data that each image file was given by the MOLA team; it's how I tell which quadrant the image file represents, how many rows and columns are in that image, and the northern, eastern, western and southern extent of the data.
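
To make the tile format concrete, here is a minimal sketch of cutting one tile out of a raw buffer. The helper name `extractTile` is hypothetical, and two things are assumptions rather than facts from this post: that samples are stored big-endian in row-major order, and that `tile_i` counts tile rows while `tile_j` counts tile columns. The `image`, `w`, `e`, `n` and `s` fields are left out because they come from the annotation data.

```javascript
// Hypothetical helper: cut one tile x tile block of heights out of a Buffer
// holding a row-major grid of signed 16-bit integers.
// Big-endian byte order is an assumption; check the label of the data file.
function extractTile(buf, cols, tileI, tileJ, tile = 128) {
  const heights = [];
  for (let r = 0; r < tile; r++) {
    const row = new Array(tile);
    // Byte offset of the first sample of this tile row.
    const rowStart = ((tileI * tile + r) * cols + tileJ * tile) * 2;
    for (let c = 0; c < tile; c++) {
      row[c] = buf.readInt16BE(rowStart + c * 2);
    }
    heights.push(row);
  }
  return { tile_i: tileI, tile_j: tileJ, heights };
}

// Tiny demonstration grid: 4 rows x 6 columns, cut into 2 x 2 tiles.
// Each sample is row * 100 + column so the result is easy to read.
const cols = 6;
const grid = Buffer.alloc(4 * cols * 2);
for (let r = 0; r < 4; r++) {
  for (let c = 0; c < cols; c++) {
    grid.writeInt16BE(r * 100 + c, (r * cols + c) * 2);
  }
}

const doc = extractTile(grid, cols, 1, 2, 2);
console.log(JSON.stringify(doc));
// → {"tile_i":1,"tile_j":2,"heights":[[204,205],[304,305]]}
```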

Even when I made intelligent use of the stream, something in the write-to-Mongo pipe seemed to burst my app. This might be a consequence of memory leaks in the database adapters, but for whatever reason the data just was not being saved before out-of-memory errors occurred. (This was, by the way, true even after I put 16GB of memory into the system.)

So instead of writing the 128 x 128 chunks directly, I first saved whole rows of data to a separate collection, that is, 128 x 11k strips of data. This was slow going, but done properly it gave a fallback for smaller tasks. That way, when I did the next pass, no more than 128/5k (a little over 2%) of the file was in play at any given time.
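
A minimal sketch of the first pass, assuming the source is a flat row-major file of two-byte samples: read one 128-row band at a time by byte offset, so that only that band is ever held in memory. The helper name `readBand` is hypothetical and this is not the code I actually ran.

```javascript
const fs = require('fs');

// Hypothetical helper: read one horizontal band of `rowsPerBand` rows from a
// row-major file of 2-byte samples, without loading the rest of the file.
function readBand(fd, bandIndex, cols, rowsPerBand) {
  const bandBytes = rowsPerBand * cols * 2;
  const buf = Buffer.alloc(bandBytes);
  const position = bandIndex * bandBytes;
  // fs.readSync(fd, buffer, offset, length, position) returns the bytes read.
  const bytesRead = fs.readSync(fd, buf, 0, bandBytes, position);
  // The last band may be short if the file does not divide evenly.
  return buf.subarray(0, bytesRead);
}
```

With a file descriptor from `fs.openSync(path, 'r')`, calling `readBand(fd, k, 11520, 128)` for each band index `k` would walk the file one strip at a time (the column count here is the assumed one, not a value given in this post).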

Use nextTick to delay execution

This does two things: 

  1. It gives garbage collection/memory management time to balance out the system
  2. It gives Mongo time to fsync the data from memory, possibly freeing RAM - or at least ensuring that more RAM is marked as recoverable, as it has been saved to disk
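
Here is a minimal sketch of the deferral idea on its own, separate from the class below. Note the swapped-in technique: in current Node versions, `process.nextTick` callbacks run before the event loop moves on to pending I/O, so this sketch uses `setImmediate` to yield between steps instead. The helper name `eachDeferred` is hypothetical.

```javascript
// Hypothetical helper: run `step` once per item, yielding to the event loop
// between items so pending I/O callbacks get a chance to run.
function eachDeferred(items, step, done) {
  let i = 0;
  function next(err) {
    if (err || i >= items.length) {
      return done(err || null);
    }
    const item = items[i++];
    // setImmediate queues the step for the next turn of the event loop.
    setImmediate(function () {
      step(item, next);
    });
  }
  next();
}

// Usage: each step signals completion by calling its callback.
eachDeferred(['row 0', 'row 1', 'row 2'], function (row, cb) {
  console.log('processing', row);
  cb();
}, function (err) {
  console.log(err ? 'failed' : 'all rows processed');
});
```

Whether the pause actually helps garbage collection or gives Mongo room to flush is something I can only report from observation, not prove.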

I created a Pipe() class to handle timeout-driven tasks; that is, to serialize asynchronous tasks and handle them as events:

/**
 * Pipe calls the same function with different parameters;
 * it allows for asynchronous but single threaded activity.
 *
 * action has a profile:
 *   action(param_array_element, this.static_params, this._act_done_callback, this._pipe_done_callback);
 *
 *   note that the last two parameters are functions PRODUCED by local functions.
 */

function Pipe(callback, action, freq, param_array, static_params) {
    this.callback = callback;

    if (typeof action != 'function') {
        throw new Error(__filename + ': non function passed as action');
    }

    if (typeof callback != 'function') {
        throw new Error(__filename + ': non function passed as callback');
    }

    // Note: freq is not used by this version of Pipe.
    this.action = action;
    this.param_array = param_array ? param_array : [];
    // Read the length from this.param_array so that omitting param_array does not throw.
    this.stop_on_end_of_param_array = this.param_array.length > 0;
    this.static_params = static_params ? static_params : false;
}

module.exports = Pipe;

Pipe.prototype = {


    start: function() {
        var self = this;
        this._pipe_done_callback = function() {
            self.finish();
        };
        this._act_done_callback = function() {
            self.act();
        };
        this.act();
    },

    check_pipe: function() {
        if (this.idle) {
            this.idle = false;
            //     console.log('acting');
            this.act();
        }
    },

    /**
     * Pipe allows for an array of changing parameters, passed in param_array
     * -- however they are optional.
     * The param_array running out of params
     * doesn't trigger finish unless stop_on_end_of_param_array is truthy
     * (it is set when a non-empty param_array is supplied).
     * If not, then stopping the loop has to be done inside action
     *  by calling the pipe-done callback; this is why that callback
     *  is passed as the last parameter of action.
     */

    act: function() {
        // Do nothing once finish() has been called.
        if (this.done) {
            return;
        }

        var params = false;
        if (this.param_array.length) {
            params = this.param_array.shift();
        } else if (this.stop_on_end_of_param_array) {
            console.log('out of params - finishing');
            return this.finish();
        }

        // console.log('acting on ', params, this.static_params);
        this.action(params, this.static_params, this._act_done_callback, this._pipe_done_callback);
    },

    _act_done_callback: false,
    _pipe_done_callback: false,

    /**
     * Finish will prevent further functions from being launched.
     * It won't abort the execution of a function that is already executing.
     */

    finish: function() {
        // Mark the pipe as done before notifying, so the callback sees the final state.
        this.done = true;
        this.callback(this);
    }

}

I used Pipe to save one row of the data files at a time. It required careful use of the API I'd created, but it slowed down and managed long processes well. Note: the earlier version of this method used timeouts. Putting tasks into the Node event loop makes tasks MUCH more reliable! See Here for the why.
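
To show the action profile in use, here is a rough sketch. So that it runs on its own, it includes a condensed stand-in for Pipe with the same constructor arguments and action profile; the stand-in, the `saveRow` action, and the placement of `process.nextTick` inside the act-done callback are all illustrative assumptions, not the production code.

```javascript
// Condensed stand-in for the Pipe class above (same constructor arguments and
// action profile) so the sketch runs on its own; the real class has more checks.
function Pipe(callback, action, freq, param_array, static_params) {
  this.callback = callback;
  this.action = action;
  this.param_array = param_array || [];
  this.static_params = static_params || false;
}

Pipe.prototype.start = function () {
  const self = this;
  // Assumption: defer the next act() so each row's work starts on a fresh tick.
  this._act_done_callback = function () {
    process.nextTick(function () { self.act(); });
  };
  this._pipe_done_callback = function () { self.callback(self); };
  this.act();
};

Pipe.prototype.act = function () {
  if (!this.param_array.length) {
    return this._pipe_done_callback();
  }
  this.action(this.param_array.shift(), this.static_params,
    this._act_done_callback, this._pipe_done_callback);
};

// Hypothetical action: "save" one row, then signal that the next may start.
const saved = [];
function saveRow(rowIndex, statics, actDone, pipeDone) {
  // A real action would write to Mongo here and call actDone in its callback.
  saved.push(statics.file + ':' + rowIndex);
  actDone();
}

const pipe = new Pipe(
  function () { console.log('saved', saved.length, 'rows'); },
  saveRow,
  0,                       // freq is unused
  [0, 1, 2, 3],            // one entry per row to save
  { file: 'example.img' }  // placeholder file name
);
pipe.start();
```

The point of the profile is that the action decides when the next call may start: nothing else runs until it calls the act-done callback, which is what serializes the writes.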

Batch insert smaller records into Mongo

It seems to create fewer problems to batch insert whole chunks of data at once, as long as the records are small. That is, I didn't want to batch insert the 128 x 11k rows, but I did want to batch insert the smaller tiles I'd broken them into. Even if this doesn't stop crashes from happening, at least it lets you make incremental progress before they happen. Also, though this is not a scientific observation, when memory gets low the Node event chain seems to slow down, so batch inserts mean fewer events and, by implication, fewer delays between events.
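
A minimal sketch of batching, assuming a collection object that exposes `insertMany` as the current official MongoDB Node.js driver does (the driver I used at the time is not shown here, so treat the call as an assumption). The helper name `insertInBatches` and the stand-in collection are hypothetical; the stand-in only exists so the sketch runs without a database.

```javascript
// Hypothetical helper: insert documents in fixed-size batches, one batch at a
// time, so only one write is in flight at any moment.
async function insertInBatches(collection, docs, batchSize) {
  let inserted = 0;
  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);
    await collection.insertMany(batch);
    inserted += batch.length;
  }
  return inserted;
}

// Stand-in for a driver collection so the sketch runs without a database.
const fakeCollection = {
  calls: [],
  async insertMany(batch) {
    this.calls.push(batch.length);
  }
};

// Ten small placeholder tile records, inserted four at a time.
const tiles = Array.from({ length: 10 }, function (_, j) {
  return { tile_i: 0, tile_j: j };
});

insertInBatches(fakeCollection, tiles, 4).then(function (count) {
  console.log('inserted', count, 'in batches of', fakeCollection.calls);
});
```

Because each batch is awaited before the next is sent, a crash partway through still leaves the earlier batches saved.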

Try not having Mongo and Node on the same server

This is not that easy to do in a desktop scenario. However, having Mongo on its own server ensures that Mongo's memory hogging doesn't bump heads with Node's, or create weird race conditions. When you are duplicating large chunks of data, you are likely to see size multiplication as you generate it, send it, journal it, and in some cases, send it back to the callback; in an asynchronous write context, this could seriously cascade, which is why I wrote Pipe to serialize the writes.

Keep in mind that the fundamental dilemma is that I am writing large chunks of data to a system whose normal use case is sending a few KB of data at a time. Alternate stores, including file dumps, are the reasonable fallback, especially if the data is intermediate data. (For instance, I never mongo-ize the original data files from Mars - only smaller subsets for 1 degree x 1 degree regions.)

* I am still considering whether it makes more sense to store them as BSON binary data or as smaller binary chunks, in GridFS or in the file system; for now, and for usability's sake, I am storing them in nested arrays.
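
For comparison, here is a rough sketch of the binary alternative: packing a nested array of heights into a Buffer of two-byte integers and unpacking it again. The helper names and the little-endian layout are assumptions made for illustration, not a format I have settled on. A full 128 x 128 tile packs into 32,768 bytes this way.

```javascript
// Hypothetical helper: pack a nested array of heights into a Buffer of
// signed 16-bit integers (little-endian is an arbitrary choice here).
function packHeights(heights) {
  const rows = heights.length;
  const cols = rows ? heights[0].length : 0;
  const data = Buffer.alloc(rows * cols * 2);
  let offset = 0;
  for (const row of heights) {
    for (const h of row) {
      // writeInt16LE returns the offset just past the bytes it wrote.
      offset = data.writeInt16LE(h, offset);
    }
  }
  return { rows, cols, data };
}

// Hypothetical helper: rebuild the nested array from the packed form.
function unpackHeights(packed) {
  const heights = [];
  for (let r = 0; r < packed.rows; r++) {
    const row = [];
    for (let c = 0; c < packed.cols; c++) {
      row.push(packed.data.readInt16LE((r * packed.cols + c) * 2));
    }
    heights.push(row);
  }
  return heights;
}

const sample = [[1141, 1102, 1115], [1122, 1114, -37]];
const packed = packHeights(sample);
console.log('binary bytes:', packed.data.length,
  'JSON characters:', JSON.stringify(sample).length);
```

The trade-off is the one noted above: the packed form is smaller, but the heights are no longer readable or queryable as plain arrays.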
