That which cannot be rendered in binary is by definition a delusion
 


Long Task Monitoring and execution

I've been given a task that requires digesting large sets of data over time. In response to this challenge I have come up with a pattern that seems to be working, so I thought I'd pass it on in case it benefits others with a similar chore.

Order of Execution


The approach I've taken can be broken into a series of steps:

  1. Create a table for projects to be operated on. For the purpose of abstraction, let's call this table Projects.
  2. Create an order of operations for completing projects.
  3. Create a task table for tracking pending, working or complete tasks relevant to a given project.
  4. Optional, but in my experience very necessary: create a thread table that tracks ongoing cron jobs, to avoid having too many cron jobs open at once.
  5. Add projects to your Projects table, and create the first task for each project in the task table to begin processing the project.
  6. Add a cron job that executes tasks.

Projects

A project is any unit of work that can be completed independently of the degree of completion of any other project. Some good examples:

  • Reading and digesting a web page
  • Analyzing a set of numbers and creating an aggregate report (graph/xml/whatever).
  • Finding a specific bit of information in a filesystem (one or more files containing a given word, or of a certain type, filename, whatever.)

In general, a good project for long term digestion is one in which every piece of necessary information is present, and will not change unless your tasks (see below) are doing the changing. Work on one project shouldn't bleed over to another project -- if it does, both projects should be considered elements of a single project.

My project was digesting incoming mail. A project is created each time a piece of mail hits the server. Several steps had to be done:

  • The headers had to be found
  • The mail content had to be split into parts (if a multipart mail)
  • Links in the mail had to be found and recorded
  • The IPs of the senders and targets had to be geolocated
  • The text of the mail had to be keyword indexed (in SOLR).

Order of Operations

The order of operations determines which tasks need to be done in what order (e.g., once task "A" is done, you can do tasks "B" and "C"; once "C" is done, you can do tasks "E" and "F"). You can record this order of operations in any format you like -- table, JSON, YAML or whatever. I personally just created a JSON file to record mine.

Depending on the nature of your work, you might have different operations for different projects, based on their value, type, class, or some environmental conditions.

In the case of my personal project there was a fixed order of operations:

  1. The mail had to be stored on disk and indexed in an email database.
  2. The mail had to be split into headers and parts.
  3. Text parts had to be scanned for links.
  4. HTML tags and links had to be removed from the body.
  5. The body had to be indexed in SOLR
  6. The headers had to be digested for IPs
  7. Those IPs had to be geolocated.

Note that steps 2..5 all had to be done in order, but steps 6..7 could be done any time after step 2 was complete. That is, the state of the body didn't affect any work being done on the header elements, and vice versa.
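
A minimal sketch of how such an order of operations could be recorded and queried in Python. The actual JSON file is not shown in this post, so the layout (each task name mapped to the tasks it unlocks) and the task names below are assumptions made for illustration.

```python
import json

# Hypothetical order-of-operations file: each task maps to the tasks
# that become runnable once it is done. Task names are illustrative.
ORDER_JSON = """
{
  "store_mail":     ["split_parts"],
  "split_parts":    ["scan_links", "digest_headers"],
  "scan_links":     ["strip_html"],
  "strip_html":     ["index_body"],
  "index_body":     [],
  "digest_headers": ["geolocate_ips"],
  "geolocate_ips":  []
}
"""

ORDER = json.loads(ORDER_JSON)


def next_tasks(completed_task):
    """Return the tasks that can be created once completed_task is done."""
    return ORDER.get(completed_task, [])


def first_task():
    """Return the one task that no other task unlocks (the starting point)."""
    unlocked = {name for followers in ORDER.values() for name in followers}
    roots = [name for name in ORDER if name not in unlocked]
    if len(roots) != 1:
        raise ValueError("expected exactly one starting task")
    return roots[0]
```

Here finishing "split_parts" unlocks both the body branch and the header branch, which matches the note that the two branches do not depend on each other.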

Some other things to consider:

  • In general, it's more robust if each stage saves its work in new records/files/whatever; that way, if a project has to be reverted and re-executed, you don't have to work too hard to erase a task's artifacts.
  • Tasks should be relatively quick -- ideally you should be able to do a task in a second or so.

The table for recording tasks should include:

  • The project key
  • The name of the task to be done
  • The status, e.g., pending, working, done, error, etc.; this can be stored in string form or as an int key (0 = pending, 1 = working, 2 = done, -1 = error, etc.)
  • The creation time of the task record
  • Start and end time for task execution
  • Optionally, execution time (the difference of the above)
  • Optionally, a rank for how important a specific task is. That way the task execution loop can focus on the most important tasks first.
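
As a rough illustration, the fields above could map onto a table like the following. This is a sketch using SQLite; the table name, column names, and helper functions are hypothetical, and any database would do.

```python
import sqlite3
import time

# Integer status keys, following the example mapping in the list above.
STATUS_PENDING, STATUS_WORKING, STATUS_DONE, STATUS_ERROR = 0, 1, 2, -1


def create_task_table(conn):
    """Create a task table with the fields listed above (names are assumed)."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tasks (
            id         INTEGER PRIMARY KEY,
            project_id INTEGER NOT NULL,           -- the project key
            name       TEXT    NOT NULL,           -- the task to be done
            status     INTEGER NOT NULL DEFAULT 0, -- 0 = pending
            created_at REAL    NOT NULL,           -- creation time
            started_at REAL,                       -- execution start
            ended_at   REAL,                       -- execution end
            task_rank  INTEGER NOT NULL DEFAULT 0  -- optional importance
        )
    """)


def add_task(conn, project_id, name, task_rank=0):
    """Insert a pending task and return its row id."""
    cur = conn.execute(
        "INSERT INTO tasks (project_id, name, status, created_at, task_rank) "
        "VALUES (?, ?, ?, ?, ?)",
        (project_id, name, STATUS_PENDING, time.time(), task_rank),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
create_task_table(conn)
task_id = add_task(conn, project_id=1, name="store_mail", task_rank=5)
```

Execution time is left out as a column here, since it can be computed as ended_at - started_at when needed.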

Execution Cron

Some sort of repeating process or daemon needs to harvest and execute tasks from the task table.
The execution loop will look something like this:

  1. Grab a set of tasks that you think can be finished off before the next cron starts up.
  2. Mark each task as "working" in the beginning, so another loop doesn't try and grab them as well.
  3. Execute the tasks one at a time.
  4. For each task that is complete:
    1. Mark it as done.
    2. Create task records for the next task to be done, based on which task you just completed.
    3. Trap any errors that occur, note them in the task record, and mark the task as an error (possibly cleaning up its data at that point as well).
  5. Stop working when
    1. The cron has executed every task OR
    2. The allotted time for the cron has expired. If this happens, re-mark incomplete tasks as "pending".

Finally, define a crontab entry that runs your execution cron job.
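
The loop above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code used on the mail project: the SQLite schema, the ORDER mapping, the handlers dictionary, and the default batch size and time limit are all hypothetical.

```python
import sqlite3
import time

PENDING, WORKING, DONE, ERROR = 0, 1, 2, -1

# Hypothetical order of operations: finished task -> tasks it unlocks.
ORDER = {"A": ["B", "C"], "B": [], "C": []}

# Assumed minimal task table; a real one would carry more fields.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id         INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL,
    name       TEXT    NOT NULL,
    status     INTEGER NOT NULL DEFAULT 0,
    started_at REAL,
    ended_at   REAL
)
"""


def set_status(conn, task_ids, status):
    """Set the same status on a list of task ids."""
    conn.executemany("UPDATE tasks SET status = ? WHERE id = ?",
                     [(status, task_id) for task_id in task_ids])
    conn.commit()


def run_cron(conn, handlers, batch_size=10, time_limit=50.0):
    """Run one cron pass and return the number of tasks completed."""
    deadline = time.monotonic() + time_limit
    # Step 1: grab a batch of pending tasks.
    rows = conn.execute(
        "SELECT id, project_id, name FROM tasks "
        "WHERE status = ? ORDER BY id LIMIT ?",
        (PENDING, batch_size),
    ).fetchall()
    # Step 2: mark them all as working before doing any work.
    # Note: this select-then-update is not atomic on its own.
    set_status(conn, [row[0] for row in rows], WORKING)
    completed = 0
    for index, (task_id, project_id, name) in enumerate(rows):
        # Step 5: out of time, so hand the untouched tasks back.
        if time.monotonic() >= deadline:
            set_status(conn, [row[0] for row in rows[index:]], PENDING)
            break
        started = time.time()
        try:
            handlers[name](project_id)  # Step 3: execute the task.
            status = DONE
        except Exception:
            status = ERROR              # Step 4.3: trap and record errors.
        conn.execute(
            "UPDATE tasks SET status = ?, started_at = ?, ended_at = ? "
            "WHERE id = ?",
            (status, started, time.time(), task_id),
        )
        if status == DONE:
            completed += 1
            # Step 4.2: create the tasks unlocked by the one just finished.
            for follower in ORDER.get(name, []):
                conn.execute(
                    "INSERT INTO tasks (project_id, name, status) "
                    "VALUES (?, ?, ?)",
                    (project_id, follower, PENDING),
                )
        conn.commit()
    return completed
```

Each handler is assumed to be a plain function taking the project key. A task whose name has no handler raises inside the try block and is therefore recorded as an error rather than crashing the run.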

(Optional) Create a thread tracker

You may want to keep stats on your cron runs: how many tasks were executed in how much time. This can be done in a table for Threads. An entry in the thread table is created every time the cron starts, and closed at the end of the cron.

If your cron is grabbing more tasks than it has time to finish, you need to adjust the grab number. You can even do this dynamically by calculating the tasks per second based on your thread table.
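
One possible way to compute the grab number from past runs is sketched below. The function name, the safety factor, and the fallback default are assumptions for illustration; the input is assumed to be (tasks done, seconds spent) pairs read from closed rows of the thread table.

```python
def grab_size(thread_rows, seconds_available, safety=0.8, default=10):
    """Estimate how many tasks to grab for the next cron run.

    thread_rows: (tasks_done, seconds_spent) pairs from finished runs.
    seconds_available: time the next run has before it must stop.
    safety: fraction of the estimate to actually grab, leaving headroom.
    """
    rows = list(thread_rows)
    total_tasks = sum(tasks for tasks, _ in rows)
    total_seconds = sum(seconds for _, seconds in rows)
    if total_tasks == 0 or total_seconds <= 0:
        return default  # no usable history yet
    tasks_per_second = total_tasks / total_seconds
    # Always grab at least one task so the queue keeps moving.
    return max(1, int(tasks_per_second * seconds_available * safety))
```

Averaging over several recent runs, rather than only the last one, keeps a single slow run from swinging the grab number too far.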

The other benefit of a thread table is that if your cron jobs run long, you can limit the number of cron jobs running simultaneously. When you start your cron job, check the number of unclosed cron jobs, and if there are too many, stop the current cron job until the existing ones finish. You might want to tolerate at least a single unfinished thread, so that a cron job that is nearly done doesn't create an unnecessarily idle period in task execution.
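
A thread table with this kind of limit could look roughly like the sketch below. The table layout, function names, and the limit of two open threads (which tolerates one unfinished run) are assumptions for illustration.

```python
import sqlite3
import time

MAX_OPEN_THREADS = 2  # assumed limit: tolerate one unfinished run


def create_thread_table(conn):
    """Create a table with one row per cron run (column names are assumed)."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS threads (
            id         INTEGER PRIMARY KEY,
            started_at REAL    NOT NULL,
            ended_at   REAL,              -- NULL while the run is open
            tasks_done INTEGER NOT NULL DEFAULT 0
        )
    """)


def open_thread(conn, max_open=MAX_OPEN_THREADS):
    """Register a new run; return its id, or None if too many are open."""
    (open_count,) = conn.execute(
        "SELECT COUNT(*) FROM threads WHERE ended_at IS NULL"
    ).fetchone()
    if open_count >= max_open:
        return None  # caller should exit and let the next cron try again
    cur = conn.execute(
        "INSERT INTO threads (started_at) VALUES (?)", (time.time(),)
    )
    conn.commit()
    return cur.lastrowid


def close_thread(conn, thread_id, tasks_done):
    """Close a run and record how many tasks it completed."""
    conn.execute(
        "UPDATE threads SET ended_at = ?, tasks_done = ? WHERE id = ?",
        (time.time(), tasks_done, thread_id),
    )
    conn.commit()
```

A cron run would call open_thread first and exit immediately if it gets None back, then call close_thread with its task count when it finishes.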

Race Conditions

If each cron "claims" its tasks there should be very few collisions. However, to ensure no duplication of effort, you can hold all your work in memory during an execution cycle and, after a task is done, re-check its status to ensure it is still "working" before saving. You might also grab a unique ID for each cron (perhaps from the thread table) and stamp that ID on each task you claim; then, immediately before finishing a task, make sure the stamped cron ID hasn't changed.
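
A sketch of the stamping idea, assuming a SQLite task table with a hypothetical thread_id column. Note that it uses a conditional UPDATE and checks the affected row count, which is a substitution for the separate re-read of the status described above; it is one way to get the same guarantee in a single statement.

```python
import sqlite3

PENDING, WORKING, DONE = 0, 1, 2

# Assumed minimal task table with a column for the claiming cron's id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id        INTEGER PRIMARY KEY,
    name      TEXT    NOT NULL,
    status    INTEGER NOT NULL DEFAULT 0,
    thread_id INTEGER
)
"""


def claim_task(conn, task_id, thread_id):
    """Claim a pending task for this cron; return True if we got it."""
    cur = conn.execute(
        "UPDATE tasks SET status = ?, thread_id = ? "
        "WHERE id = ? AND status = ?",
        (WORKING, thread_id, task_id, PENDING),
    )
    conn.commit()
    # Zero rows changed means another cron claimed the task first.
    return cur.rowcount == 1


def finish_task(conn, task_id, thread_id):
    """Mark a task done only if it is still working under our id."""
    cur = conn.execute(
        "UPDATE tasks SET status = ? "
        "WHERE id = ? AND status = ? AND thread_id = ?",
        (DONE, task_id, WORKING, thread_id),
    )
    conn.commit()
    # Zero rows changed means the task was re-marked or re-claimed,
    # so the in-memory results should be discarded rather than saved.
    return cur.rowcount == 1
```

If finish_task returns False, the work held in memory for that task should be thrown away, since some other cron now owns it.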
