stranded boat

Conditionally Retry a Failed Job

Sound Familiar?

We have jobs that run in the early morning to load data to our data warehouses and other systems, and if they don’t complete correctly, we start to have issues that ripple through the day.

In the past, we had a service provider that would monitor our jobs and rerun them if needed, depending on specific criteria given to them. The criteria were pretty simple:

1) If the job was labeled “level 1 – ” then restart the job from the first step
2) If the job was labeled “level 2 – ” then the job was to be restarted from the step that failed
3) Any other failure notify us, and we will take of it.

After a while, we started to notice that the jobs were not being rerun promptly. Sometimes it would be 20 minutes after a failure, others times 6 hours later, or worst case, not rerun at all. This lack of rerunning or notification became an issue as the day progressed because our team would get attacked when data that was needed, expected even, wasn’t there on time.

We had to do something, and this process evolved out of the need to keep our users and customers happier. Does it work 100% of the time? No. But it does work a vast majority of the time. It has certainly helped us with keeping our load processes running and making our nights and early mornings much easier to manage.

The code

Ok, here’s the beef of the code. This code is called by a process detailed in an earlier post I made about better error messages. In there I hinted at automatic retries, and here it is. You don’t have to use the error message code to make this work, but I found they worked together quite nicely.

Because this process reruns jobs I found that the MSDB database was a good place to store it.

I know, the recommendation is not to store custom code in system databases, but in this scenario it makes sense. You can always store it in your own DBA database if you like.

What’s it do?

Yeah, that’s a fair chunk of code. If you didn’t read through the code comments yet, here’s the basic breakdown of what it is doing.

When the job fails, and the alert message compiled, this procedure gets called and the job name, step name, a delay value are passed to it. There is also a retry flag that comes back fro this procedure.

The first thing this procedure does is go and find the last failed step for the particular job. It then counts and based on the @retry value verifies if a retry job has already been created. This is in case some other process tries to do this same thing and should help prevent too many retries from firing off.
If a retry job does not exist, this process creates a new disposable job that will rerun the original from the beginning or the step that failed based on the checking for “Level 1” or “Level 2” in the job name. The job is prefixed with ‘Retry -‘ so it can be found easily in your server’s job list.
If a delay is specified, 2 minutes in this example, then it calculates a new run time for the retry job and finally creates the job.

So what you have is a lot of wordy code to build a new job scheduled with a delay to rerun the job that failed.
The reason I like this approach is that the job history of the original job is preserved. It doesn’t have weird things stuck in it, or a duplicate of the job created and rerun.

Try it out, fiddle with it, let me know how to make it better! (I know it can be done better!).

P.S.

For reference the way I modified the procedure for the job alerts to then perform the auto retry was to replace the code after the cursor in the original alert code is deallocated with the following:

Leave a Reply

Your email address will not be published. Required fields are marked *