Conditionally Retry a Failed Job

Sound Familiar?

We have jobs that run in the early morning to load data to our data warehouses and other systems, and if they don’t complete correctly, we start to have issues that ripple through the day.

In the past, we had a service provider that would monitor our jobs and rerun them if needed, depending on specific criteria given to them. The criteria were pretty simple:

1) If the job was labeled “level 1 – ” then restart the job from the first step
2) If the job was labeled “level 2 – ” then the job was to be restarted from the step that failed
3) Any other failure notify us, and we will take of it.

After a while, we started to notice that the jobs were not being rerun promptly. Sometimes it would be 20 minutes after a failure, others times 6 hours later, or worst case, not rerun at all. This lack of rerunning or notification became an issue as the day progressed because our team would get attacked when data that was needed, expected even, wasn’t there on time.

We had to do something, and this process evolved out of the need to keep our users and customers happier. Does it work 100% of the time? No. But it does work a vast majority of the time. It has certainly helped us with keeping our load processes running and making our nights and early mornings much easier to manage.

The code

Ok, here’s the beef of the code. This code is called by a process detailed in an earlier post I made about better error messages. In there I hinted at automatic retries, and here it is. You don’t have to use the error message code to make this work, but I found they worked together quite nicely.

Because this process reruns jobs I found that the MSDB database was a good place to store it.

I know, the recommendation is not to store custom code in system databases, but in this scenario it makes sense. You can always store it in your own DBA database if you like.

USE [msdb]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

ALTER PROCEDURE [dbo].[usp_SSIS_Retry_job]
    (
     @jobname sysname
    ,@stepname sysname
    ,@delay SMALLINT = 0
    ,@retry TINYINT OUTPUT
    )
AS
    DECLARE @job_id UNIQUEIDENTIFIER
       ,@tempjobname sysname
       ,@stepcmd NVARCHAR(4000)
       ,@runstatus TINYINT
       ,@laststepname sysname
       ,@laststatus TINYINT = 0
       ,@maxinstanceid INT
       ,@startdate INT
       ,@starttime INT;

/* things this procedure does

1 - Check if this step failed before. If so do not launch a retry again. If a different step failed, then launch a retry.
This logic allows the job to retry at least once, then if that step is fixed and another failed it can retry the new failed step.

2 - Check if a retry job for this job has already been generarted and if so do not create another.

3 - Create the retry job. If job starts with level 1 start the job again from the default start step. 
    If level 2 restart the job from the failed step.

*/
-- Get the last failed step for this job, that is not the job status (step_id 0)
    BEGIN
        SELECT  @maxinstanceid = MAX(sysjobhistory.instance_id)
        FROM    dbo.sysjobs
                INNER JOIN dbo.sysjobhistory ON sysjobhistory.job_id = sysjobs.job_id
        WHERE   sysjobs.name = @jobname
                AND step_id <> 0;

-- Get the prior failed step for this job
        SELECT TOP 1
                @laststepname = sjh1.step_name
               ,@laststatus = sjh1.run_status
        FROM    dbo.sysjobs sj1
                INNER JOIN dbo.sysjobhistory sjh1 ON sjh1.job_id = sj1.job_id
        WHERE   sj1.name = @jobname
                AND sjh1.instance_id < @maxinstanceid
                AND sjh1.step_id <> 0
        ORDER BY sjh1.instance_id DESC;

 -- check last step to fail and if matches current step for this job do not rerun. This prevents an infinite loop. 
 -- logging could be done to raise the number of retries, but in our situation 1 retry was plenty.
        IF ( @stepname = ISNULL(@laststepname, 'First Run')
             AND @laststatus = 0
           )
            BEGIN
                SELECT  @retry = 2;
            END;
        ELSE
            IF EXISTS ( SELECT  1
                        FROM    dbo.sysjobs sj
                        WHERE   sj.name LIKE 'Retry - ' + @jobname + '%' )
                BEGIN
                    SELECT  @retry = 0; -- dummy step for logic to process
                    PRINT 'retry job already exists - not creating another';
                    --        
                     
                END;
            ELSE
                BEGIN
                    SET @tempjobname = N'Retry - ' + @jobname + ' '
                        + CONVERT(NVARCHAR(25), GETDATE());
-- build the command to be run by the retry job. add step_name to re-start at for level 2 jobs
                    SET @stepcmd = N'exec msdb.dbo.sp_start_job @job_name = N'''
                        + @jobname
                        + CASE WHEN @jobname LIKE '%Level 2%'
                               THEN N''', @step_name = N''' + @stepname
                                    + N''''
                               ELSE N''''
                          END;
                   		
-- add custom retry job for the job that failed. Delete-job set to 1 to automatically remove the retry job once it has run successfully
                    EXEC msdb.dbo.sp_add_job @job_name = @tempjobname,
                        @enabled = 1,
                        @description = N'automated retry of a failed job',
                        @delete_level = 1, @job_id = @job_id OUTPUT;

-- add custom step to re-start prior job
                    DECLARE @step_uid UNIQUEIDENTIFIER;
                    EXEC msdb.dbo.sp_add_jobstep @job_id = @job_id,
                        @step_id = 1, @step_name = 'Rerun failed job',
                        @command = @stepcmd, @database_name = msdb,
                        @database_user_name = NULL,
                        @step_uid = @step_uid OUTPUT;
		
-- set teh job to run on the local (same) server
                    EXEC msdb.dbo.sp_add_jobserver @job_id = @job_id;

-- start the new job automatically & immediately
                    IF @delay = 0
                        BEGIN
                            EXEC msdb.dbo.sp_start_job @job_id = @job_id;
                        END;
                    ELSE 
-- add a schedule to start the new job with a short time delay (2 minutes)
                        DECLARE @reruntime DATETIME = DATEADD(MINUTE, @delay,
                                                              GETDATE());

                    SELECT  @startdate = REPLACE(CONVERT(VARCHAR(10), CONVERT(DATE, @reruntime)),
                                                 '-', '')
                           ,@starttime = CONVERT(VARCHAR(2), DATEPART(HOUR,
                                                              @reruntime))+
                            RIGHT('0' + CONVERT(VARCHAR(2), DATEPART(MINUTE, @reruntime)), 2)
            + RIGHT('0' + CONVERT(VARCHAR(2), DATEPART(SECOND, @reruntime)), 2);

                    DECLARE @schedule_id INT
                       ,@schedule_uid UNIQUEIDENTIFIER;
                    EXEC msdb.dbo.sp_add_jobschedule @job_id = @job_id,
                        @name = 'Retry with delay', @freq_type = 1, -- uniqueidentifier
                        @active_start_date = @startdate, -- int
                        @active_start_time = @starttime; -- int
			    
                    SELECT  @retry = 1;
                END;	
    END; 

What’s it do?

Yeah, that’s a fair chunk of code. If you didn’t read through the code comments yet, here’s the basic breakdown of what it is doing.

When the job fails, and the alert message compiled, this procedure gets called and the job name, step name, a delay value are passed to it. There is also a retry flag that comes back fro this procedure.

The first thing this procedure does is go and find the last failed step for the particular job. It then counts and based on the @retry value verifies if a retry job has already been created. This is in case some other process tries to do this same thing and should help prevent too many retries from firing off.
If a retry job does not exist, this process creates a new disposable job that will rerun the original from the beginning or the step that failed based on the checking for “Level 1” or “Level 2” in the job name. The job is prefixed with ‘Retry -‘ so it can be found easily in your server’s job list.
If a delay is specified, 2 minutes in this example, then it calculates a new run time for the retry job and finally creates the job.

So what you have is a lot of wordy code to build a new job scheduled with a delay to rerun the job that failed.
The reason I like this approach is that the job history of the original job is preserved. It doesn’t have weird things stuck in it, or a duplicate of the job created and rerun.

Try it out, fiddle with it, let me know how to make it better! (I know it can be done better!).

P.S.

For reference the way I modified the procedure for the job alerts to then perform the auto retry was to replace the code after the cursor in the original alert code is deallocated with the following:

 IF ( @jobname LIKE '%Level 1%'
         OR @jobname LIKE '%Level 2%'
       )
        BEGIN
	-- level 1 are restarted from failed step
	-- maybe create new job that removes itself after retry 5 minutes later?
            EXEC msdb.dbo.usp_SSIS_Retry_job @jobname, @stepname, @delay,
                @retry = @retrystat OUTPUT; 
			            
        END;
        -- Bonus - send notification to email when job is auto rescheduled.
    	-- set email slack destination based on job name
	 SET @Email_To = CASE WHEN @jobname LIKE 'Report%'
                         THEN '[email protected]' -- private slack channel for #ReportAlerts
                         ELSE '[email protected]' -- private slack channel for DBA Team
                    END; 
    SET @Email_From = @@servername + '@somewhere.com';

-- Send the Email
    IF ( RTRIM(@Body) <> '' )
        BEGIN 
            SET @subject = '' + @jobname + ' FAILED on \\'
                + @@servername;
            SET @Body = -- 'Server= ' + @@servername + @CrLf  +
                'Job_name = ' + @jobname + @CrLf
                + '--------------------------------------' + @CrLf + @Body
                + CASE @retrystat
                    WHEN 1
                    THEN @CrLf
                         + 'An automated retry has been scheduled for this job.'
                    WHEN 2
                    THEN @CrLf
                         + 'An automated retry of the job was attempted but failed. Manual intervention is required.'
                    ELSE ''
                  END;


            EXEC msdb.dbo.sp_send_dbmail @profile_name = 'SQLNotify',
                @recipients = @Email_To, @subject = @subject, @body = @Body;  
   
        END;