Better SQL Agent Job Alerts

Ever receive alerts from the SQL Agent?

 

They kinda suck. I mean it is nice that we have the option for the SQL Agent to tell us when a job fails or succeeds, but if you have ever looked at the message, there isn’t much detail in it.

Take a look at this (slightly edited) email I used to get from our systems:

From SQLServer01
To SQL ALert:
Date Dec 11th, 2016 at 2:46 AM
JOB RUN: ‘Job 001 – Load Sales’ was run on 12/11/2016 at 2:40:47 AM
DURATION: 0 hours, 5 minutes, 25 seconds
STATUS: Failed
MESSAGES: The job failed. The Job was invoked by User Domain/User. The last step to run was step 3 (Load Fact – Prod Sales). The job was requested to start at step 1 (Extract Fact – Prod Sales).

Not very helpful. Sure, I know the job failed, and what step it failed on, but now I have to connect to the agent and look up the history to determine if this is something I have to worry about.

It would be nice to receive the details seen in the history of the job showing up in the email alert received.

Recently I have been working with systems that had all the alert on failure configured so we knew when things failed and could jump on re-running them if needed. We even had them showing up into a data team slack channel, so we had a history as well as notification to everyone on the team at the same time. The problem is that there were not any details in the alerts we received so we had to be able to connect and figure out what to do next or hope that our paid monitoring service would act on something after reading the details of the failure.

It started 10 years ago

Way back in 2007, Jameel Ahmed wrote an article for SQLServerPro on how to get more detailed error messages from failed SQL Agent jobs. I downloaded his code and adapted it some to what I have below. I cleaned out some parts I didn’t need, and it worked like a charm!

USE [master];
GO
/****** Object:  StoredProcedure [dbo].[spDBA_job_notification]    Script Date: 2/3/2017 7:50:51 AM ******/
SET ANSI_NULLS ON;
GO
SET QUOTED_IDENTIFIER ON;
GO
CREATE PROCEDURE [dbo].[spDBA_job_notification]
    @jobid UNIQUEIDENTIFIER
AS /*********************************************************************************************************
 Purpose: SQL Job Agent does not send the error messages on failure, 
           so this procedure queries msdb to for the error message and sends an email.

    usage: EXEC spDBA_job_notification $(ESCAPE_NONE(JOBID))		SQL2005 SP1 + 
********************************************************************************************************/
    SET  NOCOUNT ON;
    DECLARE @Today DATETIME
       ,@CrLf VARCHAR(10)
       ,@stat_Failed TINYINT
       ,@stat_Succeeded TINYINT
       ,@stat_Retry TINYINT
       ,@stat_Canceled TINYINT
       ,@stat_In_progress TINYINT;
    DECLARE @Email_To NVARCHAR(100) = '[email protected]'
       ,@Email_From NVARCHAR(50)
       ,@subject VARCHAR(200)
       ,@Body VARCHAR(8000);
    DECLARE @jobname sysname
       ,@stepname sysname
       ,@Err_severity INT
       ,@run_datetime DATETIME
       ,@DBname sysname
       ,@command VARCHAR(3200)
       ,@ErrMessage VARCHAR(1024)
       ,@retry BIT = 0;
    SET @Body = '';
    SET @CrLf = CHAR(10);
--constants for Job Status execution: 
    SET @stat_Failed = 0;
    SET @stat_Succeeded = 1;
    SET @stat_Retry = 2;
    SET @stat_Canceled = 3; 
    SET @stat_In_progress = 4; 
    SET @Today = GETDATE();

    DECLARE curFailedJobs CURSOR READ_ONLY FAST_FORWARD
    FOR
        SELECT  sj.name
               ,sjh.step_name
               ,sjh.sql_severity
               ,sjs.database_name
               ,run_datetime = CONVERT(DATETIME, LEFT(run_date, 4) + '/'
                + SUBSTRING(run_date, 5, 2) + '/' + RIGHT(run_date, 2) + ' '
                + LEFT(run_time, 2) + ':' + SUBSTRING(run_time, 3, 2) + ':'
                + RIGHT(run_time, 2))
               ,sjs.command
               ,sjh.message
        FROM    msdb..sysjobs sj
                JOIN ( SELECT   instance_id
                               ,job_id
                               ,step_id
                               ,step_name
                               ,sql_message_id
                               ,sql_severity
                               ,message
                               ,run_status
                               ,run_duration
                               ,operator_id_emailed
                               ,operator_id_netsent
                               ,operator_id_paged
                               ,retries_attempted
                               ,server
                               ,run_date = CONVERT(VARCHAR(8), run_date)
                               ,run_time = CASE WHEN LEN(CONVERT(VARCHAR(8), run_time)) = 5
                                                THEN '0'
                                                     + CONVERT(VARCHAR(8), run_time)
                                                ELSE CONVERT(VARCHAR(8), run_time)
                                           END
                       FROM     msdb..sysjobhistory
                     ) sjh ON sj.job_id = sjh.job_id
                JOIN msdb..sysjobsteps sjs ON sjs.job_id = sjh.job_id
                                              AND sjs.step_id = sjh.step_id 
	
			-- sjh_Min contains the most recent instance_id (an identity column) from where we should start checking for any failed status records.
                JOIN (  
			-- to account for when there is are multiple log history
                       SELECT   job_id
                               ,instance_id = MAX(instance_id)
                       FROM     msdb..sysjobhistory
                       WHERE    job_id = @jobid
                                AND step_id = 0
                       GROUP BY job_id
                       UNION  
           -- to account for when you run the job for the first time, there is no history, there will not be any records where the step_id=0.
                       SELECT   job_id
                               ,instance_id = MIN(instance_id)
                       FROM     msdb..sysjobhistory
                       WHERE    job_id = @jobid
                                AND NOT EXISTS ( SELECT *
                                                 FROM   msdb..sysjobhistory
                                                 WHERE  job_id = @jobid
                                                        AND step_id = 0 )
                       GROUP BY job_id
                     ) sjh_Min ON sjh_Min.job_id = sj.job_id
                                  AND sjh.instance_id > sjh_Min.instance_id -- we only want the most recent error message(s).
        WHERE   sj.job_id = @jobid
                AND sjh.step_id <> 0					 --exclude the job outcome step
                AND sjh.run_status IN ( @stat_Failed )  --filter for only failed status
        ORDER BY sjh.instance_id;

    OPEN curFailedJobs;
    FETCH NEXT FROM curFailedJobs INTO @jobname, @stepname, @Err_severity,
        @DBname, @run_datetime, @command, @ErrMessage; 
    WHILE @@fetch_status = 0
        BEGIN
	   -- Build the Email Body
            SET @Body = @Body + 'Step name= ' + @stepname + @CrLf
                + 'DB Name  = ' + CONVERT(VARCHAR(50), ISNULL(@DBname, ''))
                + @CrLf + 'Run Date = ' + CONVERT(VARCHAR(50), @run_datetime)
                + @CrLf; 
       			 
            IF ( @Err_severity <> 0 )
                SET @Body = @Body + 'Severity = '
                    + CONVERT(VARCHAR(10), @Err_severity) + @CrLf; 

            SET @Body = @Body + 'Error    = ' + ISNULL(@ErrMessage, '')
                + @CrLf + @CrLf + 'Command  = ' + ISNULL(@command, '') + @CrLf;  

            FETCH NEXT FROM curFailedJobs INTO @jobname, @stepname,
                @Err_severity, @DBname, @run_datetime, @command, @ErrMessage; 
        END;
    CLOSE curFailedJobs;
    DEALLOCATE curFailedJobs;

-- Send the Email
    IF ( RTRIM(@Body) <> '' )
        BEGIN 
            SET @subject = 'ALERT: ' + @jobname + ' FAILED on \\'
                + @@servername;
            SET @Body = -- 'Server= ' + @@servername + @CrLf  +
                'Job_name = ' + @jobname + @CrLf
                + '--------------------------------------' + @CrLf + @Body;


            EXEC msdb.dbo.sp_send_dbmail @profile_name = 'DBMAILPROFILE',
                @recipients = @Email_To, @subject = @subject, @body = @Body;  
   
        END;
 

Once the code is in place, a slight change to the scheduled jobs needs to happen.
A new step has to be added to call this procedure, and all previous steps configured to call the final step on job failure.

What you have to do

This script uses the running job’s ID to pull info from the system databases to relay the detailed information.
When I was testing this by passing in values and running the procedure, it didn’t give any results, unless that job was actively running.
To make sure this worked nicely the following is how it is setup in the job as the final step of the job.

Report Error

It is just a simple call to execute the stored procedure which I placed in the master database on this server. There are two parameters passed into the procedure. The first, $(ESCAPE_NONE(JOBID)), captures the current job ID that is called the procedure so the procedure can look up the error messages. The second parameter is something I have in my system for an upgraded proc. I’ll cover that in more depth in a later post. For now, you can just leave the ‘,2’ off completely.

The other thing I did for this new alert step was to configure it to fail on both success and failure. I did this so that when I look at the agent job activity monitor, I will still get a visual that the job failed. There isn’t much worse than hiding a failure behind a fake success message (sometimes when dealing with AAGs and jobs it works, but that’s a different post).

 

both to fail

Then I went back to the job’s various steps and set their failure action to call the final step rather than just fail.

That’s it. Even though a little work is required to implement this method, it is a relatively straightforward and helpful solution.

Now I get a far more detailed alert that I can evaluate more efficiently and take actions on without having to dig through the Agent history for the job.

 

From SQLServer01
To SQL ALert:
Date Dec 11th at 2:49 AM
Job_name = Job 001 – Load Sales
————————————–
Step name = Load Fact – Prod Sales
DB Name = master
Run Date = Dec 11 2016 2:49AM
Error = Executed as user: Domain\User. Microsoft (R) SQL Server Execute Package Utility Version 12.0.5000.0 for 64-bit Copyright (C) Microsoft Corporation. All rights reserved. Started: 2:48:53 AM Error: 2016-02-15 02:49:54.07 Code: 0xC001000E Source: Load Fact – Prod Sales
Description: The connection “Connection 1” is not found. This error is thrown by Connections collection when the specific connection element is not found. End Error Error: 2016-02-15 02:49:54.07 Code: 0xC001000E Source: Load Fact – Prod Sales Description: The connection “Connection 2” is not found. This error is thrown by Connections collection when the specific connection element is not found. End Error Error: 2016-02-15 02:49:54.95 Code: 0x00000000 Source: Get Staging Dates Description: Invalid column name ‘modified’. End Error Error: 2016-02-15 02:49:54.95 Code: 0xC002F210 Source: Get Staging Dates Execute SQL Task Description: Executing the query “select convert(smalldatetime, convert(char(10), mi…” failed with the following error: “Invalid column name ‘modified’.”. Possible failure reasons: Problems with the query, “ResultSet” property not set correctly, parameters not set correctly, or connection not established correctly. End Error DTExec: The package execution returned DTSER_FAILURE (1). Started: 2:49:53 AM Finished: 2:49:54 PA Elapsed: 1.625 seconds. The package execution failed. The step failed.

Command = /FILE “\”\\placewherewestoreSSISpackages\SSIS\DW\ProdSales\Load Fact – PROD SALES.dtsx\”” /CONFIGFILE “\”C:\SSIS\DW Extract Configuration.dtsConfig\”” /MAXCONCURRENT ” -1 ” /CHECKPOINTING OFF /SET “\”\package.variables[RunbyJobName].Value\””;”\”Load Prod Sales\”” /REPORTING E

An automated retry has been scheduled for this job.

I even get nifty other alerts, like the actual command that executed and that an automate retry has been scheduled.

Wait! An automatic retry?

Yeah, you read that right. I have elaborated on this alert message so that, based on certain criteria configured within the job, an automated retry can fire off. That second parameter above that I said to ignore. Yeah, that’s my delay in minuted between retries.

For now, I want you to have the ability to get a better error message from your failed jobs. As far as retrying jobs, well that’s for another post (soon, I promise).

Also, it wouldn’t be terribly hard to create a script to add this to all your scheduled jobs by adding the new alert step and then adjust all the other steps to call it on failure. I will probably get around to that eventually, but if you do it before me, let me know!