Ever receive alerts from the SQL Agent?
They kinda suck. I mean it is nice that we have the option for the SQL Agent to tell us when a job fails or succeeds, but if you have ever looked at the message, there isn’t much detail in it.
Take a look at this (slightly edited) email I used to get from our systems:
To SQL ALert:
Date Dec 11th, 2016 at 2:46 AM
JOB RUN: ‘Job 001 – Load Sales’ was run on 12/11/2016 at 2:40:47 AM
DURATION: 0 hours, 5 minutes, 25 seconds
STATUS: Failed
MESSAGES: The job failed. The Job was invoked by User Domain/User. The last step to run was step 3 (Load Fact – Prod Sales). The job was requested to start at step 1 (Extract Fact – Prod Sales).
Not very helpful. Sure, I know the job failed, and what step it failed on, but now I have to connect to the agent and look up the history to determine if this is something I have to worry about.
It would be nice to receive the details seen in the history of the job showing up in the email alert received.
Recently I have been working with systems that had all the alert on failure configured so we knew when things failed and could jump on re-running them if needed. We even had them showing up into a data team slack channel, so we had a history as well as notification to everyone on the team at the same time. The problem is that there were not any details in the alerts we received so we had to be able to connect and figure out what to do next or hope that our paid monitoring service would act on something after reading the details of the failure.
It started 10 years ago
Way back in 2007, Jameel Ahmed wrote an article for SQLServerPro on how to get more detailed error messages from failed SQL Agent jobs. I downloaded his code and adapted it some to what I have below. I cleaned out some parts I didn’t need, and it worked like a charm!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 | USE [master]; GO /****** Object: StoredProcedure [dbo].[spDBA_job_notification] Script Date: 2/3/2017 7:50:51 AM ******/ SET ANSI_NULLS ON; GO SET QUOTED_IDENTIFIER ON; GO CREATE PROCEDURE [dbo].[spDBA_job_notification] @jobid UNIQUEIDENTIFIER AS /********************************************************************************************************* Purpose: SQL Job Agent does not send the error messages on failure, so this procedure queries msdb to for the error message and sends an email. usage: EXEC spDBA_job_notification $(ESCAPE_NONE(JOBID)) SQL2005 SP1 + ********************************************************************************************************/ SET NOCOUNT ON; DECLARE @Today DATETIME ,@CrLf VARCHAR(10) ,@stat_Failed TINYINT ,@stat_Succeeded TINYINT ,@stat_Retry TINYINT ,@stat_Canceled TINYINT ,@stat_In_progress TINYINT; DECLARE @Email_To NVARCHAR(100) = 'You@youremailhost.ext' ,@Email_From NVARCHAR(50) ,@subject VARCHAR(200) ,@Body VARCHAR(8000); DECLARE @jobname sysname ,@stepname sysname ,@Err_severity INT ,@run_datetime DATETIME ,@DBname sysname ,@command VARCHAR(3200) ,@ErrMessage VARCHAR(1024) ,@retry BIT = 0; SET @Body = ''; SET @CrLf = CHAR(10); --constants for Job Status execution: SET @stat_Failed = 0; SET @stat_Succeeded = 1; SET @stat_Retry = 2; SET @stat_Canceled = 3; SET @stat_In_progress = 4; SET @Today = GETDATE(); DECLARE curFailedJobs CURSOR READ_ONLY FAST_FORWARD FOR SELECT sj.name ,sjh.step_name ,sjh.sql_severity ,sjs.database_name ,run_datetime = CONVERT(DATETIME, LEFT(run_date, 4) + '/' + SUBSTRING(run_date, 5, 2) + '/' + RIGHT(run_date, 2) + ' ' + LEFT(run_time, 2) + ':' + SUBSTRING(run_time, 3, 2) + ':' + RIGHT(run_time, 2)) ,sjs.command ,sjh.message FROM msdb..sysjobs sj JOIN ( SELECT instance_id ,job_id ,step_id ,step_name ,sql_message_id ,sql_severity ,message ,run_status ,run_duration ,operator_id_emailed ,operator_id_netsent ,operator_id_paged ,retries_attempted ,server ,run_date = CONVERT(VARCHAR(8), run_date) ,run_time = CASE WHEN LEN(CONVERT(VARCHAR(8), run_time)) = 5 THEN '0' + CONVERT(VARCHAR(8), run_time) ELSE CONVERT(VARCHAR(8), run_time) END FROM msdb..sysjobhistory ) sjh ON sj.job_id = sjh.job_id JOIN msdb..sysjobsteps sjs ON sjs.job_id = sjh.job_id AND sjs.step_id = sjh.step_id -- sjh_Min contains the most recent instance_id (an identity column) from where we should start checking for any failed status records. JOIN ( -- to account for when there is are multiple log history SELECT job_id ,instance_id = MAX(instance_id) FROM msdb..sysjobhistory WHERE job_id = @jobid AND step_id = 0 GROUP BY job_id UNION -- to account for when you run the job for the first time, there is no history, there will not be any records where the step_id=0. SELECT job_id ,instance_id = MIN(instance_id) FROM msdb..sysjobhistory WHERE job_id = @jobid AND NOT EXISTS ( SELECT * FROM msdb..sysjobhistory WHERE job_id = @jobid AND step_id = 0 ) GROUP BY job_id ) sjh_Min ON sjh_Min.job_id = sj.job_id AND sjh.instance_id > sjh_Min.instance_id -- we only want the most recent error message(s). WHERE sj.job_id = @jobid AND sjh.step_id <> 0 --exclude the job outcome step AND sjh.run_status IN ( @stat_Failed ) --filter for only failed status ORDER BY sjh.instance_id; OPEN curFailedJobs; FETCH NEXT FROM curFailedJobs INTO @jobname, @stepname, @Err_severity, @DBname, @run_datetime, @command, @ErrMessage; WHILE @@fetch_status = 0 BEGIN -- Build the Email Body SET @Body = @Body + 'Step name= ' + @stepname + @CrLf + 'DB Name = ' + CONVERT(VARCHAR(50), ISNULL(@DBname, '')) + @CrLf + 'Run Date = ' + CONVERT(VARCHAR(50), @run_datetime) + @CrLf; IF ( @Err_severity <> 0 ) SET @Body = @Body + 'Severity = ' + CONVERT(VARCHAR(10), @Err_severity) + @CrLf; SET @Body = @Body + 'Error = ' + ISNULL(@ErrMessage, '') + @CrLf + @CrLf + 'Command = ' + ISNULL(@command, '') + @CrLf; FETCH NEXT FROM curFailedJobs INTO @jobname, @stepname, @Err_severity, @DBname, @run_datetime, @command, @ErrMessage; END; CLOSE curFailedJobs; DEALLOCATE curFailedJobs; -- Send the Email IF ( RTRIM(@Body) <> '' ) BEGIN SET @subject = 'ALERT: ' + @jobname + ' FAILED on \\' + @@servername; SET @Body = -- 'Server= ' + @@servername + @CrLf + 'Job_name = ' + @jobname + @CrLf + '--------------------------------------' + @CrLf + @Body; EXEC msdb.dbo.sp_send_dbmail @profile_name = 'DBMAILPROFILE', @recipients = @Email_To, @subject = @subject, @body = @Body; END; |
Once the code is in place, a slight change to the scheduled jobs needs to happen.
A new step has to be added to call this procedure, and all previous steps configured to call the final step on job failure.
What you have to do
This script uses the running job’s ID to pull info from the system databases to relay the detailed information.
When I was testing this by passing in values and running the procedure, it didn’t give any results, unless that job was actively running.
To make sure this worked nicely the following is how it is setup in the job as the final step of the job.
It is just a simple call to execute the stored procedure which I placed in the master database on this server. There are two parameters passed into the procedure. The first, $(ESCAPE_NONE(JOBID)), captures the current job ID that is called the procedure so the procedure can look up the error messages. The second parameter is something I have in my system for an upgraded proc. I’ll cover that in more depth in a later post. For now, you can just leave the ‘,2’ off completely.
The other thing I did for this new alert step was to configure it to fail on both success and failure. I did this so that when I look at the agent job activity monitor, I will still get a visual that the job failed. There isn’t much worse than hiding a failure behind a fake success message (sometimes when dealing with AAGs and jobs it works, but that’s a different post).
Then I went back to the job’s various steps and set their failure action to call the final step rather than just fail.
That’s it. Even though a little work is required to implement this method, it is a relatively straightforward and helpful solution.
Now I get a far more detailed alert that I can evaluate more efficiently and take actions on without having to dig through the Agent history for the job.
To SQL ALert:
Date Dec 11th at 2:49 AM
Job_name = Job 001 – Load Sales
————————————–
Step name = Load Fact – Prod Sales
DB Name = master
Run Date = Dec 11 2016 2:49AM
Error = Executed as user: Domain\User. Microsoft (R) SQL Server Execute Package Utility Version 12.0.5000.0 for 64-bit Copyright (C) Microsoft Corporation. All rights reserved. Started: 2:48:53 AM Error: 2016-02-15 02:49:54.07 Code: 0xC001000E Source: Load Fact – Prod Sales
Description: The connection “Connection 1” is not found. This error is thrown by Connections collection when the specific connection element is not found. End Error Error: 2016-02-15 02:49:54.07 Code: 0xC001000E Source: Load Fact – Prod Sales Description: The connection “Connection 2” is not found. This error is thrown by Connections collection when the specific connection element is not found. End Error Error: 2016-02-15 02:49:54.95 Code: 0x00000000 Source: Get Staging Dates Description: Invalid column name ‘modified’. End Error Error: 2016-02-15 02:49:54.95 Code: 0xC002F210 Source: Get Staging Dates Execute SQL Task Description: Executing the query “select convert(smalldatetime, convert(char(10), mi…” failed with the following error: “Invalid column name ‘modified’.”. Possible failure reasons: Problems with the query, “ResultSet” property not set correctly, parameters not set correctly, or connection not established correctly. End Error DTExec: The package execution returned DTSER_FAILURE (1). Started: 2:49:53 AM Finished: 2:49:54 PA Elapsed: 1.625 seconds. The package execution failed. The step failed.
Command = /FILE “\”\\placewherewestoreSSISpackages\SSIS\DW\ProdSales\Load Fact – PROD SALES.dtsx\”” /CONFIGFILE “\”C:\SSIS\DW Extract Configuration.dtsConfig\”” /MAXCONCURRENT ” -1 ” /CHECKPOINTING OFF /SET “\”\package.variables[RunbyJobName].Value\””;”\”Load Prod Sales\”” /REPORTING E
An automated retry has been scheduled for this job.
I even get nifty other alerts, like the actual command that executed and that an automate retry has been scheduled.
Wait! An automatic retry?
Yeah, you read that right. I have elaborated on this alert message so that, based on certain criteria configured within the job, an automated retry can fire off. That second parameter above that I said to ignore. Yeah, that’s my delay in minuted between retries.
For now, I want you to have the ability to get a better error message from your failed jobs. As far as retrying jobs, well that’s for another post (soon, I promise).
Also, it wouldn’t be terribly hard to create a script to add this to all your scheduled jobs by adding the new alert step and then adjust all the other steps to call it on failure. I will probably get around to that eventually, but if you do it before me, let me know!
Comments 15
Thanks so much for posting this! Could have used something this morning when the regular DBA was out and a server miss-behaved.
How do I set this up for a 15 step job, where failed steps move to “the next step”? That is, I do not want to fail THE JOB on the failed step, but I do want to alert that failed step happened.
When will your next post, where you detail this sproc’s tailoring?
Author
The only way, off the top of my head, I can think of maybe making that movement possible is by adding the new failure alert step between each original step then that failure step moves to the next step when it completes.
Something like:
Step 1: Do action – on success go to step 3, on failure go to step 2
Step 2: send email alert – on success, move to next step, on failure stop
Step 3: Do action 2 – on success go to step 5, on failure go to step 4
Step 4: send email alert – on success, move to next step, on failure stop
step 5: Do action 3 ….. and so on
That should give you the alert, but keep the process moving. It starts to become a bit cumbersome, but I think it would work.
As for the retry logic that might help (if customized), I am trying to get that article pulled together for next week but it will all depend on how things go this week with work and other tasks.
There is a problem with the conversion for run_datetime conversion if the run_time is 3 or 4 digits. Something like ,run_datetime = CONVERT(DATETIME, LEFT(run_date, 4) + ‘/’
+ SUBSTRING(run_date, 5, 2) + ‘/’ + RIGHT(run_date, 2) + ‘ ‘
+ CAST(STUFF(STUFF(REPLACE(STR(run_time,6),’ ‘,’0′),3,0,’:’),6,0, ‘:’) AS DATETIME)) would work better.
See also msdb.dbo.agent_datetime(run_date, run_time) function for conversion.
When you run the script for the first time you will get an error message:
I think most people only want to create the procedure, so it’s easier to change ‘ALTER PROCEDURE’ in ‘CREATE PROCEDURE’.
Author
Good catch 🙂
Changed it in the sample code above.
Hello. This version only needs one parameter:
ALTER PROCEDURE [dbo].[spDBA_job_notification]
@jobid UNIQUEIDENTIFIER
Thanks.
Author
Yes, this version only has 1 parameter, the second parameter in the sample shot of the setup is in reference to an update that will be shared in a later post.
Thanks for the script. Really useful.
During testing I came across a weird issue with how SQL Server logs the time in sysjobhistory. In most circumstances, the time is logged as a 5 digit integer e.g. 23000 represents 23:00. Your script pads this out to 6 characters. However, if the run time of the job is exactly midnight, the time in sysjobhistory is stored as 0 (not 00000 as you might expect)! This results in a datetime conversation error.
The fix is simple:
run_time = REPLICATE(‘0′, 6-LEN(run_time )) + CAST(run_time as varchar)
I’m running Microsoft SQL Server 2016 (RTM) – 13.0.1601.5 (X64). I haven’t’ tested other versions.
Hi Chris,
Saw you recently at SQL Saturday in Richmond VA.
I have just implemented this and created the script to add this to every sql agent job of specified types.
I started from another script I found on stackoverflow and it referenced here as well.
We did move this from master to our dba database where we manage all our custom jobs.
/*
Purpose: Add more detailed email of reason for sql agent job failure
Source for stored procedure: https://wateroxconsulting.com/archives/better-sql-agent-job-alerts/
Source for updating all jobs: https://stackoverflow.com/questions/298494/how-to-bulk-amend-the-job-step-command-in-all-sql-server-agent-jobs
Requirements:
Add [dbo].[usp_DBA_job_notification] to the DBA database
Steps:
Pull list of jobs and steps
create new step with stepid 1 greater than max stepid
update all previous steps with @on_fail_action=4, @on_fail_step_id=,
Notes:
Script only searches in ‘TSQL’, ‘Powershell’, ‘CMDEXEC’ job type.
History
Name Date Description
George Maxson 4/13/18 Initial creation
*/
SET NOCOUNT ON;
USE msdb;
GO
DECLARE @job_id VARCHAR(MAX);
DECLARE @step_idMAX VARCHAR(10);
DECLARE @step_idNEW VARCHAR(10);
DECLARE @i INT;
— collect list of job_id and number of steps in each
begin tran
— Show count of jobs and steps before the process starts
declare @JobsFoundBefore int
set @JobsFoundBefore = (select count(a.job_id)
from sysjobs a where a.job_id in(SELECT distinct a.job_id
FROM [dbo].[sysjobsteps] as b
INNER JOIN sysjobs as a
ON a.job_id = b.job_id
WHERE [subsystem] IN (‘TSQL’, ‘Powershell’, ‘CMDEXEC’)))
declare @StepsFoundBefore int
set @StepsFoundBefore = (SELECT count(b.step_id)
FROM [dbo].[sysjobsteps] as b
INNER JOIN sysjobs as a
ON a.job_id = b.job_id
WHERE [subsystem] IN (‘TSQL’, ‘Powershell’, ‘CMDEXEC’))
print ‘Before: Jobs found: ‘+ cast(@JobsFoundBefore as varchar(10))+’, Steps found: ‘+ cast(@StepsFoundBefore as varchar(10))
— List jobs and steps that will be updated
SELECT a.name as job_name, b.step_id, b.step_name,
case
when b.on_fail_action = 1 then ‘Quit the job reporting success.’
when b.on_fail_action = 2 then ‘Quit the job reporting failure.’
when b.on_fail_action = 3 then ‘Go to the next step.’
when b.on_fail_action = 4 then ‘Go to step.’
end as on_fail_action
, b.on_fail_step_id, b.subsystem,
b.command, b.database_name, a.enabled,
a.description, a.date_created, b.job_id
FROM [dbo].[sysjobsteps] as b
INNER JOIN sysjobs as a
ON a.job_id = b.job_id
WHERE [subsystem] IN (‘TSQL’, ‘Powershell’, ‘CMDEXEC’)
— Start update process
DECLARE db_cursor CURSOR
FOR SELECT b.job_id,
MAX(b.step_id)
FROM msdb.[dbo].[sysjobsteps] AS b
WHERE [subsystem] IN(‘TSQL’, ‘Powershell’, ‘CMDEXEC’)
GROUP BY b.job_id;
OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @job_id, @step_idMAX;
WHILE @@FETCH_STATUS = 0
BEGIN
— Create JobFailureResponse step as next after maximum current step
SET @step_idNEW = @step_idMAX + 1;
EXEC msdb.dbo.sp_add_jobstep
@job_id = @job_id,
@step_name = N’JobFailureResponse’,
@step_id = @step_idNEW,
@cmdexec_success_code = 0,
@on_success_action = 2,
@on_fail_action = 2,
@retry_attempts = 0,
@retry_interval = 0,
@os_run_priority = 0,
@subsystem = N’TSQL’,
@command = N’exec [dbo].[usp_DBA_job_notification] $(ESCAPE_NONE(JOBID))’,
@database_name = N’DBA’,
@flags = 0;
SET @i = 0;
— Update all jobs previous steps to go to new step on failure
WHILE @i < @step_idMAX
BEGIN
SET @i = @i + 1;
EXEC msdb.dbo.sp_update_jobstep
@job_id = @job_id,
@step_id = @i,
@on_fail_action = 4,
@on_fail_step_id = @step_idNEW;
END;
FETCH NEXT FROM db_cursor INTO @job_id, @step_idMAX;
END;
CLOSE db_cursor;
DEALLOCATE db_cursor;
— End Update Process
— Show changes
declare @JobsFoundAfter int
set @JobsFoundAfter = (select count(a.job_id)
from sysjobs a where a.job_id in(SELECT distinct a.job_id
FROM [dbo].[sysjobsteps] as b
INNER JOIN sysjobs as a
ON a.job_id = b.job_id
WHERE [subsystem] IN ('TSQL', 'Powershell', 'CMDEXEC')))
declare @StepsFoundAfter int
set @StepsFoundAfter = (SELECT count(b.step_id)
FROM [dbo].[sysjobsteps] as b
INNER JOIN sysjobs as a
ON a.job_id = b.job_id
WHERE [subsystem] IN ('TSQL', 'Powershell', 'CMDEXEC'))
print 'After: Jobs found: '+ cast(@JobsFoundAfter as varchar(10))+', Steps found: '+ cast(@StepsFoundAfter as varchar(10))
— List jobs and steps updated
SELECT a.name as job_name, b.step_id, b.step_name,
case
when b.on_fail_action = 1 then 'Quit the job reporting success.'
when b.on_fail_action = 2 then 'Quit the job reporting failure.'
when b.on_fail_action = 3 then 'Go to the next step.'
when b.on_fail_action = 4 then 'Go to step.'
end as on_fail_action
, b.on_fail_step_id, b.subsystem,
b.command, b.database_name, a.enabled,
a.description, a.date_created, b.job_id
FROM [dbo].[sysjobsteps] as b
INNER JOIN sysjobs as a
ON a.job_id = b.job_id
WHERE [subsystem] IN ('TSQL', 'Powershell', 'CMDEXEC')
–rollback tran
–commit tran
Hi Chris,
I have been able to get your sp working in our environment. My only issue or question, in the email output, I wish the DB Name would reflect the DB that the SQL Job was running against, not the master DB from which the SP is called. Anyway to make that portion of email more informative?
————————————–
Step name= Run Package
DB Name = master
Run Date = Apr 17 2018 11:00AM
Author
I would have to poke around and play with this more to recreate the issue and then figure out a possible solution. No guarantee when I’ll be able to get to that. Let me know if you figure something out before I do!
Hi Chris,
We use Ola Hallengren scripts for our backups among other things. Consequently the error messages generated are longer due to the number of parameters in the DatabaseBackup stored procedure. In looking into why the email body seemed to be cut off, I made this change to address the issue. Is there maybe another issue that is the reason for the smaller size originally?
,@ErrMessage VARCHAR(4000) –was 1024, mapped to dbo.sysjobhistory which is nvarchar(4000)
Author
No reason other than I tend to start with smaller values if I can and then grow rather than go the other way. I use Ola’s script in some places too and haven’t hit the limit but I don’t use a ton of variables in the call I guess 🙂