notifications

Better SQL Agent Job Alerts

Ever receive alerts from the SQL Agent?

 

They kinda suck. I mean it is nice that we have the option for the SQL Agent to tell us when a job fails or succeeds, but if you have ever looked at the message, there isn’t much detail in it.

Take a look at this (slightly edited) email I used to get from our systems:

From SQLServer01
To SQL ALert:
Date Dec 11th, 2016 at 2:46 AM
JOB RUN: ‘Job 001 – Load Sales’ was run on 12/11/2016 at 2:40:47 AM
DURATION: 0 hours, 5 minutes, 25 seconds
STATUS: Failed
MESSAGES: The job failed. The Job was invoked by User Domain/User. The last step to run was step 3 (Load Fact – Prod Sales). The job was requested to start at step 1 (Extract Fact – Prod Sales).

Not very helpful. Sure, I know the job failed, and what step it failed on, but now I have to connect to the agent and look up the history to determine if this is something I have to worry about.

It would be nice to receive the details seen in the history of the job showing up in the email alert received.

Recently I have been working with systems that had all the alert on failure configured so we knew when things failed and could jump on re-running them if needed. We even had them showing up into a data team slack channel, so we had a history as well as notification to everyone on the team at the same time. The problem is that there were not any details in the alerts we received so we had to be able to connect and figure out what to do next or hope that our paid monitoring service would act on something after reading the details of the failure.

It started 10 years ago

Way back in 2007, Jameel Ahmed wrote an article for SQLServerPro on how to get more detailed error messages from failed SQL Agent jobs. I downloaded his code and adapted it some to what I have below. I cleaned out some parts I didn’t need, and it worked like a charm!

 

Once the code is in place, a slight change to the scheduled jobs needs to happen.
A new step has to be added to call this procedure, and all previous steps configured to call the final step on job failure.

What you have to do

This script uses the running job’s ID to pull info from the system databases to relay the detailed information.
When I was testing this by passing in values and running the procedure, it didn’t give any results, unless that job was actively running.
To make sure this worked nicely the following is how it is setup in the job as the final step of the job.

Report Error

It is just a simple call to execute the stored procedure which I placed in the master database on this server. There are two parameters passed into the procedure. The first, $(ESCAPE_NONE(JOBID)), captures the current job ID that is called the procedure so the procedure can look up the error messages. The second parameter is something I have in my system for an upgraded proc. I’ll cover that in more depth in a later post. For now, you can just leave the ‘,2’ off completely.

The other thing I did for this new alert step was to configure it to fail on both success and failure. I did this so that when I look at the agent job activity monitor, I will still get a visual that the job failed. There isn’t much worse than hiding a failure behind a fake success message (sometimes when dealing with AAGs and jobs it works, but that’s a different post).

 

both to fail

Then I went back to the job’s various steps and set their failure action to call the final step rather than just fail.

That’s it. Even though a little work is required to implement this method, it is a relatively straightforward and helpful solution.

Now I get a far more detailed alert that I can evaluate more efficiently and take actions on without having to dig through the Agent history for the job.

 

From SQLServer01
To SQL ALert:
Date Dec 11th at 2:49 AM
Job_name = Job 001 – Load Sales
————————————–
Step name = Load Fact – Prod Sales
DB Name = master
Run Date = Dec 11 2016 2:49AM
Error = Executed as user: Domain\User. Microsoft (R) SQL Server Execute Package Utility Version 12.0.5000.0 for 64-bit Copyright (C) Microsoft Corporation. All rights reserved. Started: 2:48:53 AM Error: 2016-02-15 02:49:54.07 Code: 0xC001000E Source: Load Fact – Prod Sales
Description: The connection “Connection 1” is not found. This error is thrown by Connections collection when the specific connection element is not found. End Error Error: 2016-02-15 02:49:54.07 Code: 0xC001000E Source: Load Fact – Prod Sales Description: The connection “Connection 2” is not found. This error is thrown by Connections collection when the specific connection element is not found. End Error Error: 2016-02-15 02:49:54.95 Code: 0x00000000 Source: Get Staging Dates Description: Invalid column name ‘modified’. End Error Error: 2016-02-15 02:49:54.95 Code: 0xC002F210 Source: Get Staging Dates Execute SQL Task Description: Executing the query “select convert(smalldatetime, convert(char(10), mi…” failed with the following error: “Invalid column name ‘modified’.”. Possible failure reasons: Problems with the query, “ResultSet” property not set correctly, parameters not set correctly, or connection not established correctly. End Error DTExec: The package execution returned DTSER_FAILURE (1). Started: 2:49:53 AM Finished: 2:49:54 PA Elapsed: 1.625 seconds. The package execution failed. The step failed.

Command = /FILE “\”\\placewherewestoreSSISpackages\SSIS\DW\ProdSales\Load Fact – PROD SALES.dtsx\”” /CONFIGFILE “\”C:\SSIS\DW Extract Configuration.dtsConfig\”” /MAXCONCURRENT ” -1 ” /CHECKPOINTING OFF /SET “\”\package.variables[RunbyJobName].Value\””;”\”Load Prod Sales\”” /REPORTING E

An automated retry has been scheduled for this job.

I even get nifty other alerts, like the actual command that executed and that an automate retry has been scheduled.

Wait! An automatic retry?

Yeah, you read that right. I have elaborated on this alert message so that, based on certain criteria configured within the job, an automated retry can fire off. That second parameter above that I said to ignore. Yeah, that’s my delay in minuted between retries.

For now, I want you to have the ability to get a better error message from your failed jobs. As far as retrying jobs, well that’s for another post (soon, I promise).

Also, it wouldn’t be terribly hard to create a script to add this to all your scheduled jobs by adding the new alert step and then adjust all the other steps to call it on failure. I will probably get around to that eventually, but if you do it before me, let me know!

Comments 7

  1. Thanks so  much for posting this!  Could have used something this morning when the regular DBA was out and a server miss-behaved.

  2. How do I set this up for a 15 step job, where failed steps move to “the next step”?  That is, I do not want to fail THE JOB on the failed step, but I do want to alert that failed step happened.

    When will your next post, where you detail this sproc’s tailoring?

    1. Post
      Author

      The only way, off the top of my head, I can think of maybe making that movement possible is by adding the new failure alert step between each original step then that failure step moves to the next step when it completes.
      Something like:

      Step 1: Do action – on success go to step 3, on failure go to step 2
      Step 2: send email alert – on success, move to next step, on failure stop
      Step 3: Do action 2 – on success go to step 5, on failure go to step 4
      Step 4: send email alert – on success, move to next step, on failure stop
      step 5: Do action 3 ….. and so on

      That should give you the alert, but keep the process moving. It starts to become a bit cumbersome, but I think it would work.

      As for the retry logic that might help (if customized), I am trying to get that article pulled together for next week but it will all depend on how things go this week with work and other tasks.

  3. There is a problem with the conversion for run_datetime conversion if the run_time is 3 or 4 digits.  Something like  ,run_datetime = CONVERT(DATETIME, LEFT(run_date, 4) + ‘/’
    + SUBSTRING(run_date, 5, 2) + ‘/’ + RIGHT(run_date, 2) + ‘ ‘
    + CAST(STUFF(STUFF(REPLACE(STR(run_time,6),’ ‘,’0′),3,0,’:’),6,0, ‘:’) AS DATETIME)) would work better.

  4. When you run the script for the first time you will get an error message:

    I think most people only want to create the procedure, so it’s easier to change ‘ALTER PROCEDURE’ in ‘CREATE PROCEDURE’.

    1. Post
      Author

Leave a Reply

Your email address will not be published. Required fields are marked *