Real-Time Notification Monitoring: Best Practices
How to ensure your notifications actually reach users and get acted on
Sending notifications is only half the battle. According to Twilio's messaging research, 30% of business-critical notifications fail to reach users due to delivery issues, spam filters, or configuration errors. Without proper monitoring, these failures go unnoticed until customers complain.
This guide covers everything you need to monitor your notification infrastructure effectively, from basic delivery tracking to advanced observability patterns.
In this guide, you will learn how to:
- Set up real-time delivery tracking
- Configure alerts for critical failures
- Track key performance metrics
- Debug common delivery issues
- Build notification dashboards
- Implement proactive health checks
Why Notification Monitoring Matters
The Cost of Silent Failures
| Scenario | Without Monitoring | With Monitoring |
|---|---|---|
| Password reset emails failing | Users locked out, support tickets | Immediate alert, quick fix |
| SMS provider outage | Customers miss OTPs | Automatic failover triggered |
| Email marked as spam | 0% delivery, no visibility | Deliverability alert |
| Push token expired | Silent failure | Token refresh triggered |
Key Metrics to Track
| Metric | Description | Target |
|---|---|---|
| Delivery Rate | Notifications successfully delivered | > 98% |
| Latency | Time from request to delivery | < 5 seconds |
| Bounce Rate | Hard + soft bounces for email | < 2% |
| Error Rate | Failed delivery attempts | < 1% |
| Queue Depth | Pending notifications | Near zero |
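These targets reduce to simple ratios over your send counts. A minimal sketch of computing them (the `DeliveryStats` shape here is illustrative, not a NotiGrid API type):

```typescript
// Illustrative shape; field names are assumptions, not NotiGrid API types.
interface DeliveryStats {
  sent: number
  delivered: number
  failed: number
  bounced: number
  queued: number
}

// Compute the core health metrics from raw counts.
function computeMetrics(stats: DeliveryStats) {
  return {
    deliveryRate: stats.sent > 0 ? stats.delivered / stats.sent : 0,
    errorRate: stats.sent > 0 ? stats.failed / stats.sent : 0,
    bounceRate: stats.sent > 0 ? stats.bounced / stats.sent : 0,
    queueDepth: stats.queued
  }
}

const m = computeMetrics({ sent: 1000, delivered: 985, failed: 10, bounced: 5, queued: 2 })
console.log(m.deliveryRate) // → 0.985
```

Comparing each value against the targets in the table gives you a quick pass/fail health check per channel.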
Setting Up Delivery Tracking
NotiGrid Notification Logs
Every notification sent through NotiGrid is logged with full context:
// Send notification and capture result
const result = await notigrid.notify({
channelId: 'order-confirmations',
to: 'customer@example.com',
variables: {
orderId: 'ORD-123',
customerName: 'Jane'
}
})
console.log('Notification ID:', result.id)
console.log('Status:', result.status) // queued, sent, delivered, failed
Querying Notification Status
// Check status of a specific notification
const notification = await notigrid.notifications.get(notificationId)
console.log({
id: notification.id,
status: notification.status,
channel: notification.channelId,
createdAt: notification.createdAt,
sentAt: notification.sentAt,
deliveredAt: notification.deliveredAt,
error: notification.error
})
Listing Recent Notifications
// Get recent notifications with filters
const notifications = await notigrid.notifications.list({
status: 'failed',
channel: 'order-confirmations',
from: new Date(Date.now() - 24 * 60 * 60 * 1000), // Last 24 hours
limit: 100
})
for (const notification of notifications.items) {
console.log(`${notification.id}: ${notification.status} - ${notification.error}`)
}
Configuring Real-Time Alerts
Webhook Events
Configure webhooks to receive real-time delivery events:
// In your Express/Fastify app
app.post('/webhooks/notigrid', async (req, res) => {
const event = req.body
switch (event.type) {
case 'notification.delivered':
// Update your database
await db.notifications.update(event.data.id, {
deliveredAt: new Date(event.data.deliveredAt)
})
break
case 'notification.failed':
// Alert your team
await alertTeam({
channel: 'notification-alerts',
message: `Notification ${event.data.id} failed: ${event.data.error}`,
severity: 'high'
})
break
case 'notification.bounced':
// Handle email bounce
await handleBounce(event.data.recipient, event.data.bounceType)
break
}
res.status(200).send('OK')
})
Alert Thresholds
Set up alerts based on thresholds:
// Example: Alert if failure rate exceeds 5%
async function checkFailureRate() {
const stats = await notigrid.stats.get({
period: '1h',
metric: 'failure_rate'
})
if (stats.failureRate > 0.05) {
await slack.send({
channel: '#alerts',
text: `:warning: High notification failure rate: ${(stats.failureRate * 100).toFixed(1)}%`,
attachments: [{
color: 'danger',
fields: [
{ title: 'Total Sent', value: stats.total, short: true },
{ title: 'Failed', value: stats.failed, short: true }
]
}]
})
}
}
// Run every 5 minutes
setInterval(checkFailureRate, 5 * 60 * 1000)
Tracking Key Metrics
Essential Dashboard Metrics
Build a monitoring dashboard with these core metrics:
1. Delivery Success Rate
const successRate = await notigrid.stats.deliveryRate({
period: '24h',
groupBy: 'channel'
})
// Returns:
// {
// overall: 0.984,
// byChannel: {
// 'order-confirmations': 0.991,
// 'password-reset': 0.988,
// 'marketing': 0.962
// }
// }
2. Latency Percentiles
const latency = await notigrid.stats.latency({
period: '24h',
percentiles: [50, 95, 99]
})
// Returns:
// {
// p50: 1.2, // 50% delivered within 1.2 seconds
// p95: 3.8, // 95% delivered within 3.8 seconds
// p99: 8.1 // 99% delivered within 8.1 seconds
// }
3. Error Breakdown
const errors = await notigrid.stats.errors({
period: '24h',
groupBy: 'error_type'
})
// Returns:
// {
// 'invalid_recipient': 45,
// 'provider_error': 12,
// 'rate_limited': 8,
// 'template_error': 3
// }
Per-Provider Metrics
Track each provider separately to identify issues:
const providerStats = await notigrid.stats.byProvider({
period: '24h'
})
// Returns:
// {
// 'ses': { sent: 5000, delivered: 4950, failed: 50, latency_p95: 2.1 },
// 'twilio': { sent: 500, delivered: 495, failed: 5, latency_p95: 1.8 },
// 'fcm': { sent: 2000, delivered: 1980, failed: 20, latency_p95: 0.8 }
// }
Debugging Failed Notifications
Common Failure Categories
| Error Type | Cause | Solution |
|---|---|---|
| invalid_recipient | Bad email/phone format | Validate before sending |
| bounced | Email address doesn't exist | Remove from list |
| complained | User marked as spam | Unsubscribe immediately |
| rate_limited | Provider limit exceeded | Implement queuing |
| provider_error | Provider API failure | Enable failover |
| template_error | Missing variable | Fix template |
| expired_token | Push token invalid | Request new token |
Debugging Workflow
async function debugFailedNotification(notificationId: string) {
// 1. Get notification details
const notification = await notigrid.notifications.get(notificationId)
console.log('Notification:', {
id: notification.id,
channel: notification.channelId,
recipient: notification.to,
status: notification.status,
error: notification.error,
createdAt: notification.createdAt
})
// 2. Get step-by-step execution log
const logs = await notigrid.notifications.logs(notificationId)
for (const log of logs) {
console.log(`[${log.timestamp}] ${log.step}: ${log.message}`)
if (log.error) {
console.log(' Error:', log.error)
console.log(' Provider Response:', log.providerResponse)
}
}
// 3. Check provider status
const integration = await notigrid.integrations.get(notification.integrationId)
console.log('Provider Status:', integration.status)
// 4. Suggest fix
return suggestFix(notification.error)
}
function suggestFix(error: string): string {
const fixes: Record<string, string> = {
'invalid_recipient': 'Validate email/phone format before sending',
'bounced': 'Remove recipient from mailing list',
'rate_limited': 'Reduce send rate or upgrade provider plan',
'template_error': 'Check template variables match sent data',
'expired_token': 'Request new push token from device'
}
return fixes[error] || 'Contact support for assistance'
}
Retry Failed Notifications
// Retry a single failed notification
await notigrid.notifications.retry(notificationId)
// Bulk retry failed notifications
const failed = await notigrid.notifications.list({
status: 'failed',
from: new Date(Date.now() - 60 * 60 * 1000), // Last hour
limit: 100
})
for (const notification of failed.items) {
if (isRetryable(notification.error)) {
await notigrid.notifications.retry(notification.id)
}
}
function isRetryable(error: string): boolean {
const retryable = ['provider_error', 'rate_limited', 'timeout']
return retryable.includes(error)
}
Building Notification Dashboards
Grafana Integration
Export metrics to Grafana for visualization:
// Prometheus metrics endpoint
app.get('/metrics', async (req, res) => {
const stats = await notigrid.stats.get({ period: '5m' })
const metrics = `
# HELP notifications_sent_total Total notifications sent
# TYPE notifications_sent_total counter
notifications_sent_total ${stats.total}
# HELP notifications_delivered_total Total notifications delivered
# TYPE notifications_delivered_total counter
notifications_delivered_total ${stats.delivered}
# HELP notifications_failed_total Total notifications failed
# TYPE notifications_failed_total counter
notifications_failed_total ${stats.failed}
# HELP notification_latency_seconds Notification delivery latency
# TYPE notification_latency_seconds histogram
notification_latency_seconds_bucket{le="1"} ${stats.latency.under1s}
notification_latency_seconds_bucket{le="5"} ${stats.latency.under5s}
notification_latency_seconds_bucket{le="10"} ${stats.latency.under10s}
notification_latency_seconds_bucket{le="+Inf"} ${stats.total}
`
res.set('Content-Type', 'text/plain')
res.send(metrics)
})
Key Dashboard Panels
- Delivery Rate Over Time - Line chart showing success rate
- Latency Percentiles - P50, P95, P99 latency trends
- Error Distribution - Pie chart of error types
- Notifications by Channel - Bar chart comparing channels
- Provider Health - Status indicators per provider
Proactive Health Checks
Synthetic Monitoring
Send test notifications regularly to verify the system is working:
// Synthetic notification test
async function syntheticTest() {
const startTime = Date.now()
try {
const result = await notigrid.notify({
channelId: 'synthetic-test',
to: 'monitoring@yourdomain.com',
variables: {
testId: `test-${Date.now()}`,
timestamp: new Date().toISOString()
}
})
const latency = Date.now() - startTime
// Report success to monitoring
await metrics.record({
name: 'synthetic_test_latency',
value: latency,
tags: { status: 'success' }
})
return { success: true, latency }
} catch (error) {
// Alert on failure
await alertTeam({
channel: '#critical-alerts',
message: `Synthetic test failed: ${error.message}`,
severity: 'critical'
})
return { success: false, error: error.message }
}
}
// Run every 5 minutes
setInterval(syntheticTest, 5 * 60 * 1000)
Provider Health Monitoring
async function checkProviderHealth() {
const providers = await notigrid.integrations.list()
for (const provider of providers.items) {
const health = await notigrid.integrations.healthCheck(provider.id)
if (!health.healthy) {
await alertTeam({
channel: '#infrastructure',
message: `Provider ${provider.name} is unhealthy: ${health.error}`,
severity: 'high'
})
}
}
}
// Check every 10 minutes
setInterval(checkProviderHealth, 10 * 60 * 1000)
Best Practices Summary
Do
- Track delivery rates for every channel and provider
- Set up alerts for failures above threshold
- Log notification IDs in your application for correlation
- Use webhooks for real-time status updates
- Run synthetic tests to catch issues proactively
- Monitor queue depth to detect backlogs
Avoid
- Ignoring bounces - they harm sender reputation
- Silent failures - always log and alert
- Manual checking - automate everything
- Single provider - use failover for critical notifications
- Missing metadata - include context for debugging
Frequently Asked Questions
How long are notification logs retained?
NotiGrid retains logs for 30 days on Free/Pro plans and 90 days on Team/Scale plans. Enterprise customers can configure custom retention.
Can I export logs for compliance?
Yes, use the logs API to export notification history to your own storage for compliance requirements like GDPR or HIPAA.
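A minimal export sketch: serialize records to NDJSON for archival, then append each page to your own storage. The record shape mirrors the status fields shown earlier; the client and storage calls in the comment are assumptions, not confirmed API signatures.

```typescript
// Record shape mirroring the notification fields shown earlier in this guide.
interface NotificationRecord {
  id: string
  status: string
  channelId: string
  createdAt: string
  error?: string
}

// Serialize records to newline-delimited JSON, a common archival format.
function toNdjson(records: NotificationRecord[]): string {
  return records.map(r => JSON.stringify(r)).join('\n')
}

// Hypothetical export loop (notigrid client and storage helper are assumptions):
// const page = await notigrid.notifications.list({ from: exportStart, limit: 100 })
// await storage.append('notigrid-export.ndjson', toNdjson(page.items))
```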
How do I set up alerts for specific channels?
Use webhooks with filtering, or configure alert rules in the NotiGrid dashboard to trigger on specific channels or error types.
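For the webhook route, a small predicate keeps channel-specific filtering out of the handler's switch statement. This is a sketch: the channel names, rule shape, and the presence of `channelId` on the event payload are assumptions to confirm against your webhook schema.

```typescript
// Per-channel alert rules; the rule shape here is illustrative.
const alertRules: Record<string, { minSeverity: string }> = {
  'password-reset': { minSeverity: 'critical' },
  'order-confirmations': { minSeverity: 'high' }
}

// Only alert on failures for channels that have a rule configured.
function shouldAlert(event: { type: string; data: { channelId: string } }): boolean {
  return event.type === 'notification.failed' && event.data.channelId in alertRules
}
```

Inside the webhook handler shown above, call `shouldAlert(event)` before invoking your alerting helper.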
What is the best way to handle rate limiting?
NotiGrid automatically handles rate limiting with exponential backoff. For high-volume senders, enable provider failover to maintain delivery during limits.
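If you implement backoff client-side (for example, around bulk retries), the standard pattern is exponential delay with jitter. A generic sketch, not a NotiGrid API:

```typescript
// Retry an async operation with exponential backoff plus random jitter.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err
      // 500ms, 1s, 2s, ... plus up to 25% jitter to avoid retry stampedes
      const delay = baseDelayMs * 2 ** attempt * (1 + Math.random() * 0.25)
      await new Promise(resolve => setTimeout(resolve, delay))
    }
  }
}
```

Wrap your send or retry calls in `withBackoff(() => notigrid.notifications.retry(id))` to absorb transient `rate_limited` and `provider_error` failures.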
How do I correlate notifications with my application events?
Include metadata in your notify calls (like orderId, userId) which will be preserved in logs and webhooks for correlation.
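A small builder makes the correlation convention explicit. The `metadata` field name follows the FAQ answer above; treat the exact payload shape as an assumption to verify against the API reference.

```typescript
// Build a notify payload carrying your application's correlation IDs.
// The `metadata` field name is an assumption based on the FAQ above.
function buildNotifyPayload(
  channelId: string,
  to: string,
  variables: Record<string, string>,
  correlation: Record<string, string>
) {
  return { channelId, to, variables, metadata: correlation }
}

// Usage (hypothetical):
// await notigrid.notify(buildNotifyPayload(
//   'order-confirmations', 'customer@example.com',
//   { orderId: 'ORD-123', customerName: 'Jane' },
//   { orderId: 'ORD-123', userId: 'usr_42' }
// ))
```

When a webhook or log entry arrives, the same `orderId`/`userId` values let you join it back to the originating application event.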
Summary
Effective notification monitoring requires:
- Delivery Tracking - Log every notification with status
- Real-Time Alerts - Know about failures immediately
- Key Metrics - Track delivery rate, latency, errors
- Debugging Tools - Quickly identify and fix issues
- Proactive Checks - Catch problems before users report them
With proper monitoring, you can achieve 99%+ delivery rates and catch issues before they impact users.
Next Steps
- Getting Started with NotiGrid - Set up your first notification
- Multi-Channel Notifications Guide - Build resilient notification systems
- Integration Guide - Connect your providers
- API Documentation - Full API reference
Need Help?
Email Support: support@notigrid.com
Documentation: docs.notigrid.com
Schedule a Demo: notigrid.com/demo
Ready to send your first notification?
Get started with NotiGrid today and send notifications across email, SMS, Slack, and more.