In the previous article, we covered the high-level design of the notification system: its architecture, its components, and the notification processing workflow. In this article, we cover the notification scheduling workflow in depth.
Scheduling notifications in a distributed and highly scalable notification system requires careful consideration of several factors like managing large volumes of requests, ensuring timely delivery, handling failures, retries, and maintaining consistency. This process can be broken down into key components and architectural elements that allow the system to handle massive traffic while ensuring high availability.
1. Purpose of Notification Scheduling:
The Scheduler Service ensures that notifications are sent at the right time, based on user preferences, time zones, event-driven triggers, or system-defined schedules (e.g., marketing campaigns, event reminders, batch notifications). It decouples the notification request from the actual sending process, allowing for flexible and reliable delivery.
Core Requirements:
- Scalability: The system should be able to handle millions of notifications, especially during high-traffic periods (e.g., promotions, alerts, batch sends).
- High Availability: The system should be fault-tolerant, ensuring that notifications are sent on time even if some components fail.
- Resilience: It must gracefully handle failures and retry sending notifications when needed.
- Latency Minimization: Notifications should be scheduled and delivered in real-time or near real-time to avoid significant delays.
- Scheduling Policies: Support different notification strategies such as one-time, recurring, delayed, or burst schedules.
Design for Scale and High Availability
1. Distributed Scheduling System
A distributed scheduler is critical for handling the scale. The notification system must scale horizontally to support a large number of scheduled notifications. This can be achieved through sharding, where different scheduling nodes are responsible for managing subsets of the notifications.
- Master-Worker Pattern: The scheduling process can follow the master-worker model where:
- Master: Coordinates and manages the overall scheduling process. It assigns scheduling tasks to worker nodes.
- Workers: Process individual notification schedules. They take care of enqueuing notifications at the right time for delivery.
- Workers can scale horizontally to process more scheduled tasks.
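As a minimal sketch of the sharding idea above, the following assumes each scheduled notification has a string ID and that the master assigns it to one of a fixed number of scheduling nodes by hashing that ID. The names `shard_for` and `assign_to_shards` are hypothetical and used only for illustration.

```python
import hashlib


def shard_for(notification_id: str, num_shards: int) -> int:
    """Map a notification ID to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because hash() is
    randomized per process for strings, so nodes would disagree.
    """
    digest = hashlib.sha256(notification_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def assign_to_shards(notification_ids, num_shards):
    """Group notification IDs by the shard (scheduling node) that owns them."""
    shards = {index: [] for index in range(num_shards)}
    for notification_id in notification_ids:
        shards[shard_for(notification_id, num_shards)].append(notification_id)
    return shards


if __name__ == "__main__":
    ids = [f"notif-{n}" for n in range(10)]
    for shard, owned in assign_to_shards(ids, 3).items():
        print(shard, owned)
```

Plain modulo hashing moves most keys when the number of shards changes. Consistent hashing is a common alternative when scheduling nodes are added or removed often.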
2. Queue-Based Scheduling
- Job Queues: Use message queues like RabbitMQ, Kafka, or AWS SQS to manage the scheduling and delivery tasks asynchronously. This approach decouples the scheduling and sending of notifications, improving system resilience.
- Notifications that need to be sent are placed in different queues (e.g., SMS queue, email queue, push notification queue) according to the type of notification and user preferences.
- Each worker or consumer listens to the corresponding queue and processes the notification at the scheduled time.
- Priority Queues: Some notifications (e.g., critical alerts) may need to be delivered ahead of others. A priority queue system can help in ensuring that higher-priority notifications are processed first.
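To illustrate the priority queue idea, here is a small in-memory sketch built on Python's `heapq`. It is not a substitute for a broker such as RabbitMQ, Kafka, or SQS. The class name `PriorityNotificationQueue` and the convention that a lower number means higher priority are assumptions made for this example.

```python
import heapq
import itertools


class PriorityNotificationQueue:
    """In-memory priority queue; lower priority number is served first."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal-priority items stay in FIFO order.
        self._counter = itertools.count()

    def push(self, priority: int, payload: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), payload))

    def pop(self) -> str:
        _priority, _order, payload = heapq.heappop(self._heap)
        return payload

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    queue = PriorityNotificationQueue()
    queue.push(2, "promotional email")
    queue.push(0, "security alert")
    queue.push(1, "event reminder")
    while queue:
        print(queue.pop())
```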
Components of the Scheduling System
1. Scheduler Service
Role: Coordinates when each notification needs to be sent based on predefined criteria, such as user preferences, event-driven triggers, or specific timings.
Design:
- Task Scheduler: Handles job scheduling with advanced features like delays, retries, and triggers. It ensures that tasks are executed at the right time.
- Task Distributor: Ensures that scheduled tasks are distributed across different worker nodes for load balancing.
- Cron-based Scheduler: Can be used for recurring notifications (e.g., daily updates, weekly newsletters).
Interactions:
- Receives notification requests from upstream services (Notification Service, APIs).
- Uses message queues to communicate scheduled jobs to workers responsible for delivery.
2. Worker Nodes
Role: Responsible for processing scheduled notifications and sending them at the right time.
Design:
- Polling System: Workers poll the message queues for scheduled jobs and process them.
- Horizontal Scalability: Workers can scale horizontally, meaning the more workers added, the more tasks can be processed concurrently.
- Failover Mechanism: If a worker node fails, other nodes can take over its scheduled tasks. Workers are stateless, which allows them to pick up tasks dynamically from the queue.
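The polling behavior described above can be sketched with an in-memory schedule ordered by send time. This assumes jobs carry a numeric `send_at` timestamp and that the current time is passed in explicitly, which keeps the worker stateless and easy to test. `ScheduleQueue` and `poll_due` are hypothetical names.

```python
import heapq


class ScheduleQueue:
    """Holds scheduled jobs ordered by their send time."""

    def __init__(self):
        self._heap = []
        self._sequence = 0  # tie-breaker for jobs with the same send time

    def schedule(self, send_at: float, job: str) -> None:
        heapq.heappush(self._heap, (send_at, self._sequence, job))
        self._sequence += 1

    def poll_due(self, now: float) -> list:
        """Remove and return every job whose send time is at or before now."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due


if __name__ == "__main__":
    queue = ScheduleQueue()
    queue.schedule(100.0, "weekly newsletter")
    queue.schedule(50.0, "event reminder")
    print(queue.poll_due(now=60.0))
```

Because the queue holds all the state, any worker can call `poll_due` and pick up work that a failed worker never reached.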
3. Notification Queues
Role: Act as the backbone for communication between the scheduler and the workers.
Design:
- Queue Sharding: To handle high traffic, queues can be sharded based on notification type, user segment, or region. This improves parallel processing and prevents bottlenecks.
- Priority Queuing: Supports sending high-priority notifications before regular ones. The system can ensure time-sensitive notifications are not delayed by low-priority tasks.
4. Retry Service Integration
Role: Handles retrying failed notifications as part of the scheduling process.
Design:
- Failed notifications are re-queued into the retry queue based on retry policies (e.g., exponential backoff).
- The Scheduler Service coordinates with the Retry Service to ensure that retries happen at the right intervals and that no notification is retried excessively.
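A retry policy with exponential backoff could look roughly like the sketch below. The base delay, the cap, and the maximum number of attempts are illustrative assumptions, not values from the design.

```python
def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles on each attempt and is capped so that it never
    grows without bound.
    """
    return min(cap, base * (2 ** attempt))


def should_retry(attempt: int, max_attempts: int = 5) -> bool:
    """Stop retrying once the attempt budget is spent."""
    return attempt < max_attempts


if __name__ == "__main__":
    for attempt in range(7):
        if not should_retry(attempt):
            print(f"attempt {attempt}: giving up")
            break
        print(f"attempt {attempt}: wait {backoff_delay(attempt)}s")
```

In practice, a small random jitter is often added to the delay so that many failed notifications do not all retry at the same instant.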
Interactions and Workflow

- Client Request: The client (e.g., another service or end-user) sends a notification scheduling request via an API or internal system trigger.
- Schedule Storage: The Scheduler Service stores the scheduling metadata (e.g., user ID, time to send, notification type) in a database and, if necessary, caches it for fast retrieval.
- Queueing Job: Based on the scheduling policy (immediate, delayed, recurring), the Scheduler Service places the job in the appropriate queue.
- Worker Polling: Worker nodes listen to the queue, and when the time comes, they pick up the job for execution.
- Notification Sending: The worker retrieves user preferences from the cache/DB, fetches the appropriate notification template, and sends the notification through the corresponding channel (e.g., email, SMS, push notification).
- Handling Failures: If a notification fails (e.g., service downtime), it is re-queued in the retry queue and processed according to retry policies.
- Tracking and Logging: The result (success, failure) is logged for further analysis and reporting.
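The workflow above can be compressed into a single worker pass, shown here as a rough sketch. It assumes in-memory deques stand in for the real queues, that `send` is a caller-supplied function which raises `ConnectionError` when the channel provider is down, and that `Job` and `process` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    user_id: str
    channel: str      # e.g. "email", "sms", "push"
    send_at: float    # scheduled send time
    attempts: int = 0


def process(queue: deque, retry_queue: deque, log: list, send, now: float) -> None:
    """Run one worker pass: send due jobs, re-queue failures, log results."""
    not_due = deque()
    while queue:
        job = queue.popleft()
        if job.send_at > now:
            not_due.append(job)  # leave future jobs for a later pass
            continue
        try:
            send(job)
            log.append((job.user_id, "success"))
        except ConnectionError:
            job.attempts += 1
            retry_queue.append(job)  # retry policy is applied downstream
            log.append((job.user_id, "failure"))
    queue.extend(not_due)


if __name__ == "__main__":
    def send(job):
        print(f"sending {job.channel} to {job.user_id}")

    jobs = deque([Job("u1", "email", send_at=10.0)])
    retries, results = deque(), []
    process(jobs, retries, results, send, now=20.0)
    print(results)
```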
Interview Brain Teasers
Question #1:
In both the notification processing workflow and the notification scheduling workflow, a worker sends messages to the external service that delivers notifications to clients. What is the difference between these two workers?
Answer:
Let's clarify the roles and responsibilities of the Notification Worker and Scheduling Worker, as they serve different purposes despite both potentially interacting with external notification channels.
Scheduling Worker is primarily responsible for managing scheduled notification tasks. It ensures that a notification is sent at a particular time based on a predefined schedule (delayed, recurring, one-time, etc.).
Example:
- A user schedules a notification to be sent at 8:00 AM the next day. The Scheduling Worker polls the schedule queue and at 8:00 AM picks up the job, processes it, and sends it to the external email/SMS/Push notification service.
Primary Role:
The Scheduling Worker focuses on delayed or time-based notifications. It ensures notifications are sent according to a specific schedule and operates as part of an asynchronous flow that triggers notification delivery at the right time.
Notification Worker is responsible for real-time processing of notifications. It doesn't care about schedules but rather reacts to notifications that need to be sent immediately or triggered by real-time events.
Example:
- A user performs a transaction, and an instant notification needs to be sent. The Notification Worker listens for this event in real time, processes it, and sends the notification without any delay.
Primary Role:
The Notification Worker focuses on real-time or event-driven notifications. It is designed to handle immediate, non-scheduled notifications that are triggered by system events or user actions and need to be sent out as soon as possible.
Question #2:
How can prioritized notifications be handled, and what changes would that require in the design and components?
Answer:
Handling prioritized notifications introduces an extra layer of complexity in the notification system, as some notifications need to be processed with higher urgency compared to others. This can affect the way tasks are handled within queues, workers, and retry logic. Let's break down how the design and components would need to change to support prioritized notifications:
Changes in Design and Components
1. Queue Design — Multiple Priority Queues
One of the key changes would be in the queue design. Instead of a single queue for all notifications, the system would have multiple priority queues to handle different priority levels (e.g., high, medium, low).
- High-Priority Queue: This queue would handle critical notifications (e.g., security alerts, transactional notifications, urgent updates).
- Medium-Priority Queue: This queue could handle less urgent but still time-sensitive notifications (e.g., event reminders).
- Low-Priority Queue: This queue would handle non-urgent notifications (e.g., promotional emails or general announcements).
Changes:
- Separate queues for each priority, e.g., HighPriorityQueue, MediumPriorityQueue, LowPriorityQueue.
- Workers would need to know which queue to poll based on the priority level.
2. Priority-Based Workers
Workers can be designed to:
- Poll specific priority queues: You may have workers dedicated to high, medium, or low-priority tasks. These workers would prioritize their queues and handle notifications based on the queue they are assigned to.
- Dynamic Queue Polling: Some workers can poll from multiple queues with a preference for high-priority tasks. If the high-priority queue is empty, they could then poll the medium or low-priority queues.
Changes:
- High-Priority Workers: Focus on high-priority queues.
- General Workers: Dynamically poll queues starting with the highest-priority queue first.
- Queue Selection Logic: Workers could implement a queue-polling policy (e.g., always poll from high-priority queue, then fallback to medium and low if empty).
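The queue-polling policy for a general worker might be sketched as follows. This assumes one in-memory deque per priority level, and `poll_next` is a hypothetical helper name.

```python
from collections import deque

# Order in which a general worker checks the queues.
PRIORITY_ORDER = ("high", "medium", "low")


def poll_next(queues: dict):
    """Return (level, job) from the highest-priority non-empty queue.

    Returns None when every queue is empty.
    """
    for level in PRIORITY_ORDER:
        if queues[level]:
            return level, queues[level].popleft()
    return None


if __name__ == "__main__":
    queues = {
        "high": deque(["security alert"]),
        "medium": deque(["event reminder"]),
        "low": deque(["promotional email"]),
    }
    while (item := poll_next(queues)) is not None:
        print(item)
```

Strict priority polling like this can starve the low-priority queue under sustained high-priority load, which is one reason to keep some workers dedicated to the lower levels.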
3. Scheduler Changes for Prioritized Notifications
The scheduler responsible for scheduling notifications would need to take priority into account. This could include:
- Priority Scheduling: Notifications with higher priority should be scheduled first.
- Queue Placement Logic: The scheduler must place the notification request into the appropriate queue based on the priority assigned during task creation.
Changes:
- Modify scheduler logic to assign priority to notifications and place them in the appropriate queue.
- The scheduler would need to have awareness of different queues (e.g., HighPriorityQueue, LowPriorityQueue).
4. Retry Service and Retry Worker Adjustments
In the context of retrying failed notifications:
- Separate Retry Queues by Priority: Just like the main processing queues, there would need to be separate retry queues for each priority level.
- Retry Based on Priority: High-priority retries should take precedence. The retry worker would prioritize retrying tasks from the high-priority retry queue before moving on to medium or low-priority retry queues.
Changes:
- Multiple retry queues: High-priority retry queue, medium-priority retry queue, low-priority retry queue.
- Priority-based retry workers: Retry workers should retry high-priority notifications before retrying lower-priority ones.
5. Changes in Notification Service:
The Notification Service would be responsible for:
- Assigning Priority: When a notification request is made, the notification service would need to determine the priority of the notification based on predefined rules or user preferences.
- Queue Assignment: The service would place the notification request in the appropriate priority queue (e.g., high, medium, or low).
Changes:
- Priority Assignment: Notifications could be prioritized based on their type (e.g., critical notifications get higher priority), user preferences, or external rules.
- Queue Assignment Logic: The service must assign tasks to the appropriate priority queue after determining the notification's priority.
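A rule-based version of priority assignment and queue routing could look like the sketch below. The rule table, the notification type names, and the default of low priority are assumptions chosen for illustration. The queue names follow the ones used in this article.

```python
# Hypothetical mapping from notification type to priority level.
PRIORITY_RULES = {
    "security_alert": "high",
    "transaction": "high",
    "event_reminder": "medium",
    "promotion": "low",
}


def assign_priority(notification_type: str, default: str = "low") -> str:
    """Look up the priority for a notification type, falling back to a default."""
    return PRIORITY_RULES.get(notification_type, default)


def priority_queue_name(priority: str) -> str:
    """Build the queue name for a priority level, e.g. 'HighPriorityQueue'."""
    return f"{priority.capitalize()}PriorityQueue"


def route(notification_type: str) -> str:
    """Return the name of the queue a notification should be placed in."""
    return priority_queue_name(assign_priority(notification_type))


if __name__ == "__main__":
    for kind in ("security_alert", "event_reminder", "newsletter"):
        print(kind, "->", route(kind))
```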
Summary of Changes for Prioritized Notifications:
- Multiple priority queues (e.g., HighPriorityQueue, MediumPriorityQueue, LowPriorityQueue).
- Workers prioritize processing high-priority notifications first.
- Scheduler assigns notifications to different priority queues.
- Retry Service and Retry Workers handle retry attempts based on priority.
- Notification Service assigns priorities to notification requests and routes them accordingly.
- Analytics and Reporting Service tracks metrics based on priority to ensure SLAs for critical notifications are being met.
This design ensures high-priority notifications are delivered quickly and efficiently, while lower-priority notifications are still processed without blocking critical ones.
Question #3:
But the scheduling design already has three different queues (immediate, recurring, and delayed). Where does the high-priority queue come into the picture?
Answer:
In the existing design with three different queues for scheduling (Immediate, Recurring, and Delayed), incorporating priority levels introduces additional complexity. To manage both scheduling and prioritization effectively, you can adopt a hybrid approach where priority queues are layered on top of the scheduling queues. This ensures that both time-sensitive and priority-sensitive requirements are addressed without bottlenecks.
Here's how this can be done:
Layering Priorities on Scheduling Queues
Each of the scheduling categories — Immediate, Recurring, and Delayed — can be divided into priority sub-queues. Instead of having only three main queues, you would have multiple queues for each combination of scheduling type and priority. For instance:
- Immediate Notifications: ImmediateHighPriorityQueue, ImmediateMediumPriorityQueue, ImmediateLowPriorityQueue
- Recurring Notifications: RecurringHighPriorityQueue, RecurringMediumPriorityQueue, RecurringLowPriorityQueue
- Delayed Notifications: DelayedHighPriorityQueue, DelayedMediumPriorityQueue, DelayedLowPriorityQueue
Each of these priority-based sub-queues can then be processed accordingly, ensuring that:
- High-priority notifications are processed first within each scheduling category.
- Low-priority notifications are processed when higher-priority ones are exhausted, but still within their appropriate scheduling window (e.g., delayed or recurring).
Queue Interaction and Scheduling Adjustments
1. Immediate Queue with Priority:
- High-priority tasks that require immediate delivery will be sent to the ImmediateHighPriorityQueue.
- The worker assigned to handle immediate tasks will first check the high-priority queue. If the queue is empty, it will move to the medium and then low-priority queues.
2. Recurring Queue with Priority:
- Recurring tasks with high priority will be sent to RecurringHighPriorityQueue.
- When the recurring task's execution time comes, workers will prioritize the high-priority tasks first, then medium, and then low-priority.
3. Delayed Queue with Priority:
- For delayed notifications, once the delay period is over, workers will begin by processing tasks in DelayedHighPriorityQueue, followed by medium and low-priority ones.
In this way, notifications that are both time-sensitive and high-priority will receive the necessary attention without competing with low-priority notifications that also happen to fall under the same scheduling category.
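The layered naming scheme can be generated rather than hard-coded. The sketch below assumes the three scheduling types and three priority levels named in this answer. The helper names `layered_queue_name` and `polling_order` are hypothetical.

```python
SCHEDULE_TYPES = ("Immediate", "Recurring", "Delayed")
PRIORITIES = ("High", "Medium", "Low")  # listed from most to least urgent


def layered_queue_name(schedule_type: str, priority: str) -> str:
    """Combine scheduling type and priority into a single queue name."""
    if schedule_type not in SCHEDULE_TYPES:
        raise ValueError(f"unknown schedule type: {schedule_type}")
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    return f"{schedule_type}{priority}PriorityQueue"


def polling_order(schedule_type: str) -> list:
    """Queues a worker for one scheduling type checks, most urgent first."""
    return [layered_queue_name(schedule_type, p) for p in PRIORITIES]


if __name__ == "__main__":
    for schedule_type in SCHEDULE_TYPES:
        print(polling_order(schedule_type))
```

With three scheduling types and three priority levels this yields nine queues, so the operational cost grows with every level added.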
In the next article, we will cover more workflows in depth, more brain teasers, scaling and load balancing across services and workers, multi-region and geo-distributed architecture, and more.
Happy System Designing !!!!!