Roadmap is growing: more data is coming in and more travelers are becoming dependent on our services, which is great! However, everything comes at a price. The huge amount of data that Roadmap currently processes will keep increasing over time, so it is essential to have a stable platform in place that can handle and support this growth.
To achieve this goal, we took a closer look at our backend processes: how the current flow handles the different types of data we process, and which bottlenecks we face when communicating with travelers.
We learned that we need to change a big part of our system to make it stable, scalable and highly available. Our main focus was on two core processes: Timeline and Notifications.
In this blog post I will dive into the migration of Notifications: how it worked in the past, what kind of issues we encountered, and how we gained control over these issues in the current process, bearing in mind that we want to extend the process as efficiently as possible in the (near) future.
How Roadmap notifies the traveler
In order to understand the issues that we were facing, it is important to first understand the flow of the old version of this process.
Roadmap has several processes that create a request to send a notification to the traveler [img1]. One example is booking information that we have received from the traveler’s Travel Management Company. The notification contains a web link that gives the traveler access to the mobile website with his/her personal trip. Another example is a notification that updates the traveler about the status of a specific flight, which might be delayed or even cancelled.
Img 1: Notification flow, December 2015
Since the steps to send out a notification to the traveler are the same for every notification type, a dedicated process has been built for this.
Roadmap uses messaging and queueing to communicate between the different processes in the architecture (also known as a service-oriented architecture, SOA). This means that any process can ‘request’ to send out a notification to the traveler. This request is dropped in a ‘notification-queue’. The process that handles the send-out of notifications picks up the request from the queue and executes it until a notification has been sent to the traveler.
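To make the request/consume pattern concrete, here is a minimal sketch. It uses Python's in-memory `queue.Queue` as a stand-in for a real message broker, and the names `request_notification`, `consume_pending` and the request fields are assumptions made for illustration, not Roadmap's actual code.

```python
import queue

# Hypothetical in-memory stand-in for the 'notification-queue'.
notification_queue = queue.Queue()


def request_notification(traveler_id, notification_type, payload):
    """Called by any process: drop a request on the queue, do not send anything."""
    notification_queue.put(
        {"traveler_id": traveler_id, "type": notification_type, "payload": payload}
    )


def consume_pending(send):
    """The send-out process: handle requests in arrival order until the queue is empty."""
    handled = 0
    while True:
        try:
            request = notification_queue.get_nowait()
        except queue.Empty:
            return handled
        send(request)  # 'send' is a placeholder for the real delivery logic
        handled += 1
```

The requesting side never talks to the send-out process directly. It only knows the queue, which is what keeps the two sides independently deployable.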
The advantage of these separated processes is that our deployment model is isolated per process. If we need to change any part of the logic that affects the send-out process, we can adjust and deploy it without touching, redeploying or breaking any other process. In addition, if the process is unavailable because of a deployment, the requests to send out a notification pile up in the queue. As soon as the process is successfully deployed, it again consumes and executes the requests from the queue until the queue is empty. This means no request to send a notification gets lost, which allows us to release whenever a deployment is needed. This makes Roadmap faster and more flexible.
But what if something goes wrong with or within the process? This might happen because of a broken database connection, a network disruption or simply a failed deployment that takes the process offline. In that case no requests are consumed and therefore no notifications are sent out. The requests pile up in the queue, and because several processes keep requesting notifications, the queue fills up rapidly [img2].
Img 2: Notification flow, December 2015, queue is piling up.
To add even more pressure, timing is one of the most important elements of our notifications. Travelers who want to use Roadmap for the first time are asked to register first. A notification with an activation code is sent to verify the identity of the traveler, and we obviously do not want to let the traveler wait for hours before he/she can identify him-/herself. Another example is when we need to inform the traveler about a flight cancellation. If we cannot deliver this message on time, the message becomes useless and the traveler will lose trust in the app and the service that Roadmap offers. Below you can see what this worst-case scenario looks like [img3].
Img 3: Notification flow, December 2015, flight cancelled request coming in.
The image above shows that request no. 10,001 in the queue is the flight cancelled request. Once we fix and deploy the notification service on the Roadmap platform, it will start consuming requests from the queue again. To get to the flight cancelled request, the 10,000 requests queued ahead of it need to be handled first, which can take a lot of time. In fact, we might already be too late to send this message, because the process that handles these requests was offline. In an even worse scenario, the process is down for 5 hours or more, which results in a delay of at least 5 hours.
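The delay in this scenario can be estimated with simple arithmetic. The sketch below is illustrative only: the function name and the throughput figure are assumptions, since the article does not state how many requests the consumer handles per second.

```python
def seconds_until_handled(position_in_queue, requests_per_second, offline_seconds=0.0):
    """Estimate how long a request waits before the consumer reaches it.

    position_in_queue: 1-based position (10_001 means 10,000 requests are ahead).
    requests_per_second: assumed throughput of the consumer once it is back online.
    offline_seconds: time the request already spent waiting while the consumer was down.
    """
    requests_ahead = position_in_queue - 1
    return offline_seconds + requests_ahead / requests_per_second
```

With an assumed throughput of 5 requests per second, the 10,000 requests ahead of the flight cancelled request add 2,000 seconds (roughly 33 minutes) on top of whatever time the request already spent waiting while the process was offline.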
Img 4: Notification flow, December 2015, the queue is empty again.
Although all requests have been processed once the queue is back to 0, the (original) departure date of the flight may already be in the past by the time we handle a specific request. In that case we do not send the notification, as the information no longer has any value for the traveler. For the flight cancelled scenario, this means the notification might never reach the traveler.
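The relevance check described here could look roughly like the following. This is a sketch under assumptions: the field name `original_departure` and both function names are hypothetical, and the real check may take more into account than the departure date.

```python
from datetime import datetime


def is_still_relevant(request: dict, now: datetime) -> bool:
    """A flight notification only has value while the original departure is still ahead."""
    return request["original_departure"] > now


def drop_expired(requests: list, now: datetime) -> list:
    """Keep only the requests that are still worth sending."""
    return [r for r in requests if is_still_relevant(r, now)]
```

Note that the check can only run once the consumer reaches the request, so it prevents useless notifications but does nothing to deliver the important ones sooner.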
How can we gain control over this scenario? Investigating it teaches us that there are two major problems in the process. The first is the queue. The more requests come in while the process is offline, the bigger the delay for the notifications that need to be sent out. There is no way to prioritize these requests, because you need to iterate over each request to discover whether the notification is still important. The second is the process itself. If it goes offline, no notifications (of any type) are sent, which causes the issue we started with. This ‘single point of failure’ needs to be fixed.
Therefore, we decided to split up the notifications process into separate processes. Each consumer handles its own type of notification and runs completely isolated. In our scenario, the flight cancelled notification is one of the most important ones to send out (next to the activation token), and it is no longer blocked by other types of requests, as these are processed by their own consumers. Also, a process going offline now only affects a specific group of notifications, which narrows the scope when investigating why it went offline [img5].
Img 5: Notification flow, December 2016, separated processes
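The effect of giving each notification type its own queue and consumer can be modelled in a few lines. This is an in-memory sketch, not the actual setup: the class `PerTypeQueues` and the type names are assumptions, and in production each consumer would be a separate process reading from a broker.

```python
from collections import defaultdict, deque


class PerTypeQueues:
    """Hypothetical in-memory model: one queue per notification type."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def publish(self, request):
        # Route the request to the queue of its own type.
        self.queues[request["type"]].append(request)

    def consume(self, notification_type, send):
        """Consumer for a single type; it never touches the other queues."""
        pending = self.queues[notification_type]
        handled = 0
        while pending:
            send(pending.popleft())
            handled += 1
        return handled
```

If 10,000 booking requests are waiting in their own queue, the consumer for flight cancellations still reaches its request immediately, because it only ever looks at its own queue.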
We chose extending over changing…
There is one small catch though. The way a notification is actually sent to the traveler always follows the same logic for each type of notification, as defined by the merchant. For instance, a merchant wants to notify the traveler first via push notification. If this fails (or the traveler has not enabled push notifications in the app), the merchant wants to notify the traveler by email. If this also fails, the merchant can decide to try the last channel, which in this case is a text message.
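The channel fallback described above can be sketched as a simple loop over the merchant's preferred order. The function name and the shape of the `channels` argument are assumptions for illustration; each channel is represented by a callable that returns True when delivery succeeded.

```python
def send_with_fallback(notification, channels):
    """Try the channels in the merchant's preferred order.

    channels: list of (name, send) pairs; send(notification) returns True on success.
    Returns the name of the first channel that succeeds, or None if all fail.
    """
    for name, send in channels:
        try:
            if send(notification):
                return name
        except Exception:
            # A crashing channel counts as a failed attempt; move on to the next one.
            continue
    return None
```

A merchant with the order push, email, text message would pass the three channels in that sequence. A traveler without push notifications enabled is simply a push channel that returns False.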
We decided to build this logic in a separate package and include it in each process. This way we have to maintain only one codebase that defines ‘how to send the notification’, while the content of the notification itself, together with its type, is maintained separately. This approach makes it easy to introduce and send new types of notifications. And when a new communication channel is introduced, we can simply update the codebase and update the package in each separate process.
Working like this makes maintaining, updating or extending the code much more efficient compared to the previous approach, and the result can be easily integrated in the Roadmap platform.
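The split between the shared ‘how to send’ package and the per-type processes might be structured along these lines. Everything here is a hypothetical sketch: `deliver`, `make_consumer` and the `render` callback are invented names, and the real package boundaries may differ.

```python
# --- shared package (hypothetical): the single place that knows HOW to send ---
def deliver(message, channels):
    """Return the name of the first channel that accepts the message, else None."""
    for name, send in channels:
        if send(message):
            return name
    return None


# --- per-type process: only knows WHAT to send ---
def make_consumer(notification_type, render, channels):
    """Build the handler for one notification type on top of the shared deliver()."""
    def handle(request):
        if request["type"] != notification_type:
            raise ValueError("request routed to the wrong consumer")
        return deliver(render(request), channels)
    return handle
```

Adding a new notification type then means writing a new `render` function and starting a new consumer, while adding a new channel only touches the shared package.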
Icons in this article come from: https://thenounproject.com/monstercritic