Our Check Availability (CA) service is responsible for determining whether a requested domain name is available for purchase under the given Top Level Domains. The scalability and availability of this service are critical for our EIG brands such as BigRock, HostGator and BlueHost.
We will go through some of the key architectural decisions behind our CA service, as well as the general approach to building applications that are scalable and highly available across multiple data centers.
In a monolithic architecture, one misbehaving component can bring down the entire system. With the microservices approach, an issue in one service impacts only that service, and the other services continue to work. Other benefits of a microservices-based architecture include (a) polyglot programming and persistence, (b) independent development and deployment, (c) decentralized continuous delivery.
For the reasons mentioned above, we built Check Availability as a microservice with the following tech stack: (a) Cassandra database, (b) Jersey RESTful services, (c) Spring dependency injection.
Choosing the appropriate data store
A complex enterprise application uses different kinds of data, and we can apply different persistence technologies depending on how the data is used. This is referred to as Polyglot Persistence. As a rule of thumb, relational databases suit transactional data, and an appropriate NoSQL database can be chosen for non-transactional data.
Horizontal scaling, or scale-out, is the ability to increase the capacity of a system by adding more nodes. It is harder to achieve with relational databases because of their design (the ACID model). Most NoSQL databases are cluster-friendly because they are designed around the BASE (Basically Available, Soft State, Eventual Consistency) model. Graph databases are typically an exception, as they use the ACID model.
The Check Availability service works with non-transactional data, so we wanted to use an appropriate NoSQL database rather than our PostgreSQL database. This also reduces the heavy traffic from the CA service to our transactional database.
We evaluated Redis, a key-value NoSQL database that favors consistency. A Redis cluster uses a master-slave model, which can cause downtime when a master node becomes unavailable, because there is some delay in electing one of its slaves as the new master. We also evaluated Cassandra, a column-family NoSQL database that favors availability. A Cassandra cluster uses a masterless model, which makes it massively scalable.
For our CA service, availability matters more than consistency, so we decided to go with a Cassandra cluster. The majority of the traffic to our Cassandra database consists of read requests. We have set up a Cassandra cluster of 3 nodes with (a) Replication Factor 2, (b) Write Consistency Level LOCAL_QUORUM, (c) Read Consistency Level ONE. With this configuration every row is stored on two of the three nodes, so all read requests can still be handled when any one node is down, because at least one replica of each row remains reachable.
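The effect of these settings on availability can be worked through with a small model. The sketch below is written in Python purely for illustration (it is not part of the CA service) and assumes a simplified ring in which each node owns one token range and the replicas of a range sit on consecutive nodes; the node names and helper functions are hypothetical. It uses the standard quorum rule, quorum = floor(RF / 2) + 1, which gives 2 for a Replication Factor of 2.

```python
NODES = ["node1", "node2", "node3"]  # hypothetical node names
REPLICATION_FACTOR = 2


def quorum(rf):
    """Number of replicas that must respond for a QUORUM-level request."""
    return rf // 2 + 1


def replica_sets(nodes, rf):
    """Replica placement on a simplified ring: one range per node,
    each range copied to rf consecutive nodes."""
    n = len(nodes)
    return [{nodes[(i + k) % n] for k in range(rf)} for i in range(n)]


def fraction_available(down, required):
    """Fraction of token ranges that still have `required` live replicas."""
    down = set(down)
    ranges = replica_sets(NODES, REPLICATION_FACTOR)
    served = sum(len(replicas - down) >= required for replicas in ranges)
    return served / len(ranges)


if __name__ == "__main__":
    read_level = 1                               # Read Consistency Level ONE
    write_level = quorum(REPLICATION_FACTOR)     # LOCAL_QUORUM within one DC
    print("reads, one node down:  ", fraction_available(["node1"], read_level))
    print("reads, two nodes down: ", fraction_available(["node1", "node2"], read_level))
    print("writes, one node down: ", fraction_available(["node1"], write_level))
```

Under these assumptions, reads at ONE stay fully available with one node down, while writes at LOCAL_QUORUM need both replicas of a row and are therefore more sensitive to a node failure. That trade-off fits a workload that is dominated by reads.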
Active-Active setup within a Data Center (DC)
We have our CA service set up in multiple DCs. Within a DC, we have 2 CA web nodes in an active-active setup behind an HAProxy load balancer. All of the CA web nodes connect to the Cassandra cluster within the same DC.
To deploy a newer version of the CA service, we repeat the following steps for each web node, one at a time: (a) remove the web node from the load balancer, (b) deploy the latest version of the CA service, (c) add it back to the load balancer. This way, the service is deployed with zero downtime.
As traffic to the CA web nodes or the Cassandra cluster increases, we can easily scale out by adding more nodes.
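The rolling deployment steps can be sketched as a small loop. This is an illustrative Python model only: `LoadBalancerPool` is a hypothetical in-memory stand-in for the HAProxy backend, and `deploy` is a placeholder for whatever actually installs the new build on a node. The one safeguard added here, as an assumption, is refusing to take out the last active node.

```python
class LoadBalancerPool:
    """Hypothetical stand-in for the set of web nodes receiving traffic."""

    def __init__(self, nodes):
        self.active = list(nodes)

    def remove(self, node):
        self.active.remove(node)

    def add(self, node):
        self.active.append(node)


def rolling_deploy(pool, nodes, deploy, version):
    """Upgrade nodes one at a time so at least one node always serves traffic."""
    for node in nodes:
        if len(pool.active) <= 1:
            raise RuntimeError("refusing to remove the last active node")
        pool.remove(node)        # (a) take the node out of the load balancer
        deploy(node, version)    # (b) deploy the latest version on it
        pool.add(node)           # (c) put it back into rotation


if __name__ == "__main__":
    installed = {}
    pool = LoadBalancerPool(["web1", "web2"])
    rolling_deploy(pool, ["web1", "web2"],
                   lambda node, version: installed.update({node: version}),
                   "2.0")
    print(installed, pool.active)
```

In a real setup, step (a) would usually also wait for in-flight requests on the node to finish before deploying.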
Active-Active setup across multiple data centers
To achieve zero downtime for an application or service even when there is a disaster within a DC, we can run it in an active-active setup across DCs. Round-robin DNS, Cloudflare Traffic Manager, etc. can be used to manage the traffic to the web nodes across the DCs.
At the time of writing, we are in the process of completing the active-active setup of our CA service across two of our DCs in the US.
Even though a Cassandra cluster supports cross-DC replication, we decided to eliminate cross-DC dependencies as much as possible. Hence, one DC going down does not have any impact on the other DCs.
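To make the traffic distribution idea concrete, here is a minimal Python sketch of round-robin selection across DCs that skips a DC marked as down. The class, its methods and the endpoint names are hypothetical. Note that plain round-robin DNS does not check health by itself; the health tracking shown here is an assumption that represents what a traffic manager or health-checked DNS adds on top.

```python
from itertools import cycle


class RoundRobinRouter:
    """Rotate requests across data centers, skipping unhealthy ones."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.healthy = set(self.endpoints)
        self._ring = cycle(self.endpoints)

    def mark_down(self, endpoint):
        self.healthy.discard(endpoint)

    def mark_up(self, endpoint):
        if endpoint in self.endpoints:
            self.healthy.add(endpoint)

    def next_endpoint(self):
        # Try each endpoint at most once per call before giving up.
        for _ in range(len(self.endpoints)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy data center available")


if __name__ == "__main__":
    router = RoundRobinRouter(["dc1", "dc2"])  # hypothetical DC names
    print([router.next_endpoint() for _ in range(4)])
    router.mark_down("dc1")                    # simulate a DC outage
    print([router.next_endpoint() for _ in range(3)])
```

With both DCs healthy the traffic alternates between them; once one is marked down, every request goes to the remaining DC.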
It is very important to proactively monitor application/service availability, correctness and performance along with hardware health. This helps reduce downtime and improve the customer experience. We have automated tests scheduled to run periodically in our production environment to check the health of our applications/services. We also monitor the application logs to identify critical errors/issues and send alerts to the relevant teams in near real time.
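A periodic production check of this kind can cover availability, correctness and performance in one probe. The following Python sketch is illustrative only: `run_health_check` and its parameters are hypothetical, and `check_availability` stands for any callable that queries the service for a domain whose expected answer is known in advance.

```python
import time


def run_health_check(check_availability, domain, expected, max_seconds):
    """Probe the service once and classify the outcome."""
    started = time.monotonic()
    try:
        actual = check_availability(domain)
    except Exception as exc:  # any failure to answer counts as unavailable
        return {"ok": False, "reason": f"unavailable: {exc}"}
    elapsed = time.monotonic() - started

    if actual != expected:
        return {"ok": False,
                "reason": f"incorrect result: expected {expected}, got {actual}"}
    if elapsed > max_seconds:
        return {"ok": False, "reason": f"slow response: {elapsed:.3f}s"}
    return {"ok": True, "reason": "healthy"}


if __name__ == "__main__":
    # Stubbed service call; a real probe would issue an HTTP request instead.
    result = run_health_check(lambda domain: False, "example.com",
                              expected=False, max_seconds=1.0)
    print(result)
```

A scheduler would run such a probe at a fixed interval and raise an alert whenever `ok` is false, using `reason` to route the alert to the relevant team.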
We have looked at some of the key architectural decisions for building scalable and highly available applications (like our Check Availability service) across multiple DCs. Happy learning!