[Photo: Bram working on the real-time communication feature]

Creating a real-time communication channel for a legacy API

El Niño


Earlier this year we were faced with an exciting technical challenge on one of our Webguru projects. The project in question is a series of online sex-chat websites (multiple instances) and is among our longest-running projects. The platform offers 1-on-1 instant messaging and was originally implemented using polling (so not quite real-time). To give a rough idea of the load: at peak times there are around 1k concurrent active users, which means polling hits the server with 1k requests at whatever interval we choose. A lot has changed since the project's inception, but we'll keep it to the relevant parts.

Background

The application was first implemented in PHP, with the backend performing server-side rendering. Throughout the codebase we optimized in various ways to lower the CPU load, but only two of those optimizations matter here.

  • The first optimization is heavy use of Redis to cache frequently performed database queries: it is faster to stick the data in Redis and fetch it from there than to have the database run a full query (a cache-aside pattern, sketched below).
  • The second is the caching of certain views that get requested often; the user listings in particular were rendered periodically by a cron job.
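To illustrate the first optimization: the real code lives in the PHP backend, but a minimal cache-aside sketch in TypeScript (written against node-redis, which the real-time service described later also uses) could look roughly like this. The key name and TTL are made up for the example.

```ts
import { createClient } from "redis";

// Cache-aside sketch: check Redis first, only hit the database on a miss.
// Connection URL, key names and TTL are placeholders, not our real setup.
const redis = createClient({ url: "redis://localhost:6379" });
await redis.connect();

async function cachedQuery<T>(
  key: string,
  ttlSeconds: number,
  runQuery: () => Promise<T>
): Promise<T> {
  const hit = await redis.get(key);
  if (hit !== null) {
    return JSON.parse(hit) as T; // cache hit: skip the database entirely
  }
  const result = await runQuery(); // cache miss: run the real query
  await redis.set(key, JSON.stringify(result), { EX: ttlSeconds }); // store for next time
  return result;
}
```

A hypothetical call would then be something like cachedQuery("user-listing:online", 30, runListingQuery), trading a little staleness for a much cheaper read path.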

A while ago we started developing a separate frontend using Vue.js, which ended up paving the way for the real-time communication channel. The new frontend already allowed us to offload the server-side rendering to the client's browser, and in the process we got rid of the user listings view cache.

However, as the web matures, user expectations grow along with it. Many services and apps include real-time chat functionality, and our client became increasingly interested in upgrading the value they bring to their users. Beyond the value to users, we also suspected that if we took the time to properly design and integrate the solution into the system, we could make a serious improvement to performance as well.

To put this metaphorically: imagine a mail service, a weird one that doesn't do any deliveries. Instead, you head to the post office every day and ask whether any mail has arrived for your address. Whatever they are holding at that point is handed over for you to take home. The metaphor has to start out this impractical to work, but in hindsight that may also be telling of the polling method.

Websockets

What we realized early on was that the polling method was outdated and performed a lot of business logic, which made the operation difficult to optimize. Because of that business logic we couldn't simply replace the polling method with a real-time connection; the existing behavior would need to be replicated as well.

On the web, a real-time connection is directly supported by the browser through the WebSocket API. This technology is widely available in modern browsers but can be tricky to work with (especially server-side). Fortunately there are libraries that simplify the use of websockets and take care of all this. We ended up using a library called socket.io, which lets you build out a real-time messaging network. The great thing about socket.io is its resilience: it supports a range of transport methods you can pick from. This means that if a user's browser somehow doesn't support websockets, your app will still work because socket.io falls back to a polling transport.
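To make that concrete, a minimal Socket.IO server that accepts both transports could look like the sketch below. The port and CORS settings are placeholders, not values from our setup.

```ts
import { Server } from "socket.io";

// Minimal Socket.IO server: it accepts both WebSocket and HTTP long-polling
// transports, so clients without WebSocket support can still connect.
const io = new Server(3000, {
  transports: ["websocket", "polling"],
  cors: { origin: "*" }, // placeholder; restrict this in production
});

io.on("connection", (socket) => {
  // Socket.IO exposes which transport the client ended up on.
  console.log(`client connected via ${socket.conn.transport.name}`);
});
```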

To bring the metaphorical mail service back in: we realized we'd be better off having mailmen drive around and deliver the mail (app updates) directly to our users' doorsteps. This saves a lot of back-and-forth trips (the polling method), which reduces traffic on the road (the internet) and ensures the mail is delivered as soon as the post office receives it.

Architecture

Armed with websocket technology, we set out to update our architecture to support real-time communication. The change seemed pretty straightforward: add a NodeJS service, connect the web app to it, and then somehow trigger broadcasts of system updates to the users. Authentication for the real-time service was easy, as we were already using JWT tokens for authentication. The tokens were self-signed, and we added claims for the real-time service to use in some of its role-based business logic.
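As an illustration, a Socket.IO middleware on the NodeJS service might verify the token during the handshake roughly as sketched below. The secret, claim names and error messages are assumptions for the sketch, not our actual implementation.

```ts
import { Server } from "socket.io";
import jwt from "jsonwebtoken";

const io = new Server(3000);

// Placeholder secret; in practice the real-time service would verify tokens
// with the same key material the backend signs them with.
const JWT_SECRET = process.env.JWT_SECRET ?? "change-me";

// Socket.IO middleware: verify the token the web app sends in the handshake
// before the connection is accepted.
io.use((socket, next) => {
  const token = socket.handshake.auth?.token as string | undefined;
  if (!token) return next(new Error("missing token"));
  try {
    // Claim names (sub, role) are assumptions for this sketch.
    const claims = jwt.verify(token, JWT_SECRET) as { sub: string; role?: string };
    socket.data.userId = claims.sub;
    socket.data.role = claims.role; // available later for role-based logic
    next();
  } catch {
    next(new Error("invalid token"));
  }
});
```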

The next challenge was how to initiate the updates and broadcasts. We wanted to avoid a complicated internal HTTP API, or using websockets from PHP, as that would likely block the execution of other business logic. A closer look at our Redis setup revealed an opportunity: Redis has built-in support for real-time communication through pub/sub messaging, and most of our APIs were already using Redis somewhere along the way, so the connection was already established. We therefore connected the real-time service to the Redis instance (using node-redis) and subscribed to a single topic to run communication over. Between the backend and the real-time service we use a simple JSON-based messaging protocol, with an action field to indicate the packet type and a data field that can hold an arbitrary JSON payload.
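The subscribing side could look roughly like the sketch below, assuming node-redis v4 and Socket.IO; the channel name is made up for the example, and the PHP backend would publish JSON strings of the same shape to that channel.

```ts
import { createClient } from "redis";
import { Server } from "socket.io";

// Channel name and event handling are illustrative; the setup described above
// only specifies a single topic and a JSON protocol with action/data fields.
const CHANNEL = "realtime-events";

interface BackendMessage {
  action: string; // packet type, e.g. "chat_message"
  data: unknown;  // arbitrary JSON payload
}

const io = new Server(3000);

// node-redis v4: use a dedicated connection for subscribing.
const subscriber = createClient({ url: "redis://localhost:6379" });
await subscriber.connect();

// Every message the backend publishes on the topic is relayed to the
// connected websocket clients.
await subscriber.subscribe(CHANNEL, (raw) => {
  const message = JSON.parse(raw) as BackendMessage;
  io.emit(message.action, message.data); // broadcast; real code would target specific rooms/users
});
```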

From here on we were able to refactor the frontend to use the websocket API and update our Vuex-managed state with the messages received from the real-time service. The business logic that was built into the polling method was split off into other parts of the codebase, such as cron tasks and other APIs that are still used regularly.
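On the frontend, the wiring could look roughly like this sketch; the store shape, event name and mutation are assumptions for the example (shown with Vuex 4 and socket.io-client).

```ts
import { io } from "socket.io-client";
import { createStore } from "vuex";

// Store shape, event name and mutation are placeholders for this sketch.
const store = createStore({
  state: () => ({ messages: [] as Array<{ from: string; body: string }> }),
  mutations: {
    addMessage(state, message: { from: string; body: string }) {
      state.messages.push(message);
    },
  },
});

// Connect to the real-time service, passing along the JWT the app already has.
const socket = io("https://realtime.example.com", {
  auth: { token: localStorage.getItem("jwt") },
});

// Each incoming packet updates the Vuex-managed state, which the components
// then pick up reactively.
socket.on("chat_message", (data: { from: string; body: string }) => {
  store.commit("addMessage", data);
});
```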

Metaphorically: the architecture added a sorting center (the NodeJS real-time service) to the warehouse, which dispatches the mailmen on their routes (websockets). The office informs the sorting center of new packages through the walkie-talkies (Redis) it already used to shortcut some of the most tedious warehouse tasks.

Results

  • First and foremost, the real-time communication is a lot faster than the polling method, where in some cases users would have to wait tens of seconds before their message was read. With the new functionality, messages are delivered to their target as fast as possible.
  • Better user experience. The real-time communication is a huge improvement for users as well: seeing read receipts fly through the system and getting those blue double checkmarks much quicker was well received.
  • Last but not least, the new functionality resulted in a 50% reduction in CPU and disk load, so our goal of improving performance was achieved as well.

[Image: A look at the backend]
