Handling tens of thousands of Docker containers daily

In SourceLair, we are committed to providing the best user experience possible, while ensuring maximum security for our users. To achieve this goal we rely on Docker, an excellent piece of software that allows us to run isolated processes on behalf of our users. What happens, though, when we have to handle tens of thousands of containers in a single day?

Identifying the problem

In SourceLair, almost every action a user takes spawns a Docker container; from running their programs to getting real-time error reporting for their Python code. Pretty soon our servers started getting overloaded at certain times because of heavy usage of SourceLair. Multiple users editing multiple files at the same time led to hundreds of Docker containers being created almost simultaneously: to render the project's Git status in the project's file explorer, provide real-time error reporting on the current Python code, handle the users' terminals, and so on. The really challenging part was handling the spikes of container spawning, which could reach several hundred containers spawned in less than 60 seconds.
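To give a sense of what detecting such a spike might look like, here is a minimal sketch of a trailing-window counter. The function name, the 60-second window, and the 200-container threshold are illustrative assumptions, not our actual values:

```python
def is_spawn_spike(spawn_times, now, window=60, threshold=200):
    """Return True if at least `threshold` containers were spawned
    within the trailing `window` seconds before `now`."""
    # keep only spawn timestamps inside the window (now - window, now]
    recent = [t for t in spawn_times if t <= now and now - t <= window]
    return len(recent) >= threshold
```

A monitoring loop could feed this function with container-creation timestamps and flag the moments when the host is under a spawning spike.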

Approaching the solution

Docker containers were already load balanced across multiple machines, so all we had to do was optimize a single Docker host and then replicate the procedure across the whole fleet of Docker host servers. Our approach to resolving the issue was to keep as much CPU and memory as possible available on each server at any given time. Our file system responded surprisingly well to such spikes, so we didn't need to take any further measures to improve file system I/O performance on each server.

The first thing we did was monitor the time intervals between most container spawning spikes. These intervals were the safest windows in which to clean up ghost (still running) containers and stopped ones. Stopped containers, despite not consuming any CPU or memory themselves, constrained the Docker daemon's ability to handle more containers, so they had to be completely removed from the system. After weeks of experimenting with the optimal cleanup interval, we eventually built a Python project powered by Celery that takes care of killing ghost containers and removing stopped ones, on two independent schedules. That way we radically reduced the times our servers surpassed 80% CPU or memory utilization, keeping them always ready for container spawning spikes.
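As a rough sketch of the stopped-container side of that cleanup (the helper names are ours, for illustration; in the real project, Celery schedules tasks like this on their own intervals), listing and removing exited containers with the Docker CLI could look like this:

```python
import subprocess

def stopped_container_ids(ps_output):
    # parse the output of `docker ps -aq --filter status=exited`,
    # which prints one container ID per line
    return [line.strip() for line in ps_output.splitlines() if line.strip()]

def clean_stopped_containers(run=subprocess.run):
    # list stopped containers, then remove them so they no longer
    # constrain the Docker daemon (`run` is injectable for testing)
    result = run(["docker", "ps", "-aq", "--filter", "status=exited"],
                 capture_output=True, text=True, check=True)
    ids = stopped_container_ids(result.stdout)
    if ids:
        run(["docker", "rm", *ids], check=True)
    return ids
```

In a Celery setup, a function like `clean_stopped_containers` would be registered as a task and run periodically via Celery's beat scheduler, with a second, separately scheduled task handling ghost containers.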

Next steps

Our optimization of Docker containers isn't over yet. In the coming days we will ship to production network interface sharing across Docker containers of the same project. This will reduce the number of virtual network interfaces created by hundreds, saving a significant amount of the resources occupied by the Docker daemon.
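For context, Docker's `--net=container:<name>` flag lets a new container join an existing container's network namespace instead of getting a virtual interface of its own. A hypothetical helper that builds such a `docker run` command (the function and parameter names are ours, for illustration) might look like this:

```python
def docker_run_args(image, share_net_with=None):
    # build a `docker run` argument list; if `share_net_with` names an
    # existing container, join its network namespace via
    # `--net=container:<name>` instead of creating a new interface
    args = ["docker", "run", "-d"]
    if share_net_with:
        args.append("--net=container:%s" % share_net_with)
    args.append(image)
    return args
```

With this scheme, only the first container of a project pays the cost of a virtual network interface; the rest reuse it.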

We are really happy that we can serve such heavy user activity seamlessly, and that there is still room for further optimization. We can't wait to share more details with you on how we handle specific issues in SourceLair's everyday life.