This is a guest post co-written by Sergio Delgado from YouCanBook.me. In their own words, “YouCanBook.me is a small, independent and fully remote team, who love solving scheduling problems all over the world.”
At YouCanBook.me, we like to say that we’re “a small company that does great things.” Many aspects of our day-to-day culture derive from that simple motto, especially a strong emphasis on the efficiency of our operations.
We’re long past the early years in which our CTO programmed the entire first version of our SaaS tool, but when I joined the company we were only five developers, of whom only three were in charge of backend services, and none 100% dedicated to them. The daily tasks of a programmer in a startup like ours are incredibly varied, from answering customer support requests to refining the task backlog, defining infrastructure, or helping with requirements. The job is as demanding as it is rewarding, and the challenges never end, but that forces us to seek efficiency in everything we do. A poorly defined project, where the way forward isn’t clear and that could take months of research, is a challenge for a team like ours, and we’ll probably postpone it again and again in favor of more urgent developments that bring value to our customers as soon as possible. For us, it’s very important to extract the maximum benefit from every development we make in the shortest possible time.
The result of this philosophy is our involvement with Amazon Web Services and its different platforms and tools. Although the early versions of our backend services didn’t run on AWS, the migration to the cloud allowed us to stop worrying about managing physical servers in a hosting company and focus our efforts on our business logic and code.
Currently, our backend servers are based on Java technology and run on AWS Elastic Beanstalk, while the frontend is a mix of JSP pages and React client applications. Onto those JVMs we deploy WAR files, compiled with AWS CodeBuild scripts from the source code we store in AWS CodeCommit repositories. In addition, all our monitoring is based on Amazon CloudWatch logs and metrics.
An interesting feature of our setup is that the different environments (such as development, pre-production, production, and analytics) are completely separate accounts, but we manage them through AWS Organizations. AWS Identity and Access Management (IAM) users are created in a root account and then assume roles to operate in the rest.
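As an illustrative sketch only: with the AWS SDK for Java v2, assuming a role in one of the member accounts looks roughly like the following. The role ARN and session name here are hypothetical placeholders, not our real configuration, and running it requires valid AWS credentials.

```java
import software.amazon.awssdk.services.sts.StsClient;
import software.amazon.awssdk.services.sts.model.AssumeRoleRequest;
import software.amazon.awssdk.services.sts.model.Credentials;

public class CrossAccountAccess {
    public static void main(String[] args) {
        // The IAM user authenticates against the root account and then
        // assumes a role in the target environment's account.
        try (StsClient sts = StsClient.create()) {
            AssumeRoleRequest request = AssumeRoleRequest.builder()
                .roleArn("arn:aws:iam::123456789012:role/ProductionOperator") // hypothetical ARN
                .roleSessionName("ops-session")
                .build();
            Credentials temporary = sts.assumeRole(request).credentials();
            // The temporary credentials are then used to build service
            // clients that operate in the target account.
            System.out.println("Session expires at " + temporary.expiration());
        }
    }
}
```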
With all this, three people manage half a dozen services, across four different environments, running on dozens of instances and, quite simply, everything works.
Our problem
Although our services are all developed with Java technology, the truth is that they don’t all share the same technology stack. Over the years, we have been migrating them to more modern frameworks and standardizing their design, but not all of them have been updated yet. We were aware that some services had clear performance issues and caused bottlenecks during high load spikes, especially the older code based on legacy technologies.
Our short-term solution was to oversize those specific services, with the consequent extra cost, and in the long term to rewrite them following the architecture of our most modern applications. But we were sure we could achieve very fast improvements if we invested in a performance analysis tool, or APM (Application Performance Monitoring). We knew there were many on the market; some of us had experience working with a few of them, and good references for others. So we created a performance improvement project on our roadmap, researched a little into which products looked best, and … not much more. We never found the time to spend weeks contacting vendors, installing the tools, analyzing our services during the trial period, and comparing the results. That’s why performance tasks were constantly being postponed, always waiting for a time of year when we didn’t have much else to do, which was never going to happen.
Amazon CodeGuru Profiler arrives
One of our good habits is staying very attentive to all AWS announcements, and we’re usually quick to test new features, especially when they don’t involve changes to our applications’ code.
In addition, relying on AWS products gives us an extra advantage. As a company policy, we love being able to define our security model in terms of IAM users, roles, and permissions, rather than having to create separate accounts and users on other platforms. This rigorous approach to managing access and permissions for users of our infrastructure allows us to undergo regular security audits and pass them without investing too much effort for a company of our size. In fact, our security certifications are one of our differentiators from our competitors.
That’s why we immediately recognized the opportunity Amazon CodeGuru Profiler offered us when it was announced at the re:Invent conference in late 2019. On paper, other APM tools we wanted to evaluate seemed to offer more information or a larger set of metrics, but the big question was whether they would be useful to us. What good were reporting screens if we weren’t sure what they meant, or if they didn’t offer recommendations we could implement immediately? Amazon CodeGuru seemed simple, but instead of seeing that as a disadvantage, we had the intuition that it could be a big benefit to us. By testing it, we could analyze the results in hours, not weeks, and find out whether it really gave us value in discovering the parts of the code we needed to optimize.
The best thing about CodeGuru Profiler is that it would have taken us longer to discuss whether or not to use it than to simply install it and try it out. A single developer, our infrastructure manager, was able to install the CodeGuru agent on all our JVMs in one afternoon. We ran CodeGuru Profiler directly in the production environment, which let us analyze latencies and identify bottlenecks using actual production data, especially after a load peak. This is much easier and more realistic for us than simulating a synthetic workload, with no possibility of defining it incorrectly or under false assumptions. All we find in CodeGuru is the authentic behavior of our systems.
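For reference, starting the profiler from inside a Java application takes only a few lines. This is a generic sketch assuming the codeguru-profiler-java-agent dependency is on the classpath; the profiling group name is a placeholder. The agent can equally be attached at JVM startup with the -javaagent flag, with no code changes at all.

```java
import software.amazon.codeguruprofilerjavaagent.Profiler;

public class ServerBootstrap {
    public static void main(String[] args) {
        // Start the CodeGuru Profiler agent in-process; samples are
        // reported to the named profiling group in the AWS account.
        Profiler.builder()
            .profilingGroupName("MyProfilingGroup") // placeholder name
            .build()
            .start();

        // ... launch the web application as usual ...
    }
}
```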
The following screenshot shows our systems pre-optimization.
The following screenshot shows our systems post-optimization.
Analysis
The flame graphs in CodeGuru Profiler were very easy for us to understand. We simply select the time interval in which we detected a scaling problem or peak workload and, for each server application, see the Java classes and methods that contributed most to the latency of our users’ requests. Because our business is based on integrating with different external calendar systems (such as Google, Outlook, or CalDAV), much of that latency is inevitable, but we quickly found two clear strategies for optimizing our code:
- Identify methods that don’t make requests to third-party systems but nevertheless add significant time to latencies. In these cases, CodeGuru Profiler also offered recommendations to optimize the code and improve its performance.
- See exactly what percentage of response times were due to which type of requests to the underlying calendars. Some requests (such as creating an event) don’t have much room for improvement, but we did find queries that were done much more frequently than we had estimated, and that could be largely avoided by a more appropriate search policy.
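The second strategy can be illustrated with a small, self-contained sketch: a memoizing wrapper around an expensive calendar lookup, so repeated identical queries are served from memory instead of hitting the external system again. Everything here is hypothetical: fetchBusySlots stands in for a real external calendar call, and a production cache would also need an expiry and invalidation policy.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class CalendarQueryCache {
    // Counts round trips to the (simulated) external calendar API.
    private final AtomicInteger remoteCalls = new AtomicInteger();

    // Hypothetical stand-in for an expensive call to an external calendar.
    private List<String> fetchBusySlots(String calendarId) {
        remoteCalls.incrementAndGet();
        return List.of(calendarId + ":09:00-10:00", calendarId + ":14:00-15:00");
    }

    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    // Repeated queries for the same calendar are served from memory.
    public List<String> busySlots(String calendarId) {
        return cache.computeIfAbsent(calendarId, this::fetchBusySlots);
    }

    public int remoteCallCount() {
        return remoteCalls.get();
    }

    public static void main(String[] args) {
        CalendarQueryCache cache = new CalendarQueryCache();
        cache.busySlots("alice");
        cache.busySlots("alice"); // second lookup avoids a remote round trip
        cache.busySlots("bob");
        System.out.println("remote calls: " + cache.remoteCallCount()); // prints "remote calls: 2"
    }
}
```

In our real services the equivalent change was choosing a more appropriate search policy, but the principle is the same: eliminate the calendar queries that were being repeated far more often than we had estimated.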
We got down to work, and in a couple of weeks we generated about 15 tickets in our backlog, most of which were deployed to production during the first month. Typically, each ticket required hours of development rather than days, and we haven’t had to revert any of them or identified any false positives in CodeGuru’s recommendations.
Aftermath
We optimized our oldest and worst-performing service, reducing its latency at the 95th percentile by 15% on a typical working day. In addition, our response time graphs are much flatter than before, because we eliminated latency spikes that occurred semi-regularly (see the following screenshot).
The improvement is such that, during one of the last load peaks we had on the platform, this service was no longer the bottleneck of the system. It handled all requests without problems and without blocking the rest of our APIs.
This has saved us not only the cost of extra instances we no longer need (which we had been running just to handle these scenarios), but also dozens of work-hours of deeper refactoring of legacy code, which was exactly what we were trying to avoid.
Another of our backend services, which typically handles a very high workload during business hours, has improved even further, reducing latency by up to 40%. In fact, on one occasion we introduced an error in our autoscaling configuration and reduced the number of running instances to a single machine. It took us a couple of hours to notice the mistake, because that single instance could handle all our users’ requests without any problem!
The future
Our use of CodeGuru Profiler is very simple, but it has been tremendously valuable to us. In the medium term, we’re thinking of sampling a subset of our servers or user requests instead of analyzing all production traffic, for efficiency. However, it’s not urgent, because our services work perfectly well with performance analysis enabled, and the impact on response times for our users is imperceptible.
How long do we plan to keep CodeGuru Profiler activated? The answer is clear: indefinitely. Improving problematic parts of our services that we more or less already knew about is a very good result, but the visibility it can offer us during future load peaks is extraordinarily valuable. Because, let’s not fool ourselves: we removed several bottlenecks, but hidden ones remain, and new developments will introduce more. With CloudWatch metrics and alarms, we can detect when this happens and know what happened, but CodeGuru helps us understand why.
If you have a problem similar to ours, or want to prevent it, we invite you to become more familiar with CodeGuru.
About YouCanBook.me
YouCanBook.me allows you to schedule meetings online for your business or team of any size. It eliminates the need to search for free slots by sending and answering emails, allowing your clients to create appointments directly in your calendar.
Since its inception in 2011, our company has remained small, efficient, self-financed, 100% remote, and dedicated to solving scheduling issues for users around the world. With just 15 employees across the UK, Spain, and the United States, we serve tens of thousands of customers, managing more than one million meetings each month.
About the authors
Sergio Delgado defines himself as a programmer by vocation. In 25 years of developing software, he has built video games in C++, slot machines in Java, and e-learning platforms in PHP; fought with Dreamweaver; automated calls in a call center; and run an R&D department. One day he started working in the cloud, and he no longer wants to come back down to earth. He’s currently the leader of the engineering team and backend architect at YouCanBook.me. He collaborates with the community, giving talks at meetups and interviews on various podcasts, and can be found on LinkedIn.
Rodney Bozo is an AWS Solutions Architect who has over 20 years of experience supporting customers with on-premises resource management as well as offering cloud-based solutions.