Programming

Basecamp's house of cards...

Photo by Sigmund on Unsplash

Photo by Sigmund on Unsplash

Character is like toothpaste - it really only comes out under pressure (unknown)

I don’t know where I first heard the above quote, but the events of the last week have really shown how true this saying is.

I will start by saying that I have been an interested (but not rabid) fan of the journey that Jason Fried and DHH have been, building their startup called Basecamp. I started on my journey of building online SaaS system 20 years ago, when they first started the company. I had a further 20 years more experience of running a software consulting business before that, so I was intrigued, and a little surprised that two young lads could start an online software business that just seemed to go gangbusters out of the gate.

I guess that was the main thing that caught my attention - that it seemed so effortless, and looked like revenues just snowballed from day 1. Heck they even said that they were earning revenue before they had a chance to build in a billing and payment interface into the system! Something that I had never experienced at that stage.

As a developer myself, I admired DHH’s work with building the Rails framework. Though even back then, I found the framework to be a little too opinionated - much like the founders themselves. I grew to love the Ruby language, but not the Rails framework. Indeed my current startup, which is successful beyond my dreams, is built using Ruby (but on a different framework).

I used to read Jason and David’s blog posts, and over the years, I even bought a few of their books. However I always read their advice with a touch of reservation. They always struck me as being the guys that fell into a pond and came up with a Koi carp in their mouth and went “Oh, hey, let’s tell everyone else how to be great Koi carp fishermen like ourselves”.

Don’t read me wrong - I did respect a lot of the things they did, and how they rattled some of the embedded habits of running a business, but somehow, it all seemed to come from a charmed existence, devoid of any real challenges or problems that required some really creative problem solving.

As far as I know, the never had to deal with situations like:

  • Having to make payroll for the team with $0 in the bank

  • Having an employee go rogue and either embezzle funds from the company or conduct systemic abuse of another employee over time

  • Having a founder/partnership separation that was acrimonious or came at the wrong time

  • Having to deal with data loss or accidental customer data deletion due to bad internal practices

I have had to deal with nearly all the above in my 40 odd years of running former businesses. Some of them more than once. I wanted advice that came from the painful viewpoint of having experienced this and and had you staring at the possibility of the end of your company as a consequence. But when you are busy raking in the cash as Jason and David were, it is hard to listed to the “Oh, just do as we do and you will be just as successful” mantra.

The current events at the company have shown the fragility of the foundations that have brought the company to these dizzying heights. Fundamental issues with employee management have been ignored and seemingly plastered over with benefits and reinforcement of “You are lucky to be working at such a great company and being lead by such great leaders!”.

It seems though, that when push came to shove, and employees expressed dissatisfaction with a number of policies and people within the company, it all became too hard, and instead of confronting the problem head on and dealing with the issue, the decision to avoid it and offer band aid solutions was put forward. This seems not to have worked this time around.

Indeed, it seemed that at the ultimate “all hands” meeting, one founder put his normally opinionated personality into hiding and said he would offer no opinions, and the other founder attended from his bed, turned his video off and muted his mic and seemingly posted a slew of anti-Apple retweets during probably what was the most important meeting of his company’s existence.

Basecamp is a wounded beast now, and it will take some time to recover. I genuinely hope that they find their feet again, and that they learn from this, and that their company sticks around for another few decades. I hope that the employee’s who left find gainful and happy employment elsewhere, and I hope that Jason and David approach any future “lessons” they they wish to impart with a lot more humility. I also hope they also see fit to step down from the pulpit every now and then and ask for advice from those that came before them, rather than roll over every business legacy with their own thoughts of how things should be done.


The case of WOPR and the lost code...

WOPR.jpg

I got an email earlier this week from a guy reaching out after he read a post of mine on a gaming forum that was posted over two decades ago!

I couldn't for the life of me remember what I had said, but he reiterated that it was about a simulation that I wrote in BASIC that was a replica of the WOPR computer from the movie "War Games".

All my old code was stored on floppy disks and ZIP drives, and have long since become landfill somewhere, and have over the course of time also faded from my memory.

It kind of makes me glad that we have a service like GitHub now, that can preserve old code for many, many years - possibly even once I am no longer around. I am lucky enough to be a GitHub Arctic Code Vault Contributor, and to have my code preserved, so that one day perhaps the successors of the human race can find and peruse them.

Speaking of Github, I did manage to find another avid coder's simulation of WOPR on there. Here it is: https://github.com/zompiexx/wargames

Now I just need to find a good BASIC interpreter for my Mac so I can play it again!

Finding and fixing slow queries

I love dashboards, and I run several on my system to monitor my HR startup in real time. I love collecting data points around subscriptions, user activity, and most importantly, server uptime and performance. Especially now that we are growing to thousands of customers and tens of thousands of employees using our system on a daily basis, I am finding that a lot of things that we built in the early days are starting to show cracks and issues.

Most of this would have been hidden from my knowledge, except for the fact that we have been using AppSignal for our application monitoring for just over a year now, and the AppSignal dashboards and one screen that I keep checking several times a day to ensure that everything is just ticking along fine.

One thing that I have noticed of late though, is the page rendering speed was slowly creeping up. Pages that were taking 1 to 2 seconds to build and render on the server side were now taking 5 to 8 seconds, sometimes 10 seconds plus. This was, of course, unacceptable, and would be severely impacting the user experience for our customers and their employees.

Hunting For The Problem

I was curious as to what was causing these slow downs, and luckily, AppSignal’s fantastic dashboard actually allows me to click on one of these peaks on the chart, and see exactly what was happening at this point, like so:

HRP_AppSignal_1.png

From this, it will show a breakdown of all the App processes that were running at this particular point in time, making it easy to identity the ‘slow process’:

HRP_AppSignal_2.png

And furthermore, clicking on the process name above, will show you an ‘execution timeline’, showing all the calls and database activity, along with the execution time for each:

HRP_AppSignal_3.png

This makes it so easy to see the actual database query that was causing the slowdown in the first place.

HRP_AppSignal_4.png

This is showing the actual SQL code that is running slowly in this instance.

Identifying The Underlying Cause

So what was the issue? The answer was actually quite easy to deduce by the above database query tracing, and it had a lot to do with how Ruby ORMs (Object Relational Mappers) work. What is an ORM? Well, to simplify the explanation, it is basically a way for Ruby to treat a database store as just another Ruby object, allowing you to manipulate it just like a normal Ruby object without worrying about knowing SQL (Structured Query Language) or any underlying database commands.

To explain it in layman’s terms - lets say you have a database table called ‘Employees’, and you also have another table called ‘Departments’. Now, each employee can belong to exactly one department, so you can set up a relationship between the two tables so that there is a department_id field in the Employees table, which links to the id within the Departments table.

SQL_Diagram_1.png

Now, because we have many different companies using our HR system, we need to also have a Company database, and link each Employee, AND each Department to a company by using a company_id field to separate them. Each company can have their set of employees, and each company can also have their own unique set of departments.

SQL_Diagram_2.png

Looks a bit complicated right? And handling this in SQL would be a tricky thing, however, the DataMapper Ruby ORM that we use to layer on top of our database actually makes this extremely easy. In fact, to access, for example, all employees in Company ‘Acme Industries’, who are in the Department called ‘Finance’, it would be as easy as executing this:

Ruby_Statement_1.png

And it would work straight away. No need to worry about table joining etc. - the ORM takes care of all of that for us, and gives us the results we desire… or does it?

Let’s look at the SQL query that we would normally do to get the same result, using a standard SQL INNER JOIN clause:

SQL_Statement_1.png

A lot more complicated, isn’t it? BUT… it is super fast, and able to pluck out employees from a table with thousands of companies and millions of employees in less than a second. That is the beauty of raw SQL.

However, when I look at the raw SQL being generated by our DataMapper ORM, it was a little different. Now, with SQL, there are many ways to skin a cat, and it looks like DataMapper was taking the easiest pathway. After all, multi table joins are complicated - just look at the query above! What DataMapper was doing was splitting the query into three parts based on the things being asked of it. Here was the resulting code:

SQL_Statement_2.png

It is important to realise that this is actually giving the EXACT SAME result as the raw SQL query above, but it is doing it in a slightly different way.

But why was it running so much slower that it should have? It didn’t really make sense.

But then, we realised it. It was all coming down to Step 2 of the DataMapper query, which was staring at us in the face when we looked at the AppSignal logs.

Fixing The Problem

You see, now that we had thousands upon thousands of customers in the system, almost all our customers had a Department called ‘Finance’ in their companies. So Step 2 was actually returning THOUSANDS of department IDs that Step3 was trying to look for. Of course, most companies only have ONE department called ‘Finance’, so the third step should have only been looking for one company_id and ONE department_id, but it was not. All because we had not specified a qualifier for the Departments table.

Changing the DataMapper query to the following helped to increase the speed of the query quite significantly:

Ruby_Selection_2.png

As you can see, we had to add the qualifier:

:company => {:name => “Acme Industries”}

to the Department file as well.

This resulted in the following conversion to SQL:

SQL_Statement_3.png

So while the Ruby method now looks like we are duplicating things a little, it actually works better in terms of database performance.

To show you the difference, here is the AppSignal dashboard for page rendering speed before and after we made the fix:

Page Response Fix.png

Look at the drop in average page responses from 4 to 6 seconds, down to 2 seconds or less.

This didn’t stop here. Motivated by these gains, I went ahead and kept looking at later spikes to see if there was anything else on our dashboard that could be optimised. Sure enough, I found a couple of other queries that had the same issue, and I fixed those too:

Page Response Fix 2.png

Most big spikes gone! What a win.

This just goes to show the importance of an application monitoring tools such as AppSignal or any other APM (Application Performance Monitoring) tool for your SaaS. A lot of time, you won’t realise that certain optimisations will be needed when you have a smaller data set. It is only as you grow that you start to see some of these bottlenecks, and a good tool will certainly help narrow down where you should be working on.

Most times, people just upgrade server capacity, or throw more caching solutions in the mix, but all that costs extra money, and if you a bootstrapped startup like us, this sort of fix cost $0. Actually, in terms of AppSignal subscription price, it has pretty much paid for this whole year’s subscription to their service in terms of cost savings from our hosting provider.

Making my code run 3000 times faster!

Photo by Abed Ismail on Unsplash

Photo by Abed Ismail on Unsplash

I first started using the MySQL database system way back when version 2 was released. I cut my teeth on writing SQL queries using that system, even when it still didn’t have nested SELECTs, which entailed me really getting to grips with hand optimised JOIN commands.

But over the years, as better tools and layers of abstraction evolved, I began to touch the raw SQL less and lees, and began to rely on these other tools to do the heavy lifting for me.

My latest startup uses the newest, shiniest version of MySQL to store data, but I am several steps removed from the data store via a Ruby ORM (Object Relational Mapping). Basically, I just define all the database elements as standard Ruby objects, and the system just takes care of everything for me.

It works well 98% of the time, but sometimes there is a problem, like the one I will describe below.

Plenty of space, really! 😅

Plenty of space, really! 😅

In my SaaS app, we allow our users to store document attachments against their employees. We have a handy little widget in the footer of every screen to show our users just how much of their allotted storage they are presently using.

Now, this widget used to do a quick calculation upon every page load. Initially it worked well, with only a few extra milliseconds being added to page load time, but as our system grew, and we allowed attachments in different areas (i.e. Company Library, Job Applicants, e-Signatures, Expense Claim receipts etc.), we started to find that this page load time was actually being increased to over a second - sometimes even a couple of seconds per view, which really impacts the overall snappy feel of the app.

What started as this…

What started as this…

Quickly became this…

Quickly became this…

How would we solve this? Well, we decided to transition from a live, real time calculation, to running a process every couple of hours (in the background) that would go through the file attachment tables and tally up the total storage amount and store that in a database field somewhere. The screen rendering process would then just present the value of that database field on the screen each time.

So we would lose the real time aspect, but customers would still see their storage allocations up to only a couple of hours of lag time. Best thing was that page loads were back up to nearly instantaneous again.

We built a utility routine that runs on our background worker server to run these calculations every 4 hours (on a CRON timer). But the problem was, the routine would have to run through our entire active customer list of 1700+ companies, and tally up all the attachments across the different files in different areas, and store it in the company masterfile. That routine would take up to 45 minutes to run! AND it was also spiking the CPU of the worker server so that other critical background tasks (like calculating leave, handling Slack notifications and important emails) was being impacted and delayed.

We tried several things like spawning several concurrent Threads to run part of the task in, but still no joy. I kept getting critical alerts in the middle of the night that our worker instances were down or struggling with the load. Clearly not acceptable.

This was a frustrating issue, and the team, including myself, were struggling to solve it. But then it dawned on me. We were getting lazy. Our immediate ‘go to’ solution was to stick with Ruby and the ORM and our background worker to solve a background task. What if there was another way?

Looking at the problem, there were no real external inputs required. The task was simply to step through the customer database row by row, then find the associated records (which was easy, because we have keys linking them all), and run a SUM query and then total up all those SUMs and save it in the main company database.

Why weren’t we just doing this in SQL directly? But also, MySQL now has the concept of ‘Scheduled Events’, where you can store regular housekeeping routines right within the database that can to routine queries on a fixed interval. No need to set up a timer on our worker server at all.

So I opened up a MySQL query console, and typed up a query that would do the totalling of all the attachment records and perform an UPDATE on the company row.

Nested selects…. nested selects everywhere!

Nested selects…. nested selects everywhere!

I ran the query.

It took…. 0.8 seconds!

Let me say that again - it took LESS than a second to run. Compared the worker thread we used to do the EXACT SAME thing that took 45 minutes+. That is about 1/3000th the time.

Best of all, it completely takes the load off our worker server. I think the database CPU spiked about 2% for that 1 second timeframe. This meant that instead of running every 4 hours, we could schedule the task to run every hour. Heck, we could even run it every 5 minutes if we wanted to, but I don’t want to get that crazy unless customers really demand it.

It was a valuable lesson to me that complacency and laziness can really be a detriment in a growing business. When you have a hammer, everything looks like a nail, and we were so wrapped up in the comfy perceived safety and convenience of our Ruby ORM that it hadn’t really occurred to us that we could take the covers off and get down to the nuts and bolts to solve the issue in the first place.

Next time there is a problem like this in my startup, I will go back several decades in my working life again, and see if I can roll up my sleeves and fix it the ‘good old fashioned way’.