There are many different layers and perspectives to optimization. Throughout my career as a software developer, I can’t think of a single project where optimization wasn’t mentioned in some shape or form, both in the early stages and in the later stages of the project; it’s a topic that keeps coming up from time to time… hopefully, not too late.
When I was studying, we were mainly taught about optimization as an algorithmic execution time concern: the famous big O notation that, more often than not, comes up in job interviews, and with good reason if I may add. Once I started working as a software developer, I realized that most of the time we also want to optimize along other dimensions involved in the process of software development and throughout the lifecycle of projects in general.
On a personal note, I truly enjoy optimization related tasks, particularly those grounded in execution time, memory usage, network latency, and other technical strategies. Such tasks usually bring you back to the fundamentals of programming, and they tend to be universal across programming languages, so they even give you an opportunity to learn a new language by applying known algorithms to its syntax.
Fast-forward to just a couple of months ago: I took on a task about optimizing a long running process. This post covers the different strategies I applied, which means it will most likely be a more technical post as well, filled with some JavaScript code snippets. I’d also like to point out that I’ll be using a made up problem, which will of course resemble the original problem I worked on, but hopefully it’ll give a rather generic perspective on it.
The Problem
Say we want to read information from a third party database, which stores information about students and their assistance history for their classes. Since we’re dealing with a third party database, we want to regularly synchronize this information into our own database, and perhaps mutate it during the process for our own needs and purposes.
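To make the snippets easier to follow, here is a rough idea of the shapes I’ll be assuming for the data; the field names are made up, just like the problem itself.

// Hypothetical data shapes, purely for illustration; the real third party
// database would of course look different.
const exampleStudent = { id: 42, name: 'Ada Lovelace' };
const exampleAssistanceRecord = { studentId: 42, classId: 7, date: '2024-03-01', attended: true };
const exampleProfile = { studentId: 42, email: 'ada@example.com', enrolledAt: '2023-09-01' };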
So, we’ll start with what I call the basic intuitive approach, represented in the following code snippet.
async function syncAllStudents() {
  // Fetch every student from the third party database.
  const students = await fetchStudents();
  for (const student of students) {
    // Two extra queries per student: assistance records and profile.
    const records = await fetchAssistanceRecords(student.id);
    const profile = await fetchProfile(student.id);
    const mappedStudent = mapStudent(student, records, profile);
    // One more query to persist the result in our own database.
    await db.students.upsert(student.id, mappedStudent);
  }
}
As we can see, the implementation above is pretty straightforward and should run without problems given a relatively small number of students. But as the number of students grows, we will notice that this function takes more and more time to finish, to the point where we could find ourselves waiting hours for it to complete.
Analysis & Solution
We can easily infer that our function is affected by the number of students, because we have to iterate over those students and then fetch even more data for each one. We can’t escape fetching all the data, nor iterating over it, since that’s the whole purpose of this synchronization function, so the question is: how do we make it faster?
Input and output (IO) operations tend to be far slower than operations running in memory, so if we reduce the number of such operations, we effectively reduce the execution time of the function. For IO bound work, it is usually preferable to bring the number of operations down to the minimum. Let’s look at the following code snippet:
async function syncAllStudents() {
  const students = await fetchStudents();
  const ids = students.map(student => student.id);
  // One query for all records and one for all profiles, instead of two per student.
  const recordsByStudent = await fetchAllAssistanceRecords(ids);
  const profilesByStudent = await fetchAllProfiles(ids);
  for (const student of students) {
    const records = recordsByStudent.get(student.id);
    const profile = profilesByStudent.get(student.id);
    const mappedStudent = mapStudent(student, records, profile);
    await db.students.upsert(student.id, mappedStudent);
  }
}
In terms of Big O notation, both functions have the same asymptotic execution time; however, you should notice that the second function already runs faster than the first one. This is mainly because we’ve reduced the number of database queries from n * 2 + 1 to 3. At this point, you may be wondering about the lookup time for recordsByStudent.get(id) and profilesByStudent.get(id); this would be expensive unless the data is stored in a Map, where lookups are cheap. Thus, our fetch functions should build a Map using the student’s id as the key, effectively adding two O(n) passes before our loop, which is still cheaper than searching within an unsorted list for every student.
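As a reference, this is roughly what fetchAllAssistanceRecords could look like under that assumption; thirdPartyDb.assistanceRecords.findByStudentIds is a made up helper that runs a single query for all ids:

// Sketch only: fetch all rows in one query, then group them into a Map
// keyed by student id with a single O(n) pass.
async function fetchAllAssistanceRecords(ids) {
  const rows = await thirdPartyDb.assistanceRecords.findByStudentIds(ids);
  const byStudent = new Map();
  for (const row of rows) {
    if (!byStudent.has(row.studentId)) {
      byStudent.set(row.studentId, []);
    }
    byStudent.get(row.studentId).push(row);
  }
  return byStudent;
}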
We still have one IO operation left inside the loop. Luckily for us, many databases support bulk operations, so we can instead build a list of upsert operations and later execute them with a single IO operation.
async function syncAllStudents() {
  const students = await fetchStudents();
  const ids = students.map(student => student.id);
  const recordsByStudent = await fetchAllAssistanceRecords(ids);
  const profilesByStudent = await fetchAllProfiles(ids);
  const operations = students.map((student) => {
    const records = recordsByStudent.get(student.id);
    const profile = profilesByStudent.get(student.id);
    const mappedStudent = mapStudent(student, records, profile);
    return { op: 'upsert', where: { id: student.id }, data: mappedStudent };
  });
  await db.students.bulk(operations);
}
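The db.students.bulk call is intentionally generic; as one concrete possibility, if our own database were MongoDB, the Node.js driver’s bulkWrite can send all those upserts in a single round trip (the collection and the mapping below are my own assumptions):

// Sketch of translating the generic operations list into MongoDB's bulkWrite.
// `studentsCollection` is assumed to be a MongoDB Collection instance.
async function bulkUpsertStudents(studentsCollection, operations) {
  const writes = operations.map(({ where, data }) => ({
    updateOne: {
      filter: { id: where.id },
      update: { $set: data },
      upsert: true,
    },
  }));
  await studentsCollection.bulkWrite(writes);
}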
Last but not least, async/await is a wonderful abstraction that truly increases the readability of our code. However, I would also argue that it sometimes leads us to forget that JavaScript is an asynchronous language and that we’re really just dealing with Promises, and promises can be run concurrently (not exactly in parallel, but that’s another story). So finally, we can make use of Promise.all on those IO operations that do not depend on each other.
async function syncAllStudents() {
  const students = await fetchStudents();
  const ids = students.map(student => student.id);
  // Both fetches are independent, so we can fire them concurrently.
  const [recordsByStudent, profilesByStudent] = await Promise.all([
    fetchAllAssistanceRecords(ids),
    fetchAllProfiles(ids)
  ]);
  const operations = students.map((student) => {
    const records = recordsByStudent.get(student.id);
    const profile = profilesByStudent.get(student.id);
    const mappedStudent = mapStudent(student, records, profile);
    return { op: 'upsert', where: { id: student.id }, data: mappedStudent };
  });
  await db.students.bulk(operations);
}
Conclusion
At first glance, it looks like we’ve ended up with more lines of code and, arguably, slightly less readability, but it is all worth it once we see how fast the new approach is: in the original task that inspired this post, execution time went from around 7 hours to around 10 minutes. That’s a huge win, if you ask me.
There are other optimization concerns, as you may have figured out; for example, now that all the data is kept in memory, what happens when we run out of it? However, this post has already grown way beyond my initial estimate, so I’ll leave those concerns for another day.