Immutable Infrastructure Meets Mutable Science
Over the last few years, I’ve worked across multiple AI labs and teams and have noticed a repeated pattern of confusion between people who come from a modern cloud/distributed-systems background and people who come from traditional HPC or university-based AI/ML research labs.
Mutability is a bug
Anyone trained in recent years to do DevOps, site reliability, or other operations- and cloud-infrastructure-focused work has been taught that mutability is the enemy and that reproducible, immutable infrastructure is the goal. Before Docker and containers in general became as common as they are in the broader software industry, I believe this was first popularized by engineering tools and blogs from Netflix around 2015; tools like HashiCorp’s Packer for creating immutable AMIs that included application artifacts quickly became the norm. Netflix did not create these patterns, though I believe it’s fair to say that their advocacy was taken very seriously.
The guidance was to build as much context and environment into your deployment artifacts as possible. That made environments much more reproducible and made managing multiple environment targets far simpler: promotion pipelines that moved a deployment through those targets became much easier to administer.
Platform-as-a-service offerings like Heroku also contributed, with their buildpacks and Procfiles, before eventually supporting containers as the artifact format. Websites like The Twelve-Factor App (12factor.net) became standard documentation and practice around the world. The rise of microservice architectures also raised the stakes: you now had to sustainably manage all of your infrastructure across dozens, if not hundreds, of services.
Automating administration became a requirement for scaling, both at a technical level and from a personnel-growth perspective. You had no choice.
All of these tenets have been pounded repeatedly into anyone who has done distributed systems development over the last decade or so. These pipelines ultimately aim to make the actual administration of live services simple, but the pipelines themselves are riddled with complexity. I’m a full-on religious disciple of Tesler’s Law here: the complexity doesn’t go away, it just gets moved into the platform and its pipelines. Operations is a complex topic and understanding its intricacies is very hard. Finding abstractions that help people be productive without needing to understand everything is a fundamental requirement for making progress. It’s untenable to expect everyone to learn and understand everything before they can move forward.
That being said, you only really learn the importance of these things once you’ve had to support infrastructure at scale.
How academia sees infrastructure
AI/ML researcher, sometimes called research scientist, is still a nascent role “in industry” in the grand scheme of things. Arguably, it was only around 2020 or 2021 that AI/ML researchers ever had to care about multi-node or multi-host workloads. Before them, however, came the “big data” community, and before that, the HPC community.
Let’s talk very briefly about “big data”. I don’t want to spend too much time on this. Although this community was running large batch jobs over huge amounts of data, the key difference is how stateful the workloads are. MapReduce, Spark, and similar classes of tools are built for workloads where the work itself is independently parallelizable: map the data into independent chunks that can be worked on, then reduce and combine the results. While these workloads certainly come with challenges, they are “easy” in the sense that they tend to be idempotent and easily made elastic.
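To make “independently parallelizable” concrete, here’s a minimal sketch of the map/reduce shape in plain Python, with a process pool standing in for a cluster. The word-count task and function names are my own illustration, not tied to any particular framework:

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_chunk(lines):
    """Map step: each shard of data is processed completely independently."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(a, b):
    """Reduce step: combine partial results; order and grouping don't matter."""
    a.update(b)
    return a

if __name__ == "__main__":
    corpus = ["the quick brown fox", "the lazy dog", "the end"]
    shards = [corpus[i::2] for i in range(2)]          # shard the data
    with Pool(2) as pool:
        partials = pool.map(map_chunk, shards)         # workers never talk to each other
    print(reduce(reduce_counts, partials, Counter()))  # combine at the end
```

If a worker dies, you simply re-run its shard; that idempotence is exactly what makes these workloads easy to run elastically.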
The HPC community, perhaps most commonly working on physics or biology simulations, or in the oil and gas industry, was the original community that worked with interconnected workloads. The units of work were no longer independently parallelizable; they were the opposite. One part of a simulation might depend heavily on results computed elsewhere. Protocols and schemes were developed to let multiple machines communicate and coordinate with each other.
In other words, “big data” traditionally meant you had more data than could fit on a single machine, but you could shard the data across machines and shrink the problem to fit on any given one. Think of it like a recursive merge sort. In HPC, the data also doesn’t fit on a single machine, but the combination of machines is fundamentally treated as one machine: a single supercomputer.
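The classic embodiment of that coordination is MPI. As an illustrative sketch only (the mpi4py bindings and this toy all-reduce are my own choices, not something from the discussion above), notice that no rank can make progress until every other rank has contributed:

```python
# Run with something like: mpirun -n 4 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank owns one shard of the problem...
local = np.full(1_000, float(rank))
local_sum = local.sum()

# ...but the global result requires every machine to participate.
# allreduce blocks until all ranks contribute, so the cluster behaves
# like one computer rather than N independent workers.
global_sum = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks, global sum = {global_sum}")
```

The tight coupling is the point: lose one rank mid-run and the whole job stalls, which is a very different failure model from re-running an independent shard.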
While there were companies in industry doing high-performance computing, they were largely still run and operated like university research labs. The culture of a research lab, and of research scientists, has for many years been ultimately quite simple. The feedback loop is to do some research, publish a paper, present at a conference, and use that to justify funding for more research. Lather, rinse, and repeat.
Each iteration of this loop is typically independent. The code you wrote for a given paper’s research is independent. You start over each time, and the scope of “support” might be as short as a few weeks or as long as a few months. Maintenance doesn’t exist in this world the way it does for a “live production product”. Once the paper is done, you simply throw away the work.
You always know that no matter what gross code you wrote, or what hacks you put in to support your use case, once your paper has its results you no longer have to care about the code. Culturally, I believe this leads to a very different value system for researchers. Automation brings little to no value, because in the grand scheme of your research, you’re very unlikely to repeat the same set of steps all that often.
Systems and infrastructure at universities tend to be lightly maintained: upgrades seldom happen, security is often ignored, and so on. Again, maintenance simply doesn’t matter much to these groups of people.
Mutability is a feature
That being said, mutability is often a feature, because mutability is often a source of performance and short-term productivity. When you’re iterating on research code in HPC, it’s quite common that there’s simply no way to truly test your code from a local development environment. Data sets can be so large, and the system requirements for getting any kind of valuable signal so demanding, that you only see certain bugs or behaviors at full, stateful scale. Speed of iteration is one of the most highly valued parts of development. You don’t want to wait for a full build-and-release cycle for your immutable application artifacts. You want to update some code, run your batch job, and have that be the end of your development cycle.
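In that world, the entire “deployment pipeline” can be as small as the sketch below. The paths, host name, and use of rsync plus sbatch are my own illustrative assumptions, not a prescription from any particular cluster:

```python
# Hypothetical iteration loop on a Slurm cluster with an NFS-mounted home directory.
import subprocess

CODE_DIR = "./my_experiment/"                           # local working copy
NFS_TARGET = "login-node:/nfs/home/me/my_experiment/"   # shared path every worker mounts
JOB_SCRIPT = "/nfs/home/me/my_experiment/train.sbatch"  # batch script living on NFS

# 1. "Deploying" is just copying files; every worker sees the change immediately.
subprocess.run(["rsync", "-a", CODE_DIR, NFS_TARGET], check=True)

# 2. Submit the job; whatever code happens to be on NFS at runtime is what runs.
subprocess.run(["ssh", "login-node", "sbatch", JOB_SCRIPT], check=True)
```

Two commands, a few seconds, and the new code is live everywhere, which is exactly why asking researchers to trade this for a build pipeline is a hard sell.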
Mutability isn’t treated as a bug or a critical security vulnerability. Reproducibility of the dev environment doesn’t matter, because you’ll throw the dev environment away once the paper or dissertation is complete. Mutability is a feature, because the benefits of immutable infrastructure are considered irrelevant by most academics.
Your value system and their value system are different
All of this culminates in the same pattern I’ve seen repeatedly: people from each background consider the other side ignorant, and this creates a divisive relationship. “The researchers are bad engineers” is something I’ve said myself and have heard so often. But in hindsight, it’s all a function of the reward system, and I think it’s important to take a step back and understand that. Building up resentment is not going to help anyone. Some engineers come from cloud distributed systems, and some come from HPC. These backgrounds are being fused together into what we now call AI/ML engineers. Both have valuable skills, and we should learn from each other.
I, for one, believe there’s a way to do both. In a Slurm-based HPC environment, you might never use a container to package up your dependencies; you might depend purely on an NFS share being mounted to all compute worker nodes. Some Slurm teams are adopting Pyxis and enroot, which is a step toward modern immutable infrastructure practices, but unlike a modern backend application, these containers likely don’t contain the application code, which gets synchronized some other way. Researchers don’t want to wait on complicated build pipelines for their code to get updated and distributed. But I think there are better ways. I think the tools and platforms of a modern stack can be optimized to the point where there’s practically no difference in developer productivity. On my semi-retired technical newsletter, I wrote about fast Docker builds. When all you’re changing in a Docker image is the application code in the final layer, you can get Docker builds, pushes, and pulls down to under 10 seconds. That can feel like a mutable development environment, while in fact every single job submission is its own immutable container.
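A hedged sketch of what that can look like, assuming a Dockerfile whose heavy dependency layers come first and whose final layer only copies the application code (the registry, image name, and timestamp tag are placeholders of my own):

```python
# Hypothetical per-submission build: because only the final COPY layer changes,
# Docker's layer cache keeps the build, push, and pull nearly instant.
import subprocess
import time

REGISTRY = "registry.example.com/research"               # placeholder registry
IMAGE = f"{REGISTRY}/my-experiment:{int(time.time())}"   # unique, immutable tag per submission

# The build reuses every cached dependency layer; only the app-code layer is rebuilt.
subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)
subprocess.run(["docker", "push", IMAGE], check=True)

# Pin the batch job to this exact tag: the environment for this run can never
# drift, because the tag is never reused or overwritten.
print(f"submit the job against {IMAGE}")
```

To the researcher it still feels like “edit and resubmit”; to the platform, every run is pinned to an immutable artifact that can be pulled and reproduced later.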
Alas, this doesn’t really solve the problem of lineage and provenance for the data itself. I know there are tools like Pachyderm and DVC that attempt to address that as well. But maybe that’s for another day.