
Todd Gamblin at SC19
In our continuing series on current and future leaders of HPC-AI, we turn to Lawrence Livermore National Laboratory's Todd Gamblin, who has a well-deserved reputation in the HPC software community as a passionate engineer who enjoys rolling up his sleeves and diving into technical problems. It's not a stretch to see how he got hooked on HPC.
If someone is going to kickstart an HPC career and is looking for a guide, they couldn't do better than the combination of Dan Reed and Satoshi Matsuoka. In 2004, Gamblin was fortunate to have Reed as an advisor at the University of North Carolina at Chapel Hill (UNC), where the two discussed research projects around performance analysis for HPC simulations. Gamblin also benefited from a summer research internship at the University of Tokyo, during which he visited Matsuoka's HPC research group at Tokyo Tech.
Today, Gamblin is a Distinguished Member of the Technical Staff at Lawrence Livermore National Laboratory, and he's been recognized with several awards, including the Early Career Research Award from the U.S. Department of Energy in 2014, an R&D 100 Award in 2019, and the LLNL Director's Science & Technology Award in 2020.
We congratulate LLNL’s Todd Gamblin on being named an HPC-AI Vanguard.
HPC-AI Current and Future Leaders – LLNL’s Todd Gamblin: Driving Broader Adoption of HPC Software
What was your first involvement in HPC or AI?
Year? 2004
Where?
I spent my first year of graduate school at UNC researching asynchronous digital logic, but I wanted to switch to something a bit higher level (metastability wasn’t my thing!) without getting too far from the hardware. I met my future advisor, Dan Reed, in the spring of 2004, and we discussed potential research projects, mainly around performance analysis for HPC simulations. I decided to join Dan’s group in the fall, but I had already lined up a summer research internship at the University of Tokyo, through the NSF EAPSI (East Asia and Pacific Summer Institutes) program. Dan told me to visit Satoshi Matsuoka’s HPC research group at Tokyo Tech while I was there, to learn more about the field. So, my first exposure to HPC was a combination of research discussions with Dan in Chapel Hill and Satoshi at Tokyo Tech. I started doing HPC work in earnest in the fall of 2004.
What is your passion related to your career path?
My passion is building lasting tools and systems to help developers. From the start of my career, I wanted to create something widely used, something that could grow beyond just my own efforts. My original focus was on performance tools and profiling, but it seemed like building and installing these tools was often the biggest obstacle for users.
That drove me to start working on Spack. As the tool gained traction, I realized the need to scale our efforts and sustain both the core tool and the community behind package maintenance. It's been rewarding to see the HPC community (including labs, academia and even industry) converge around Spack, and I think it's helped to bring the broader HPC software ecosystem closer together. Previously, software projects weren't necessarily seen as part of such an interconnected ecosystem.
My other projects are extensions of that vision. With the High Performance Software Foundation (HPSF), we're working to grow HPC developer communities and drive broader adoption of key HPC projects through events, collaboration, and services like continuous integration and packaging. I'm also advocating for the integration of cloud technologies, like node virtualization and IaaS, into HPC centers, something I think will ultimately help developers build new types of software and workflows.
Please share with us a significant HPC-related event you've been involved with: an advance, a fresh insight, an innovation, or an instance when you contributed to a step forward in computer science or scientific research.
One of the most significant events in my career has been building and growing Spack into a foundational tool for the HPC community. When I started Spack, I wanted to create a package manager that could handle the complexities of HPC software. Over the years, Spack has become an essential part of the HPC ecosystem, widely adopted across national labs, universities, and industry partners. Spack's use within the Exascale Computing Project (ECP) and in the software environment on the El Capitan system at LLNL was particularly rewarding for me.
Spack was critical for ECP in that it helped teams collaborate on and build exascale applications and libraries, and it enabled over 600 tools to be integrated as part of ECP's Extreme-scale Scientific Software Stack (E4S). Beyond the technical milestones, I'm also proud of the community Spack has built. Seeing over 1,400 contributors help to build Spack as a shared platform has been deeply fulfilling.
Do you prefer working as an individual contributor or a team leader?
I like both. I’m a software engineer at heart, and I enjoy getting deep into technical problems. Modeling software ecosystems is at the core of Spack, and there are technical challenges in that space that I think could keep me occupied for years. Managing the tradeoff between performance, productivity and complexity has been an interesting journey.
At the same time, scaling up effort for a large project, building adoption, and building communities around software all require leadership. I enjoy coordinating technical teams around common goals, and it's been rewarding to see team members grow to the point where many aspects of the day-to-day work on Spack can happen without my involvement. I've also enjoyed the process of getting software teams, labs, and companies behind HPSF.
Who or what has influenced you the most to help you advance your career path in this advanced computing community?
The most significant influences on my career have been Bronis de Supinski and the opportunity to work at LLNL. Bronis, who served on my Ph.D. committee, encouraged me to join LLNL, and his mentorship has been invaluable. His guidance and connections across LLNL, the DOE, and industry have opened up important opportunities for me. Additionally, LLNL’s environment, where computational scientists collaborate closely with researchers and HPC facility staff, has exposed me to real-world HPC challenges that have shaped my focus on applied research.
I think LLNL has a unique culture in that it is very supportive of impactful, long-term open-source software projects. The lab has provided me with the freedom to pursue projects that wouldn’t have been possible elsewhere.
What are your thoughts on how we, the nation, build a stronger and deeper pipeline of talented and passionate HPC and AI professionals?
This is challenging due to the specialized education and skills required, coupled with strong competition from industry for top talent. However, with the recent AI explosion, the alignment between the scientific community and industry around large-scale computing has never been closer.
I think this presents a unique opportunity to leverage the alignment for the public good. The scientific community needs to enhance its visibility; currently, much of our recruitment happens through personal connections, and we don’t market as widely as we could. We should work to ensure that the wide range of HPC and AI applications for science is well known among undergraduates and make career pathways clearer. The scientific mission is inherently exciting and impactful to the nation, and better marketing can inspire a new generation of scientists to join us.
I may be biased by my personal experience, but I also think that highly visible open-source projects are a way that we can attract talent interested in impactful scientific work. The more we can fund people to work in the open, leverage open-source communities, and engage with major projects (e.g., PyTorch, LLVM, Spack, Ultra Ethernet Consortium), the more visible we become and the more talent we can recruit. HPC is in many ways converging with AI and cloud, and I think we need to embrace that.

Todd Gamblin and family
What does it take to be an effective leader in HPC and AI?
Deep technical understanding and a broader awareness of the industry landscape are both important. HPC systems and the AI industry are increasingly interconnected, as both aim to run large-scale, tightly coupled applications. Success in HPC will increasingly rely on the ability to leverage industry advancements rather than to work in isolation.
The HPC community can be resistant to change, and effective leadership requires not only a vision of what HPC could become but also the skill to communicate and persuade. Leaders must bridge the gap between the established HPC mindset and emerging opportunities, and often that requires a deep technical argument not only for why we should make a given change on the HPC side but also for how we can do it.
For example, bridging the increasing gulf between the software environments of on-prem HPC machines and cloud systems is going to require us to get creative with how we manage on-prem machines and with the types of hardware we push vendors to deliver.
What is the biggest challenge you face in your current role?
On the technical side, my biggest challenge is tackling the increasing complexity of the HPC hardware and software environment. HPC by its nature must support a very wide range of CPU and accelerator architectures, and that doesn’t lend itself to simplicity. HPC still requires a great deal of expertise on the part of the user to get the fastest build of an application or to integrate disparate software stacks, and we have work to do before users can easily install all the tools they want for these machines.
From a leadership perspective, my biggest challenge is bridging the gap between traditional HPC users and those looking to leverage cloud technologies. In Livermore Computing, part of my role is to explore how HPC can incorporate cloud capabilities. Users increasingly want to deploy distributed applications, services, and automated workflows, which are often easier to set up using VMs or containers. However, HPC centers aren’t always optimized for multi-tenancy, especially for these types of workloads, despite the growing demand. I frequently find myself mediating between these two groups, and aligning their needs can be challenging. I would like to create a future where all these workloads can coexist seamlessly on the same system.
What changes do you see for the HPC/AI community in the next five to 10 years, and how do you see your own skills evolving during this time frame?
There is currently a big gap between the type of automation possible in cloud environments and the automation possible in on-prem HPC facilities. There are also fundamental economic challenges around shifting workloads to the cloud; it is likely that the economics will not make sense for the largest HPC facilities to move their compute to a cloud-like charging model. I would like to see a big investment in bringing on-prem computing up to par with the capabilities clouds are offering, at least for a subset of key services. VMs, network virtualization, block volume services, and encrypted network traffic for multi-tenancy all seem necessary for giving HPC users more control over their environment. I think this will become increasingly necessary as AI and cloud workloads become more common at HPC centers, and it will require shifts in the skillsets of HPC center staff, as well as shifts in the composition of science application teams.
What is your view on the convergence and co-dependence of HPC and AI?
My view on AI is that it is a type of HPC workload, and it's quickly grown to dominate large-scale computing across the world. It's not clear to me how much more the current AI bubble can grow, or whether LLMs will remain the main AI application over the next five or 10 years. LLMs are amazing, but they don't reason so much as recognize and recombine patterns, and I think it will take more than increasingly large models to bridge that gap. AI will be used to accelerate simulation workloads, and I think we'll increasingly see AI being trained on the output of traditional simulations to accelerate design optimization or to build more interactive interfaces to simulate problem spaces. All that is to say, we'll see HPC and AI used together more and more.
Do you believe science drives technology or technology drives science?
Both. Some science is fundamental and enables new types of technology: the transistor, integrated circuits, neural nets, deep learning. Improvements in technology enable us to build bigger, more productive systems and to drive discovery—this is the core promise of HPC.
Would you like to share anything about your personal life?
Outside of work, my life revolves around my family. We have two girls, ages 4 and 7, and they keep my wife and me very busy. You can find us shuttling them to playdates, gymnastics class, dance class, performances and school events.