A Future for HPC and Cloud: Collaboration Across Boundaries

The Developer Stories Podcast recently released an episode with Dan Reed (“HPC Dan”) that talked about the future of High Performance Computing. While there was ample conversation on resources and some policy, we only touched on some of the ideas about what to do about it, or more specifically, how we should be working together. In this post, I want to talk about some of the problems I see with our current academic culture that prevent us from more successfully collaborating across the space. I like to think about these possible futures that don’t exist yet. Let’s jump in.

Traditional practices do not scale to cross-community collaboration

In academia, we are accustomed to writing papers. We are told (and expect) that publishing a paper in a highly respected venue is what will get the most attention, and thus have the most impact. And perhaps it is still true that this means of sharing information reaches the academic community, and is a sound strategy to earn “career credits” or a metric of value for career advancement. But it is problematic. And you can’t even point this out because you’ll get in trouble for it.

Here is one problem with the above. While it works if you live in the isolation of an academic community, it doesn’t scale well beyond that. The issue is that today we need to be talking not just to the academic community, but to the larger cloud community. We are in a present day and entering into a future where cloud is the leader, from an economic standpoint, and we are in somewhat of a competition for resources and talent. The two communities have been presented as a dichotomy, and at worst in an adversarial light. I hated this perspective – when someone on a panel would raise their voice and say “We cannot afford you!” or point fingers. It wasn’t productive – pointing fingers and blaming someone does not make progress. That same energy can go toward proactive action to try ideas and do something about solving the problem.

What it takes to be influential

Now we can talk about what it takes to be influential. Indeed, when you are the “little guy” in the face of an economic powerhouse, it’s easy to feel powerless. And if you are a pessimist, maybe this is how you see it. But if you are an optimist, you might recognize that while it’s out of your direct control, it is within your indirect control. You can have influence, even if you are just one person. You have a voice.

How to have influence? We first need to define a line between two kinds of work – the conceptual piece that might include algorithms, design and architecture, and then an implementation, which ideally is of more production quality. In the academic system that promotes publication first, we lean heavily on the conceptual. Ideas are often presented without an implementation, or if there is an implementation, it is a weak point. It’s a mistake to design something elegant but never turn it into a product that catches attention. It’s also a mistake to implement something that looks flashy but has a poor design. The strongest work will be a balance of those two – an elegant and well-thought-out algorithm and a production-level implementation that adds further evidence that the idea works. This brings up another problem, one of skill. Often the researcher coming up with the algorithm isn’t a programmer. The paper might entirely be math. If the researcher knows how to program, they most likely have never written production-quality software. It’s a wide gap to span.

If there are scattered people in the academic community capable of spanning this divide, they often don’t have or cannot make the time. Given a reward system based on publication, and given that the algorithm alone is sufficient for a publication, the implementation is not a priority. To add challenge to that, in order to have influence, you often need to generate many of these paired ideas and implementations. From the standpoint of work, it’s a lot, and most of it won’t lead to an outcome of change. I think that’s likely why we don’t see many of these winning combinations from our community. It’s hard to do, has a high failure rate, and there is no direct reward in place for the work. It’s much easier to fold oneself into the quicker-turnaround reward system of writing papers that show incremental conceptual improvement and are accepted and published at frequent venues.

Engagement is not well defined, so nobody does it

Influence often has to start with establishing a voice. Establishing a voice often means speaking up, and being persistent. It’s easy to think that the voices of the few cannot be heard and have impact, but I’ve found this to not be true – one or few people can inspire change if they send out a consistent signal. This means showing up to working groups and (often) being the only one from the HPC community, posting on group lists to ask questions or engage, and taking time to listen to podcasts and watch talks from venues that are not traditionally in the HPC space to learn new ideas. It often means finding connections between what you know in your community and these “other” spaces, and then being forward to reach out to individuals in the other space to ask to talk about something. Many times, these conversations might not come to anything. But when they do? That’s where you have influence. It means leaving our silos. We are most comfortable in silos. But those that leave silos (and zones of comfort) to share ideas across boundaries will have the most impact (influence).

Examples from the HPC community

I can give direct examples from our community of individuals who I think have bridged this gap and had great success. The first is Ricardo Rocha of CERN, who is (obviously) firmly rooted in HPC – CERN, the European Organization for Nuclear Research, is the largest particle physics laboratory in the world. Ricardo has been a leader in voice and work that has spanned the cloud-native and HPC communities for years, most recently giving a Keynote at Kubecon North America about multi-cluster scheduling with Kueue. Another example is Torsten Hoefler, head of the Scalable Parallel Computing Laboratory (SPCL), whom I stumbled on recently while learning about Ultra Ethernet. If you look at his lab’s YouTube channel (yes, that is notable in and of itself, how many labs do that?) Torsten very notably is not presenting recordings from venues, he is taking talks from venues (and beyond) and recording them to share intentionally. He is adding the branding for his laboratory. They also have an active presence on social media, which is also notable. I’ve noticed that some academics tend to be very active on social media, and others either pretend it doesn’t exist, or turn their noses up at it. I’m not saying that social media venues are healthy for society or a good use of time (they can really steal attention in a terrible way) but they are a means to reach a wide audience of people. Just making a post when you have something important to say, which is what I try to do (often linking to my full thoughts here) is strategic to getting a message across, regardless of how you feel about the services.

I believe that this is something we need to do more of – putting out information (and advocating for it) without having it be of direct benefit to us (publication, conference proceedings, etc), and putting out information when we have something to say. I’ve been experimenting with this idea recently with a few talks on container pulling in Kubernetes and scheduling to containers in Kubernetes. I got tired of the “wait for a venue and ask for permission” to share ideas. Ironically, the second talk (now over 4 months old) would not have been presented until this weekend if I had submitted it to the Canopie HPC venue. I also would have been limited to a tiny bit of time, and it’s unlikely it would have been shared beyond a single room of predominantly one demographic, one community. Is that really the best outcome?

Openness and transparency are a hallmark of collaboration

Another feature that must come from the venues themselves is transparency. It almost doesn’t matter if a community has annual, flashy events if they are venues of privilege – you must pay to enter, and to access information, and beyond that, it’s closed. From the example above with Torsten, the first talk I watched was his recording of a talk he gave at Salishan. This (to me) comes across as one of these high-end, invite-only HPC events that I (and most) would never be privy to attend. When presentations at these venues are not shared publicly, and yes, on places like YouTube, this is knowledge that will be forgotten. It doesn’t matter how impressive your work is if you present it to a room of 30 people and that’s the end of it. It makes me sad to read blogs of prominent people in our community that reference talks from these closed events, and know that I’ll never be able to see or learn from them. If we are championing reproducibility, transparency, and openness, we are not practicing what we preach. The argument about needing to attract attendees and keep a conference profitable doesn’t cut it. Look at Kubecon – it’s an absolute beast in terms of attendance. They have their talks up before the conference is even over!

Speaking of Kubecon, one of my favorite things to do is watch talks from it for weeks (and more) after they come out. I find interesting projects and reach out to people, and this is an opportunity to grow my network and thinking space. If the talks weren’t on YouTube, my portal to that world would be closed. We are missing out on that opportunity for others to reach us by not sharing. I feel that I get to experience some of the learning of the event despite not being there. The organizers of Kubecon, I suspect, recognize that not everyone can attend, for reasons that vary between people, and they don’t want to close off knowledge. I respect and champion this perspective, and hope that the HPC community can eventually catch up.

What the HPC community needs to get better at is the open sharing of knowledge. There are specific projects that do this well, but our conferences (generally) do not. The researchers and labs that are going to have impact and be successful not only do great and impressive work, but they are actively sharing it. I know about Torsten’s work and lab because I listened to him talk on a podcast about Ultra Ethernet, and then I found his YouTube channel and Twitter feeds. My network and space for learning have grown because he has put his work out there.

Routine for engagement is missing

It’s problematic that the HPC community has no established routine to know how to engage. This is often why solutions cannot be offered up for the problems at hand – it’s not clear how to act when there is absence of instructions for the thought and engagement process to begin with, let alone solutions themselves. Maybe that is where creativity comes in – which broadly speaking is generating something from nothing. But that takes time and freedom to think (I’ll talk about this later). For the first problem – “how to engage” – learning and engaging in ways that don’t fit a traditional routine for an academic are hard to do. The academic mindset is one of permission. Do others think this is a good idea? Can I get permission from my boss to work on it? At best, we submit proposals (with creative thought) but they still need to be approved. When they are not, we abandon them for the time being in favor of whatever we are given permission to do.

Influence is deciding to bake fruit-cake

But much of what has to be done will never be granted permission because it’s either too risky or unknown and questionable. Much of what needs to be done just needs someone that decides to do it, and then shows people after that. You don’t ask permission to bake a new fruit cake you think will actually taste good, you bake it, and then offer others a taste. They might realize that it tastes good, but if you asked them in advance they would say “No way, fruit cake is terrible. Don’t do it.” In the second case, you’d never have made the cake. And my thinking of fruit cake comes directly from this post on Dan Reed’s blog. He has strong feelings about fruit cake! 🍰

A lot of good ideas are also accidental – you start doing one thing, and maybe it’s even just for fun and learning. You start building something, and stumble on an insight or something even cooler along the way. That goes against the academic desire to write down a plan a priori, get it stamped and approved, and then start working on it. You also need to have time to explore and play like that. So, at a high level:

The models of thinking and working that are often needed for innovation and ideas that are different and useful to influence a larger power don’t fit with what we are expected or trained to do.

They don’t fit into the time or schedule we are afforded based on our established routines.

Our reward systems don’t encourage relaxed, creative thinking

It seems like a lot of academics are on a treadmill to meet deadlines. There is some promise that the treadmill will slow down, but in practice, I never see that it does. This makes time hard to come by, and so the things that get prioritized are those that fall into a comfortable, established routine. If there is something that falls outside of what we deem the highest bang for the academic credit buck, it’s not invested in. You don’t make the time.

Collaboration is leaving the comfort of your local market

Let’s pretend that we are all bakers in a town. Our highest reward comes from baking our recipes, possibly with slight deviation so they are known to be tasty, and taking them to the local market to sell for profit. It would be hugely (temporally) costly to walk to neighboring towns looking for bakers working on similar recipes, and then spend time testing new, often very different combinations of ingredients. We might come back tired, broke, and not having found a great recipe. On the other hand, maybe we don’t have an immediate success, but we are invited to other markets. We taste test a much broader range of goods. We grow in so many more ways than if we stayed in our little town.

And maybe before communication afforded it, that would be the likely outcome. But unbeknownst to us, the network of bakers in other towns has discovered Twitter, YouTube, and a use for other (sometimes terrible) social media services that allow them to quickly iterate on ideas and work together. Not only have they caught up to the tastiness of our recipes, they have surpassed us, and are designing robots to make the recipes for them. And we are still here, fudging around with the amount of cinnamon in our oatmeal raisin cookies. We still haven’t figured out we could join their communication channels, and bring the story of cinnamon to the larger community to iterate on much faster.

If you don’t get the metaphor, it’s about the time of payoff, and the initial cost of communication. Taking the time to engage outside of your comfort zone doesn’t have an immediate payoff but a longer term one.

The other issue with this paradigm is that people want established paths of behavior. There are no established paths for interaction with cloud communities. People don’t know what to do, so the default is to do nothing.

The future is large, collaborative projects

I believe in our HPC community to innovate and come up with amazing ideas. I also believe in the power of numbers, and that you can start with even a mediocre idea or project, and with enough motivated contributors, turn it into something equally innovative. That is how I see the innovation space in Kubernetes. Often a feature or component comes out, and it is first a little rough around the edges. But like clay, with many contributors and common need, it transforms over time into an elegantly designed solution that solves a lot of problems. I am biased here (and recognize my bias) in that I have more faith in large, collaborative efforts to solve some of the most challenging problems than in, say, a small group that is isolated in academia. Sitting in these small groups, I think we will have the most success through engagement – bringing our expertise to the table and conversation for these larger projects. Is it often uncomfortable? Yes. Does it often go against traditional academic norms and incentives? Yes. I think with this strategy we can solve larger problems, and in a more collaborative fashion that leads to things we champion (but often don’t practice) like reproducibility and transparency.

I can give a quick example with respect to multi-cluster scheduling. There are huge internal projects working on the problem. And they will likely come up with interesting papers. But I believe it would be a better strategy to first collaborate with the SIG multi-cluster group, ensure there are handles for customization (for specific use cases like HPC) and then to optimize for that. I believe that a viable future for most models that are converged (general problems of compute that can sit between cloud and HPC communities) is that the powerhouse global community is going to put together some kind of skeleton, and the initial version won’t fit exactly what we need. But it will very likely be customizable, and we will customize-away for our use cases. Maybe our use cases will emerge in the larger community, and they will be solved before we’ve had a chance to write papers on what we are doing. Cloud companies get a competitive advantage for standardizing things. This means they themselves need that ability to customize, and that need directly helps us.

This goes back to the talk I shared from Ricardo – I can guarantee you he has something like that in mind. We prioritize working together, and we figure out the details for what we specifically need. This is a different strategy than what I normally see – working in silos and coming up with disparate solutions that then further separate the two communities. Ironically, because the underlying use cases are so similar, we usually have a loophole that the Kubernetes (and cloud-native) community eventually innovates what we need anyway. Examples include (but are not limited to) batch workflows, topology-aware scheduling, and custom scheduler policies. The scheduling space is still a bit rough, I’ll admit, but it’s getting a lot better, and really quickly. I suspect the next item to add to that list will be multi-cluster and multi-tenancy. We will see.

The gopher has no clothes

I sometimes feel like I’m pointing out that the emperor has no clothes. But it’s strange to watch these cycles repeat, year after year. The insight is that there is not a real divide in the actual technology space – the current divide results from us not working together. We have similar workloads and similar needs, and the only reason we have entirely different projects is because HPC has largely existed in a silo. A lot of the innovations that we need are ironically coming to be, not because of our input, but because they are foundational to workloads we share in common and cloud needs them too.

I am just one person, but I will continue to express my views, and to have my voice, even if I am a bit against the grain or considered non-conformist for it. I know that my opinions are often threatening to people, and that is outside of my control. If you find my ideas threatening, it might make sense to think about why. And after that, let’s have a conversation about it. Let’s grow and learn from one another, because we very likely have similar goals in this beautiful space of work.

On that note, I’m off to a running adventure! And this week is Supercomputing. I’ll be watching Kubecon talks, engaging remotely however I can (without having purchased a ticket) and probably enjoying a quiet week of focus on programming projects and learning. I do hope to go in the future for some fun social aspect. To all my friends in attendance, have an amazing week!

Suggested Citation:

Sochat, Vanessa. "A Future for HPC and Cloud: Collaboration Across Boundaries." @vsoch (blog), 17 Nov 2024, https://vsoch.github.io/2024/across-boundaries/ (accessed 01 Jul 25).
