½ an Exabyte at SLAC. What does that “scale” really mean?
EXABYTE. We all know it’s an insane amount of data. Many software companies make a big commotion when their customers’ cumulative data passes the exabyte milestone (hey, we’re guilty too). Far more impressive, though, is when a single organization reaches ½ an exabyte on a single project. Yes, I’m aware that’s half, but I equate it to a half marathon… still very much worth celebrating and quite impressive.
What could an organization possibly be doing to create that much data?
At SLAC National Accelerator Laboratory, they’re working on the Vera C. Rubin Observatory project. If you’re not familiar with it, I highly recommend a quick listen to “Actually, It IS Rocket Science.” A brief overview: they’re essentially mapping the sky for planetary defense. Yeah, you read that right, PLANETARY DEFENSE, as in saving the world from meteors, among other things.
Back to the data. SLAC is leading the construction of the largest digital camera ever built, destined for the observatory under construction in Chile. The telescope will generate an enormous volume of data, about 20 terabytes (TB) of raw data per night. The total raw data collected over the 10 years of operation will be about 60 petabytes (PB), and processing will produce a roughly 20 PB catalog database. All told, the total data after processing will be hundreds of PB.
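To put those figures in perspective, here’s a quick back-of-envelope check in Python. The ~300 usable observing nights per year is my own assumption (weather and maintenance eat the rest); the other numbers are the ones quoted above:

```python
# Back-of-envelope check of the Rubin data volumes quoted above.
# Assumption (mine, not SLAC's): ~300 usable observing nights per year.
TB_PER_NIGHT = 20       # raw data per night
NIGHTS_PER_YEAR = 300   # assumed
YEARS = 10              # planned survey duration

raw_pb = TB_PER_NIGHT * NIGHTS_PER_YEAR * YEARS / 1000  # TB -> PB
print(f"Raw survey data: ~{raw_pb:.0f} PB")             # ~60 PB

# Processing multiplies that: calibrated images, templates, and a ~20 PB
# catalog database push the grand total into the hundreds of PB.
```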
There are very few companies in the world that have reached this milestone, and the rapid pace of growth obviously brings the issue of scale into question.
Before we get to that, let’s back up and define “scale.”
Scale is a verb, not a noun.
The industry (WEKA included) often uses phrases like “at scale,” which imply that scale is a noun, a destination companies can reach. But what constitutes scale? Is scale a terabyte, 5 petabytes, or the infamous exascale?
We like to think of scale as a verb, a qualifier of growth. As your organization’s data grows (scales up), do you have the ability to keep up with the pace and drive innovation?
How SLAC Handles Scaling (notice it’s a verb now).
At any scaling point you are bound to hit bottlenecks, and it is almost always a give and take: as you optimize away one bottleneck, say compute, another, perhaps the network, can’t keep up.
The trick is minimizing the bottlenecks as much as possible, maintaining pace, and optimizing for your end goal, as the sketch below illustrates.
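A toy model makes the give-and-take concrete: end-to-end throughput is capped by the slowest stage, so speeding up one stage just moves the bottleneck somewhere else. The stage names and numbers here are illustrative, not SLAC’s actual pipeline:

```python
# Toy pipeline model: end-to-end throughput = the slowest stage (GB/s).
stages = {"ingest": 40, "compute": 25, "network": 30, "storage": 50}

def bottleneck(stages):
    """Return the name and throughput of the limiting stage."""
    name = min(stages, key=stages.get)
    return name, stages[name]

print(bottleneck(stages))   # ('compute', 25) -- compute limits the pipeline

stages["compute"] = 60      # "fix" compute by adding hardware...
print(bottleneck(stages))   # ('network', 30) -- now the network is the limit
```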
At SLAC they need to process a ton of large files very fast; there’s no benefit in detecting a meteoroid once it has already reached Earth.
WEKA creates that foundational layer of the stack. As data is ingested from the telescope, WEKA scales linearly, meaning performance keeps pace as the cluster grows with the data. Furthermore, WEKA tiers to economical object storage, namely Ceph, which both lowers the total cost and allows capacity to scale independently from compute for better cost control.
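To show why tiering pays off, here’s a conceptual sketch of hot/cold tiering. To be clear, this is NOT WEKA’s actual API or policy engine, and the costs and thresholds are invented for illustration; it just captures the idea of keeping hot data on fast flash and demoting cold data to cheap object storage:

```python
# Conceptual hot/cold tiering sketch -- illustrative only, not WEKA's API.
import time

FLASH_COST_PER_TB = 100             # illustrative $/TB/month
OBJECT_COST_PER_TB = 10             # illustrative $/TB/month
COLD_AFTER_SECONDS = 7 * 24 * 3600  # demote after a week untouched (assumed)

def tier_for(last_access_ts, now=None):
    """Pick a tier based on how recently the data was accessed."""
    now = now if now is not None else time.time()
    return "object" if now - last_access_ts > COLD_AFTER_SECONDS else "flash"

# When 90% of capacity has gone cold, tiering cuts the monthly bill sharply:
hot_tb, cold_tb = 50, 450
tiered = hot_tb * FLASH_COST_PER_TB + cold_tb * OBJECT_COST_PER_TB
all_flash = (hot_tb + cold_tb) * FLASH_COST_PER_TB
print(f"all-flash: ${all_flash}/mo vs tiered: ${tiered}/mo")
```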
Lastly, SLAC has a fairly small team of administrators, so ease of use is critical given the sheer amount of data. With WEKA’s zero-tuning, zero-copy architecture, they don’t have to spend their time moving data around or doing the tuning that legacy environments like Lustre typically require.
It’s a Journey.
½ an exabyte is no small feat; the sheer volume of data requires a ton of pieces working together to accelerate research outcomes.
More important than the magnitude is what this data can unlock. I’m no scientist, but I’m going to guess there is a whole lot we as mankind don’t know about space.
But when you go somewhere you’ve never been before, you have the opportunity to learn more than you ever have before.
And just maybe even save the world.