Last week, I was fortunate enough to attend Tech Field Day 26 in San Jose. While we met with several companies, this was a bit of a special edition of Tech Field Day: we also attended the CXL Forum at the Open Compute Project conference. In case you don’t know, the Open Compute Project is a project from a consortium of large-scale compute providers like AWS, Microsoft, Meta, and Google, amongst others. They aim to optimize hyperscale data centers in terms of power and cooling, deployments, and automation. So how does CXL fit into that equation?
CXL stands for Compute Express Link, a standard originally developed by Intel whose consortium now includes a large number of both cloud providers and hardware manufacturers. The CXL standard defines three separate protocols (definitions from Wikipedia):
- CXL.io – based on PCIe 5.0 with a few enhancements, it provides configuration, link initialization and management, device discovery and enumeration, interrupts, DMA, and register I/O access using non-coherent loads/stores
- CXL.cache – allows peripheral devices to coherently access and cache host CPU memory with a low latency request/response interface
- CXL.mem – allows host CPU to coherently access cached device memory with load/store commands for both volatile (RAM) and persistent non-volatile (flash memory) storage
The main area of focus for cloud vendors like Microsoft and Amazon is CXL.mem, which would allow them to add additional memory to cloud VM hosts. Why is this such a big deal? Memory represents the largest expense to cloud providers, and the requirements for memory keep increasing.
Beyond that, supporting a mix of workloads means memory can become “stranded”. If you are a database administrator, you can think of this like index fragmentation, which leads to wasted space. Ideally, cloud vendors would like to completely disaggregate memory and CPU, with memory tied to a rack rather than a specific host. That is one of the goals of CXL, but it will likely not happen for another 3-5 years.
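To make “stranded” memory a bit more concrete, here is a minimal sketch in Python. The host size and VM shapes are made-up numbers for illustration, not figures from any of the talks, and the `Host`, `VM`, and `stranded_memory_gb` names are hypothetical. The idea is simply that once a host runs out of cores, whatever memory is left over cannot be sold to another VM.

```python
from dataclasses import dataclass


@dataclass
class Host:
    cores: int
    memory_gb: int


@dataclass
class VM:
    cores: int
    memory_gb: int


def stranded_memory_gb(host, vms):
    """Place VMs in order until one does not fit; return memory left unusable.

    Memory counts as stranded only when the host has no free cores left,
    since no further VM can be placed there to use it.
    """
    free_cores, free_mem = host.cores, host.memory_gb
    for vm in vms:
        if vm.cores > free_cores or vm.memory_gb > free_mem:
            break
        free_cores -= vm.cores
        free_mem -= vm.memory_gb
    return free_mem if free_cores == 0 else 0


# Hypothetical host and a compute-heavy VM mix (illustrative numbers only).
host = Host(cores=64, memory_gb=512)
vms = [VM(cores=16, memory_gb=64)] * 4

stranded = stranded_memory_gb(host, vms)
print(f"{stranded} GB stranded ({stranded / host.memory_gb:.0%} of the host)")
# prints: 256 GB stranded (50% of the host)
```

With memory pooled at the rack level, that leftover 256 GB could in principle be handed to a memory-hungry VM on a different host instead of sitting idle.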
However, CXL is real, and on-board CXL memory sockets are coming soon. The best explanation of CXL’s use cases I saw last week was from Ryan Baxter, a Senior Director at Micron (Micron has some interesting solutions in the space). You can watch a version of that talk here. Effectively, you can have additional memory on a server on a CXL bus (which uses PCIe as its transport mechanism). This memory will be slightly slower than main memory, but still much faster than any persistent storage.
Another interesting talk was from Meta, who described their performance testing with CXL. Since the memory is remote, there is a performance cost, which was around 15% with no optimizations to their software. However, Meta wrote an application to perform memory management (on Linux), which reduced the overhead to less than 2%.
You might imagine that a database engine would be aware of this remote memory configuration, and might age pages it did not expect to reuse out of main memory and into remote memory.
I learned a lot last week. Hardware is still a very robust business, even though most of the focus is on the needs of the cloud providers. CXL promises some foundational changes to the way servers get built, and I think it will be exciting. Stay tuned for more posts from Tech Field Day 26.
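As a rough sketch of that idea (not how any real database engine or Meta’s tool works), the Python below models a two-tier page cache: hot pages live in a small “local” tier, and the least recently used page is demoted to a larger “remote” tier rather than being dropped. The `TieredPageCache` class, the tier sizes, and the LRU policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class TieredPageCache:
    """Toy two-tier cache: local (fast) memory backed by remote (CXL-like) memory."""

    def __init__(self, local_pages, remote_pages):
        self.local_pages = local_pages
        self.remote_pages = remote_pages
        # OrderedDict keeps pages in LRU order: oldest first, newest last.
        self.local = OrderedDict()
        self.remote = OrderedDict()

    def get(self, page_id):
        """Return page data, promoting remote hits to local; None on a miss."""
        if page_id in self.local:
            self.local.move_to_end(page_id)
            return self.local[page_id]
        if page_id in self.remote:
            data = self.remote.pop(page_id)
            self._insert_local(page_id, data)
            return data
        return None

    def put(self, page_id, data):
        """Store a page in the local tier, demoting a cold page if needed."""
        self.remote.pop(page_id, None)
        self._insert_local(page_id, data)

    def _insert_local(self, page_id, data):
        self.local[page_id] = data
        self.local.move_to_end(page_id)
        while len(self.local) > self.local_pages:
            # Age the coldest local page out to the remote tier.
            cold_id, cold_data = self.local.popitem(last=False)
            self._insert_remote(cold_id, cold_data)

    def _insert_remote(self, page_id, data):
        self.remote[page_id] = data
        self.remote.move_to_end(page_id)
        while len(self.remote) > self.remote_pages:
            self.remote.popitem(last=False)  # falls out of memory entirely


cache = TieredPageCache(local_pages=2, remote_pages=4)
for page_id in (1, 2, 3):
    cache.put(page_id, b"page-%d" % page_id)

print(sorted(cache.local), sorted(cache.remote))  # [2, 3] [1]
cache.get(1)  # remote hit: page 1 is promoted, page 2 is demoted
print(sorted(cache.local), sorted(cache.remote))  # [1, 3] [2]
```

A real engine would likely use a smarter signal than plain LRU, but the shape is the same: pages that are unlikely to be touched again soon pay the remote-memory latency, while the working set stays in main memory.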