CXL: A Basic Tutorial
Here is a brief introduction to Compute Express Link (CXL). This is a new high-speed CPU interconnect that enables high-speed, efficient operation between the CPU and platform enhancements and workload accelerators.
00:21 Hugh Curley:
Welcome to this 15-minute introduction to CXL, the new interface that runs on PCIe 5 or later. It is designed for high-end, ultra-high-bandwidth and ultra-low-latency demands. It is, without a doubt, the interface that belongs at the Flash Memory Summit. I am Hugh Curley, consultant with KnowledgeTek, and CXL is Compute Express Link.
CXL moves shared system memory and cache to be near the distributed processors that will be using it, thus reducing the bottlenecks of a shared memory bus and reducing the time for memory accesses. I remember when a 1.8-microsecond memory access was considered good. Here, the engineers are shaving nanoseconds off the time to access memory.
You might say that graphics processing units (GPUs) and NVMe devices share system memory with the processor, and you are correct. But with the GPU, the sharing is one way only, from the host to a single graphics unit. With NVMe, both the host and the device can access the memory, but there is only a single instance of a given memory location. It may be on the host or on the NVMe device, with both the host and device able to access it. The memory of both the GPU and NVMe devices is controlled by the host memory manager, so no conflicts or coherency problems develop, and neither the GPU nor the NVMe devices are peers of the host.
With CXL, multiple peer processors can be reading and updating any given memory location or cache location at the same time, so coherency must be managed. If any processor writes to a memory location, all other copies of that location are marked as invalid. Processors accessing that memory location must refetch that data before acting on it. This requires a lot of communication, and that communication is CXL.
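To make the invalidate-on-write idea above concrete, here is a minimal sketch in Python. It models only the bookkeeping the talk describes — peers holding copies, a write invalidating everyone else's copy, and a refetch from main memory — not the actual CXL.cache hardware snoop protocol; the class and method names are illustrative.

```python
class CoherencyDomain:
    """Tracks which peers hold a valid copy of each memory location."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # address -> data
        self.valid_copies = {}          # address -> {peer: data}

    def read(self, peer, address):
        copies = self.valid_copies.setdefault(address, {})
        if peer not in copies:
            # Invalid or absent locally: refetch from main memory.
            copies[peer] = self.main_memory[address]
        return copies[peer]

    def write(self, peer, address, data):
        # The writer's copy becomes the only valid one; all other
        # cached copies of this location are marked invalid.
        self.main_memory[address] = data
        self.valid_copies[address] = {peer: data}

    def holds_valid_copy(self, peer, address):
        return peer in self.valid_copies.get(address, {})
```

For example, after a CPU and an accelerator both read address 0x1000, a CPU write leaves the accelerator's copy invalid, and its next read refetches the new data.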
Why add this extra complexity and communication? CXL allows the system designer to move the memory and cache physically closer to the processor that is using it, to reduce latency. When you add remote processors or processing devices, each device brings the memory and cache it needs. This allows the system owners to balance performance versus cost. System administrators can add more system memory by adding memory expansion units. There are additional requirements we must address.
The reasons for CXL are high bandwidth and low latency. It must be scalable to address applications with different demands. But the elephant in this room is coherency. We are moving to new ground, and it must be designed correctly. CXL is probably a bad choice for block access or where there is just a single instance of memory addresses. It is also probably a bad choice for mesh architectures with multiple accelerators all working on the same problem; CCIX would probably be better. CXL coherency is our issue.
So how does it work? I mentioned it in one sentence earlier, which probably raised more questions than it answered. This single page may do the same, which is good. It means you'll be ready for the in-depth presentations and discussions later in this conference. Data is copied from the host processor memory to the device's memory, and there may be multiple devices with the same memory addresses.
When a device updates a memory location, that location is marked as invalid in all other memories or caches. When any device wants to read or write a memory location that is either invalid or not in its memory, it must read it from main memory. The memory tables and coherency logic are in the host. This keeps down the cost and complexity for device designers and manufacturers.
There are a few host development companies, but many device development companies, so this asymmetric approach should reduce incompatibilities. We mentioned the CPU and PCIe before. CXL has three protocols, which we will address:
CXL.mem: used to maintain coherency among shared memories.
CXL.cache: used to maintain coherency among shared caches. This is the most complex of the three.
CXL.io: used for administrative functions such as discovery. It is basically PCIe 5 with a non-posted write transaction added.
There are also three types of CXL topologies.
Type 1 is for cache, but no shared memories. Of course, CXL.io is used for setup and administration. This slide shows some usages.
Type 2 is for shared cache and shared memories. "HBM" is high bandwidth memory.
Type 3 is for shared memory, but no shared cache, such as for memory expansion units. The Type 3 device does not have externally visible processors or cache.
One subject we must address is: who manages the memory, the host or the device, and what does management mean? The easy way to answer that is that device-managed memory is not accessible to the host. Therefore, CXL does not see it or manage it. Remember that the management logic and tables for CXL are in the host. Host-managed memory is memory on the host or device that the host and CXL can monitor and manage.
Two other concepts are the host bias coherency model and the device bias coherency model. A system can use either, or be switchable between them, perhaps using host bias for transferring commands and status and switching to device bias for transferring data.
Notice the red line in this picture showing that the device must go to and through the host to address host-managed memory on the device. This is not very efficient for the device to access data. This next slide shows the device bias coherency model. Notice how efficiently the device can access the host-managed memory on the device now.
The purple line is to do the bias flip, if the system supports both the host bias and device bias coherency models. Notice also that the host in these pictures has a coherency bridge, and the device has a DCOH, a device coherency engine: a simplified home agent and coherency bridge that are on the device instead of the host.
What is CXL's relationship with PCIe 5? The CXL physical connector is the same as PCIe's, and CXL can run in a PCIe 5 slot. If either the host or the device is PCIe, they will operate as PCIe. If both are CXL, they will negotiate CXL and operate with that protocol. In PCIe 5, the training sequences TS1 and TS2 have an additional field called "alternate protocol." The first, and so far only, alternate protocol defined is CXL. If both devices claim to support CXL, there are other fields negotiated between the host and device that define CXL parameters. The physical cables and connectors are PCIe 5. The logical sub-block in the physical layer is called the flex bus, and it can operate as PCIe or CXL.
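The negotiation outcome described above can be sketched as a simple decision: the link comes up as CXL only when both sides advertise it during training, and otherwise falls back to plain PCIe. This is a conceptual sketch only; the function and parameter names are illustrative, not the TS1/TS2 bit layout from the PCIe 5 specification.

```python
def negotiate_link(host_supports_cxl: bool, device_supports_cxl: bool) -> str:
    """Return the protocol the flex bus logical sub-block will run."""
    if host_supports_cxl and device_supports_cxl:
        # Both sides advertised CXL in the alternate protocol field.
        return "CXL"
    # Either side is PCIe-only: the link operates as plain PCIe.
    return "PCIe"
```

For example, `negotiate_link(True, False)` yields `"PCIe"`, matching the rule that a single PCIe-only partner forces PCIe operation.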
A new block called the ARB/MUX is added to arbitrate which request gets serviced first. CXL is defined as PCIe Gen 5, 16 lanes wide. If you drop down to eight lanes or four lanes at Gen 5, the specification calls it "bifurcation." If you drop below Gen 5 x4 in either speed or lanes, it is called a "degraded mode." Now, the link layer and flits: all CXL transfers are 528-bit flits, made up of four slots of 16 bytes each, plus two bytes of CRC. Slot zero contains the header; the other three are called generic slots.
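The flit arithmetic above checks out: four 16-byte slots plus a 2-byte CRC is 66 bytes, which is the 528-bit flit. A small sketch, with an illustrative helper (not from the CXL specification) that splits a raw flit into its header slot, generic slots and CRC:

```python
SLOTS_PER_FLIT = 4
SLOT_BYTES = 16
CRC_BYTES = 2

FLIT_BYTES = SLOTS_PER_FLIT * SLOT_BYTES + CRC_BYTES  # 66 bytes
assert FLIT_BYTES * 8 == 528  # the 528-bit flit from the talk

def split_flit(flit: bytes):
    """Split a raw 66-byte flit into (header slot, generic slots, crc)."""
    assert len(flit) == FLIT_BYTES
    slots = [flit[i * SLOT_BYTES:(i + 1) * SLOT_BYTES]
             for i in range(SLOTS_PER_FLIT)]
    crc = flit[-CRC_BYTES:]
    return slots[0], slots[1:], crc  # slot zero is the header
```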
Let's see what they contain. The flit header contains information such as the type: is this a protocol flit transferring data, commands or status, or a control flit? ACK means the sender is acknowledging eight flits that it has received from the other device.
BE — byte enable — is this flit accessing memory or cache on a byte level or on a slot level? Slot: what kind of information is in slots zero, one, two and three? There are six format types for header slots, H0 through H5, and seven format types for generic slots, G0 through G6. The slot encoding identifies whether their contents are, for instance, cache requests, cache responses, memory requests, memory headers or data. CRD — credit — how many credits are being sent? Each credit granted allows one transfer, regardless of size. We will see the byte enable on the next slide.
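The credit rule above — each granted credit allows one transfer, regardless of size — can be sketched as a tiny flow control model. This shows the concept only, not CXL's actual per-channel credit accounting; the class name is illustrative.

```python
class CreditedSender:
    """Sender that may only transmit when it holds credits."""

    def __init__(self):
        self.credits = 0

    def receive_credits(self, count):
        # A CRD field from the receiver grants more credits.
        self.credits += count

    def try_send(self, payload):
        if self.credits == 0:
            return False  # no credits: must wait before sending
        self.credits -= 1  # one credit per transfer, whatever its size
        return True
```

Whether `try_send` carries one byte or a full slot, it always costs exactly one credit, which is why the receiver can size its buffers by counting the credits it has handed out.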
This slide shows two formats of data flits: the one on the left is used for byte updates, and the one on the right is used for updating 16 bytes of data. We covered the why and how for CXL; Types 1, 2 and 3; the three protocols of CXL.mem, CXL.cache and CXL.io; host bias coherency and device bias coherency; host-managed memory and device-managed memory; and the PCIe alternate protocol, normal, bifurcated and degraded modes, and flits. That is a lot of information. As this is recorded, you can go back and review the entire lesson or any specific part of it. I hope this has prepared you for a very beneficial Flash Memory Summit. This chart shows a comparison of some new interfaces. Thank you.