Reading notes on "Introduction to InfiniBand for End Users".


Original document

1. Basic Concepts

1.1. InfiniBand vs traditional networking

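Traditional computer networking

  • “network centric” view: the focus is on the hardware and its management
  • Hardware resources shared by applications are managed centrally by the operating system, so applications have no direct access
    • During network communication, each packet is copied 2 to 3 times: I/O -> anonymous buffer pool -> application’s virtual memory space
  • byte stream-oriented: data travels as a byte stream, so the receiver may need several system calls to read one complete message after it has reached the peer
  • A data center network may consist of three separate networks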
    • one for networking
    • one for storage
    • one for IPC

InfiniBand

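  • “application centric” view: provides a messaging service to applications, with the focus on making communication between applications simple and efficient
  • direct access, stack bypass: communication does not have to go through the operating system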
    • avoiding operating system calls
    • avoiding unnecessary buffer copies
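  • message-oriented: messages are the unit of transfer, and a single message can be up to $2^{31}$ bytes
    • The sending InfiniBand hardware automatically segments an outbound message into packets and delivers them into the receiving application's virtual buffer
    • The application is notified only once the receiver has the complete message
    • As a result, neither the sending nor the receiving application takes part while the message is in transit
  • InfiniBand handles storage, networking, and IPC I/O uniformly, so a single underlying network is enough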
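The message-oriented behaviour described above can be sketched as a toy model. This is only an illustration, not how the hardware is implemented: the names `Packet`, `segment`, and `Receiver` and the 4096-byte payload size are assumptions made for the example. The sender splits one message into packets, and the receiver reports a completion only after the last packet has arrived.

```python
from __future__ import annotations

from dataclasses import dataclass

MTU = 4096  # assumed packet payload size in bytes, for illustration only


@dataclass
class Packet:
    seq: int        # position of this packet within the message
    last: bool      # True for the final packet of the message
    payload: bytes


def segment(message: bytes, mtu: int = MTU) -> list[Packet]:
    """Split one message into packets, standing in for the sending hardware."""
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)] or [b""]
    return [Packet(seq=i, last=(i == len(chunks) - 1), payload=chunk)
            for i, chunk in enumerate(chunks)]


class Receiver:
    """Collects packets and reports a message only once it is complete."""

    def __init__(self) -> None:
        self.buffer = bytearray()
        self.completions: list[bytes] = []  # what the application gets to see

    def on_packet(self, pkt: Packet) -> None:
        self.buffer += pkt.payload
        if pkt.last:
            # Only now is the application told that a message has arrived.
            self.completions.append(bytes(self.buffer))
            self.buffer = bytearray()


message = bytes(10_000)
packets = segment(message)

rx = Receiver()
for pkt in packets[:-1]:
    rx.on_packet(pkt)
notified_early = len(rx.completions)  # no completion before the last packet
rx.on_packet(packets[-1])
```

In a byte-stream model the application would instead see each partial chunk and have to reassemble the message itself.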

1.2. InfiniBand Architecture

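InfiniBand provides its messaging service by creating channels between applications (hence the name Channel I/O), as shown in the figure below:

  • Queue Pairs (QPs): the endpoints of a channel; each QP consists of a Send Queue and a Receive Queue
  • QPs are mapped into the application's virtual memory space, so the application can access them directly
  • InfiniBand provides two transfer semantics for passing messages
    • channel semantic: SEND / RECEIVE
      • The receiver defines the data structure in its own Receive Queue in advance
      • The sender does not need to know the data structure in the receiver's RQ; it only issues a SEND to send the message
    • memory semantic: RDMA READ / RDMA WRITE
      • The receiver registers a buffer in its own virtual memory space and hands control of it to the sender
      • The sender reads or writes that buffer with RDMA READ / RDMA WRITE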
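To make the channel semantic concrete, here is a minimal in-memory sketch. It is not the verbs API: `QueuePair`, `post_recv`, `post_send`, and `process` are hypothetical names for this toy model, and `process` merely stands in for what the hardware would do. The receiver pre-posts a buffer, and the sender queues a message without knowing anything about that buffer.

```python
from collections import deque


class QueuePair:
    """Toy channel endpoint: one Send Queue and one Receive Queue."""

    def __init__(self) -> None:
        self.send_queue: deque = deque()
        self.recv_queue: deque = deque()
        self.peer = None

    def connect(self, other: "QueuePair") -> None:
        # Associate two QPs so that they form one channel.
        self.peer, other.peer = other, self

    def post_recv(self, buffer: bytearray) -> None:
        # Receiver side: pre-post a buffer that a future SEND may land in.
        self.recv_queue.append(buffer)

    def post_send(self, message: bytes) -> None:
        # Sender side: queue a message; no knowledge of the peer's buffers.
        self.send_queue.append(message)

    def process(self) -> list:
        """Stand-in for the hardware: deliver queued SENDs to the peer."""
        delivered = []
        while self.send_queue:
            if not self.peer.recv_queue:
                raise RuntimeError("receiver has no posted receive buffer")
            message = self.send_queue.popleft()
            buffer = self.peer.recv_queue.popleft()
            if len(message) > len(buffer):
                raise ValueError("message larger than posted buffer")
            buffer[:len(message)] = message
            delivered.append((buffer, len(message)))
        return delivered


a, b = QueuePair(), QueuePair()
a.connect(b)

rx_buf = bytearray(16)
b.post_recv(rx_buf)          # receiver prepares first
a.post_send(b"hello")        # sender only names the message
completions = a.process()
```

Each SEND consumes exactly one posted receive buffer, which is why the receiver has to post before the message arrives.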

[Figure: channel]

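The InfiniBand Architecture is shown in the figure below:

  • The software transport interface layer contains the APIs and libraries for creating/managing channels and for passing messages through QPs
  • The InfiniBand transport layer provides reliability and delivery guarantees, similar to the TCP transport
  • The network layer is similar to the IP layer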
  • Link & Physical: wires & switches

[Figure: architecture]

1.3. Hardware Components

To obtain the functionality that the InfiniBand Architecture offers, the following dedicated hardware is required:

  • Host Channel Adapter (HCA)

    • Connects an InfiniBand end node to the InfiniBand network
    • Provides address translation mechanisms that allow an application to access the HCA directly
  • Target Channel Adapter (TCA)

    • A specialized channel adapter for embedded environments
  • Switches

    • Designed to be “cut through” for performance and cost reasons; they implement InfiniBand’s link layer flow control protocol to avoid dropped packets
  • Routers

    • Can divide a large network into multiple subnets, which gives an InfiniBand network scalability
    • Used to connect two InfiniBand subnets that are physically far apart
  • Cables and Connectors

2. InfiniBand for HPC

The characteristics of the InfiniBand architecture bring the following benefits to HPC:

  • Ultra-low latency for
    • Scalability
    • Cluster performance
  • Channel I/O delivers:
    • Scalable storage bandwidth performance
    • Support for shared disk cluster file systems and parallel file systems

3. InfiniBand for the Enterprise

Devoting Server Resources to Application Processing

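  • Virtualization only raises the utilization of server resources
  • Because InfiniBand avoids unnecessary system calls and memory copies, server resources can be devoted fully to applications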

A Flexible Server Architecture

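  • Existing servers pre-allocate I/O bandwidth in fixed proportions between the storage fabric (HBA) and the Ethernet network (NIC), which may not match what the applications need
  • InfiniBand handles storage, networking, and IPC uniformly, so I/O bandwidth does not have to be pre-allocated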

4. Designing with InfiniBand

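As shown in the figure below, applications use verbs to place work requests on work queues (that is, the QPs).

Verbs are only a specification of how applications use and manage InfiniBand services; they are not a concrete API.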

[Figure: access_level]

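The software needed to implement InfiniBand falls roughly into the following three categories; all of the source code is available from OFED.

  • Upper Layer Protocols (ULPs) and associated libraries
  • mid-layer functions: configure and manage the underlying InfiniBand fabric and provide the services that ULPs need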
  • hardware specific device drivers

A ULP has two interfaces:

  • upward-facing interface: called by applications
  • downward-facing interface: uses the underlying InfiniBand messaging service through QPs

OFED provides a variety of ULPs to meet the differing needs of applications:

[Figure: ulps]

  • SDP: Sockets Direct Protocol. This ULP allows a sockets application to take advantage of an InfiniBand network with no change to the application
  • SRP: SCSI RDMA Protocol. This allows a SCSI file system to directly connect to a remote block storage chassis using RDMA semantics. Again, there is no impact to the file system itself.
  • iSER: iSCSI Extensions for RDMA. iSCSI is a protocol allowing a block storage file system to access a block storage device over a generic network. iSER allows the user to operate the iSCSI protocol over an RDMA capable network.
  • IPoIB: IP over InfiniBand. This important part of the suite of ULPs allows an application hosted in, for example, an InfiniBand-based network to communicate with other sources outside the InfiniBand network using standard IP semantics. Although often used to transport TCP/IP over an InfiniBand network, the IPoIB ULP can be used to transport any of the suite of IP protocols including UDP, SCTP and others.
  • NFS-RDMA: Network File System over RDMA. NFS is a well-known and widely-deployed file system providing file level I/O (as opposed to block level I/O) over a conventional TCP/IP network. This enables easy file sharing. NFS-RDMA extends the protocol and enables it to take full advantage of the high bandwidth and parallelism provided naturally by InfiniBand.
  • Lustre support: Lustre is a parallel file system enabling, for example, a set of clients hosted on a number of servers to access the data store in parallel. It does this by taking advantage of InfiniBand’s Channel I/O architecture, allowing each client to establish an independent, protected channel between itself and the Lustre Metadata Servers (MDS) and associated Object Storage Servers and Targets (OSS, OST).
  • RDS: Reliable Datagram Sockets offers a Berkeley sockets API allowing messages to be sent to multiple destinations from a single socket. This ULP, originally developed by Oracle, is ideally designed to allow database systems to take full advantage of the parallelism and low latency characteristics of InfiniBand.
  • MPI: The MPI ULP for HPC clusters provides full support for MPI function calls.

5. InfiniBand Architecture and Features

5.1. Address Translation

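  • The HCA occupies a range of physical addresses. An application requests the necessary address translations through memory registration, and the HCA then uses its address translation tables to perform the required virtual-to-physical address translation
  • I/O channels are created by the HCA, and a QP can be thought of as the application's interface to the HCA; a single HCA can support up to $2^{24}$ QPs

Because the HCA provides address translation, two core InfiniBand capabilities become possible:

  • User-mode access to the InfiniBand messaging service
  • Using the InfiniBand messaging service to “directly” access the virtual address spaces of other applications
    • RDMA Read / RDMA Write operations are performed with a key and a virtual address supplied by the peer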
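The role of the key and virtual address can be illustrated with a small model. Everything here is an assumption made for the example: `ToyHCA` and its methods are hypothetical, the "virtual address" is just an integer label, and a real adapter translates to physical pages rather than indexing a Python buffer. The sketch only shows that an RDMA access names a registered region by key and is checked against that region's bounds.

```python
import secrets


class ToyHCA:
    """Hypothetical model of an adapter holding registered memory regions."""

    def __init__(self) -> None:
        self._regions = {}  # key -> (base virtual address, backing buffer)

    def register_memory(self, base_addr: int, buffer: bytearray) -> int:
        """Register a buffer and return the key to hand to the peer."""
        key = secrets.randbits(32)
        while key in self._regions:
            key = secrets.randbits(32)
        self._regions[key] = (base_addr, buffer)
        return key

    def _translate(self, key: int, addr: int, length: int):
        # Look up the region by key, then check the access stays inside it.
        if key not in self._regions:
            raise PermissionError("unknown key")
        base, buffer = self._regions[key]
        offset = addr - base
        if offset < 0 or offset + length > len(buffer):
            raise PermissionError("access outside registered region")
        return buffer, offset

    def rdma_write(self, key: int, addr: int, data: bytes) -> None:
        buffer, offset = self._translate(key, addr, len(data))
        buffer[offset:offset + len(data)] = data

    def rdma_read(self, key: int, addr: int, length: int) -> bytes:
        buffer, offset = self._translate(key, addr, length)
        return bytes(buffer[offset:offset + length])


hca = ToyHCA()
app_buffer = bytearray(64)
BASE = 0x7F00_0000_1000                    # made-up virtual address of the buffer
key = hca.register_memory(BASE, app_buffer)

# The peer, holding (key, address), writes and reads without involving the owner.
hca.rdma_write(key, BASE + 8, b"bulk")
data = hca.rdma_read(key, BASE + 8, 4)
```

The owning application never runs any code during the write or the read, which is the sense in which the access is "direct".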

5.2. The InfiniBand Transport

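The InfiniBand transport provides the following message transport services:

  • Channel semantic operations: a reliable or an unreliable SEND/RECEIVE service, analogous to TCP/UDP
    • Reliability is implemented by transport-level recovery together with notification of the application
    • SEND/RECEIVE operations are commonly used to transfer short control messages
    • Execution flow:
      1. The receiving application uses the Post Receive Request verb to place WRs on the RQ; each RECEIVE WR represents a buffer in the application's virtual memory space
      2. The sending application uses the Post Send Request verb to place WRs on the SQ; each SEND WR represents one message, and the target of the SEND operation is the buffer associated with a RECEIVE WR on the receiver
  • Memory semantic operations: RDMA Read and RDMA Write service
    • Commonly used to transfer bulk data
  • Atomic Operations
  • Multicast services
  • InfiniBand link layer lossless flow control: the hardware tracks buffer usage on the receiving side and transmits only when the data can be accommodated, so nothing is dropped
  • TCP lossy flow control: the sender does not check beforehand whether downstream can accommodate the packets, so packets may be dropped; in fact, TCP uses packet loss as a signal of downstream conditions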
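The difference between the two flow control styles can be sketched with two toy links. This is a simplified illustration under stated assumptions: the classes `CreditedLink` and `LossyLink` are hypothetical, and modelling InfiniBand's link layer as a single credit counter (one credit per free receive slot) leaves out details of the real protocol. With credits, a packet that would not fit is never transmitted; without them, it is transmitted and dropped.

```python
from collections import deque


class CreditedLink:
    """Toy lossless link: the sender transmits only while it holds credits."""

    def __init__(self, receiver_slots: int) -> None:
        self.credits = receiver_slots        # free slots advertised by the receiver
        self.receiver_buffer: deque = deque()

    def try_send(self, packet: bytes) -> bool:
        if self.credits == 0:
            return False                     # packet stays at the sender, not lost
        self.credits -= 1
        self.receiver_buffer.append(packet)
        return True

    def receiver_consume(self) -> bytes:
        packet = self.receiver_buffer.popleft()
        self.credits += 1                    # the freed slot is advertised back
        return packet


class LossyLink:
    """Toy lossy link: the sender transmits blindly; a full buffer drops."""

    def __init__(self, receiver_slots: int) -> None:
        self.slots = receiver_slots
        self.receiver_buffer: deque = deque()
        self.dropped = 0

    def try_send(self, packet: bytes) -> bool:
        if len(self.receiver_buffer) >= self.slots:
            self.dropped += 1                # already on the wire, so it is lost
            return False
        self.receiver_buffer.append(packet)
        return True


packets = [f"p{i}".encode() for i in range(3)]

credited = CreditedLink(receiver_slots=2)
first_pass = [credited.try_send(p) for p in packets]   # third packet is held back
credited.receiver_consume()                            # receiver frees one slot
retry_ok = credited.try_send(packets[2])               # now it can be sent

lossy = LossyLink(receiver_slots=2)
for p in packets:
    lossy.try_send(p)
```

In the lossy case the sender learns about the problem only after the fact, which matches the note above that TCP treats loss as a signal.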

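5.4. Management and Services

In contrast to an autonomous Ethernet fabric, InfiniBand is centrally managed. Its management falls into the following areas:

  • Subnet Management (SM) and Subnet Administration (SA): discover, initialize, and maintain the InfiniBand fabric
  • Communication management: associates a pair of QPs in order to create a channel between two applications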
  • Performance management
  • Device management
  • Baseboard management
  • SNMP tunneling
  • Vendor-specific class
  • Application specific classes

6. Achieving an Interoperable Solution

This chapter can be skipped.

7. InfiniBand Performance Capabilities and Examples

The document dates from 2010, so the figures below may be out of date.

throughput

  • server to server: $40 Gb/s$
  • switch to switch: $120 Gb/s$

latency

  • application: $1 \mu s$
  • switch: $100 ns$ to $150 ns$
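To put the figures above in perspective, here is a back-of-the-envelope estimate. It is an idealized sketch: the function `transfer_time` is made up for this example, and it assumes the full 40 Gb/s is available as payload bandwidth, ignoring encoding overhead, packet headers, and switch hops.

```python
LINK_RATE_BPS = 40e9   # server-to-server rate quoted above, in bits per second
APP_LATENCY_S = 1e-6   # application latency quoted above, in seconds


def transfer_time(message_bytes: int,
                  rate_bps: float = LINK_RATE_BPS,
                  latency_s: float = APP_LATENCY_S) -> float:
    """Idealized time: fixed latency plus serialization at the full link rate."""
    return latency_s + message_bytes * 8 / rate_bps


small = transfer_time(64)       # dominated by latency, roughly 1 microsecond
large = transfer_time(2**31)    # dominated by bandwidth, roughly 0.43 seconds
```

Short control messages are therefore limited by latency, while a maximum-size $2^{31}$-byte message is limited almost entirely by link bandwidth.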

For current figures, see the InfiniBand Roadmap.