P4-TAS: P4-Based Time-Aware Shaper for Time-Sensitive Networking · 2025-11-13
Time-critical applications in industrial automation and automotive systems rely on networks that provide deterministic guarantees such as low latency, minimal jitter, and virtually zero packet loss. To meet these stringent requirements, two complementary technologies have emerged: Time-Sensitive Networking (TSN) and Deterministic Networking (DetNet). TSN is a suite of IEEE 802.1 standards that enhances Ethernet to support real-time communication by introducing mechanisms for traffic shaping [ 1, 2, 3 ] and reliability [ 4 ]. In contrast, DetNet is a Layer 3 technology standardized by the IETF that extends these capabilities to routed networks by enabling bounded latency and high reliability across multiple IP hops [ 5 ].
Scheduled traffic is a concept in TSN where transmission times of talkers are coordinated to avoid queuing in intermediate nodes so that frames traverse the network with minimal delay. This coordination is called scheduling and yields a network-wide schedule that ensures deterministic forwarding. TSN offers several traffic shaping mechanisms, such as the Credit-Based Shaper (CBS), the Asynchronous Traffic Shaper (ATS), and the Time-Aware Shaper (TAS). Among them, the TAS stands out by leveraging a Time Division Multiple Access (TDMA)-like approach to protect scheduled traffic from interference, e.g., by best-effort flows, thereby ensuring low latency and bounded delay. This is achieved through transmission gates which periodically open and close queues. Further, Per-Stream Filtering and Policing (PSFP) is a TSN mechanism that combines rate-based and time-based policing to drop out-of-schedule frames.
DetNet leverages TSN concepts in combination with technologies such as MPLS or IP. In DetNet deployments, TSN can be employed as a sub-layer to provide deterministic forwarding in subnetworks. In practice, TSN implementations are typically hardware-based and optimized for bandwidths up to 1 Gb/s. However, DetNet targets broader use cases, including high-throughput applications and data center backbones. This hierarchical composition allows for scalable designs that combine the high-speed capabilities of DetNet with the precise timing mechanisms of TSN. The integration of both technologies enables sophisticated traffic shaping and reliability mechanisms across backbone infrastructure.
The contribution of this paper is manifold. We present P4-TAS, a P4-based implementation of the TAS on the Intel Tofino™ 2 switching ASIC that enables TSN-compliant shaping and policing in the data plane. Our design introduces a novel mechanism for periodic queue control using a continuous stream of internally generated TAS control frames. It builds upon a mechanism from our prior P4-PSFP work [ 6 ] which uses the internal packet generator as a clock source. P4-TAS also incorporates PSFP and includes an MPLS/TSN translation layer [ 7 ], enabling TSN traffic shaping to be applied to DetNet flows at line rates up to 400 Gb/s. A key contribution of this work is the identification and quantification of internal processing delays that affect scheduling precision. Such delays are typically undocumented in commercial TSN-capable switches but are crucial for accurate traffic scheduling [ 8, 9 ]. Our implementation reveals multiple delay sources within the data plane and provides corresponding measurements on a nanosecond scale. We demonstrate that our approach achieves internal delays orders of magnitude smaller than those reported for some commercial platforms [ 9 ], while making these delays transparent. Finally, we evaluate the scalability of P4-TAS and compare it to available TAS implementations.
The rest of the paper is structured as follows. In Section II, we provide background information on TSN, DetNet, and the P4 programming language. In Section III, we review related work on combining TSN and DetNet systems, and on simulations and hardware implementations of those technologies. Section IV introduces the P4 implementation, including our system architecture and the P4-TAS mechanism. In Section V, we evaluate internal delays, measure the transmission gate accuracy externally, analyze the scalability, and compare P4-TAS to other implementations. Finally, in Section VI, we conclude the paper.
In this section, we provide technical background on TSN, DetNet, and the P4 programming language.
We first give a brief overview of TSN, explain scheduled traffic, and then summarize the concepts of the TAS and PSFP.
TSN is a suite of IEEE 802.1 standards that augment traditional Ethernet to support deterministic communication with strict Quality of Service (QoS) guarantees. TSN networks are built from interconnected bridges and end stations. A data flow, referred to as a TSN stream, originates from a talker (sending station) and is directed to one or more listeners (receiving stations). A TSN stream is identified based on its VLAN tag, its Layer 2 destination address, and optionally other header fields [ 4 ]. Before a stream is allowed to transmit, it has to undergo admission control [ 4 ]. This process involves the talker advertising its traffic characteristics, e.g., latency requirements, through a stream descriptor. The network then decides whether to admit the stream by evaluating resource availability and making reservations accordingly.
In TSN, streams can be scheduled, i.e., their sending times at talkers are coordinated such that frames experience minimal delay at intermediate bridges. This coordination is computed offline and yields a network-wide schedule. The calculation of such schedules is outside the scope of this work. More information on scheduling in TSN can be found in a survey by Stüber et al. [ 10 ]. Time synchronization on a sub-microsecond scale is critical for scheduling in TSN. For that purpose, protocols like the Precision Time Protocol (PTP) are employed [ 11 ].
Scheduled streams in TSN are typically assigned the highest priority and must be protected from lower-priority traffic, e.g., best-effort traffic. This ensures that scheduled frames reach each intermediate node at their scheduled times. TAS and PSFP are mechanisms to protect scheduled traffic using gating mechanisms. Both are illustrated in Figure 1 and explained in the following.
The TAS, standardized in IEEE Std 802.1Qbv [ 1 ], provides time-based shaping at the egress. Each egress port provides eight FIFO queues associated with frame priorities from the VLAN tag [ 12 ]. These queues are controlled by transmission gates, which in turn are driven by a gate control list (GCL). A GCL is a periodic sequence of entries, each specifying a time slice and a corresponding gate state. In the TAS, we call this the transmission GCL (tGCL). Each tGCL entry specifies a duration and an eight-bit vector indicating which of the eight transmission gates are open or closed. Frames in queues with an open transmission gate are transmitted in FIFO order while frames in queues with a closed transmission gate remain buffered. After all entries have been processed, the sequence repeats periodically with a cycle length $h$.
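The periodic tGCL described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the names `TgclEntry` and `open_gates`, the example schedule, and the convention that bit i of the vector stands for queue i are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TgclEntry:
    duration_ns: int  # length of the time slice
    gate_mask: int    # assumed convention: bit i set -> gate of queue i is open


def open_gates(tgcl, t_ns):
    """Return the eight-bit gate vector that is active at time t_ns."""
    cycle_ns = sum(entry.duration_ns for entry in tgcl)  # cycle length h
    offset = t_ns % cycle_ns                             # position inside the cycle
    for entry in tgcl:
        if offset < entry.duration_ns:
            return entry.gate_mask
        offset -= entry.duration_ns
    raise AssertionError("unreachable: offset always lies inside the cycle")


# Hypothetical schedule: 100 us exclusively for queue 7, then 400 us for queues 0-6.
tgcl = [TgclEntry(100_000, 0b1000_0000), TgclEntry(400_000, 0b0111_1111)]
```

Calling `open_gates(tgcl, t)` for increasing `t` walks through the entries and wraps around after one cycle, which is the behavior the tGCL is meant to express.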
PSFP, standardized in IEEE Std 802.1Qci [ 3 ], enforces per-stream conformance at the ingress by combining rate policing with time-based policing. In this way, PSFP ensures adherence to the resource bounds established by admission control. While rate policing is a well-known mechanism, time-based policing targets scheduled traffic and is the focus of this work. For time-based policing, each stream is associated with a stream gate controlled by a periodic GCL which we call the stream GCL (sGCL). The sGCL defines the gate state over time and thereby the admitted transmission windows of the stream. Frames arriving outside their admitted window are dropped immediately, i.e., before queuing, preventing them from consuming reserved resources.
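Time-based policing with an sGCL can be sketched in the same spirit. The following Python fragment is a simplified model under stated assumptions: the function names, the stream IDs, and the window lengths are hypothetical, and rate policing is left out entirely.

```python
def stream_gate_open(sgcl, t_ns):
    """sgcl: list of (duration_ns, is_open) entries that repeat periodically."""
    cycle_ns = sum(duration for duration, _ in sgcl)
    offset = t_ns % cycle_ns
    for duration_ns, is_open in sgcl:
        if offset < duration_ns:
            return is_open
        offset -= duration_ns
    raise AssertionError("unreachable: offset always lies inside the cycle")


def police(sgcl_by_stream, stream_id, arrival_ns):
    """Decide before queuing whether a frame is forwarded or dropped."""
    if stream_gate_open(sgcl_by_stream[stream_id], arrival_ns):
        return "forward"
    return "drop"


# Hypothetical sGCLs with a 100 us cycle each.
sgcls = {
    1: [(20_000, True), (80_000, False)],
    2: [(50_000, False), (20_000, True), (30_000, False)],
}
```

A frame of stream 1 arriving 10 us into the cycle is forwarded, while the same frame arriving at 20 us falls outside its admitted window and is dropped.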
In Figure 1, two streams enter the TSN bridge. Based on their sGCLs, the first stream gate is open while the second is closed. Accordingly, frames of stream 1 are forwarded while frames of stream 2 are dropped by PSFP. Both streams then share the same egress port queues which are controlled by the tGCL. Here, only the first transmission gate is open, so only frames stored in queue 1 are transmitted.
Stream gates in PSFP differ from transmission gates in TAS in three ways. First, stream gates apply per stream whereas transmission gates apply per egress port and queue. Second, an sGCL entry defines the state of a single stream gate while a tGCL entry defines the states of all eight queues. Third, closed stream gates drop frames before queueing while closed transmission gates buffer frames in the queue.
The DetNet architecture enables real-time applications with extremely low packet loss rates and a bounded latency [ 5 ]. It is standardized by the IETF DetNet working group. DetNet operates at the network layer, e.g., IP, and delivers its QoS and reliability services over lower-layer technologies, e.g., MPLS and TSN. DetNet is applicable to networks under a single administrative control, e.g., to private WANs or campus-wide networks.
The bounded latency in DetNet is achieved by eliminating packet loss resulting from queue congestion within a node. For that purpose, bandwidth and buffer resources are reserved at each node. Resource reservations can be made using the Resource reServation Protocol (RSVP). For traffic engineering within DetNet, mechanisms defined by the IEEE 802.1 working group such as the TAS are applicable.
The DetNet architecture separates the data plane functions into two sub-layers. First, the service sub-layer provides DetNet QoS mechanisms such as bounded latency and service protection, e.g., by adding sequence number information to packets. Second, the forwarding sub-layer provides connectivity between DetNet service sub-layer processing nodes [ 13 ]. Various data plane technologies for DetNet exist, e.g., DetNet over MPLS [ 13 ] and DetNet over IP [ 14 ]. With DetNet over MPLS, the forwarding and service sub-layers are identified by MPLS labels, called the Forward label (F-Label) and the Service label (S-Label). One or more F-Labels are used to forward the packet through the DetNet domain. The S-Label follows the F-Labels and is used to identify the DetNet flow. Based on the identified DetNet flow, QoS mechanisms are applied. Further, a DetNet control word (d-CW) follows the MPLS stack. This control word contains a sequence number for protection mechanisms of DetNet.
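The label layout described above can be made concrete with a small parsing sketch in Python. It assumes the standard 32-bit MPLS label stack entry (20-bit label, 3-bit traffic class, bottom-of-stack bit, 8-bit TTL) and a d-CW consisting of four zero bits followed by a 28-bit sequence number; the function names and label values are hypothetical.

```python
import struct


def encode_lse(label, bottom, tc=0, ttl=64):
    """Build one 32-bit MPLS label stack entry (helper for the example)."""
    return struct.pack("!I", (label << 12) | (tc << 9) | (int(bottom) << 8) | ttl)


def parse_detnet_mpls(data: bytes):
    """Split the bytes after the Ethernet header into F-Labels, S-Label, d-CW."""
    labels = []
    pos = 0
    while True:
        (entry,) = struct.unpack_from("!I", data, pos)
        pos += 4
        labels.append(entry >> 12)   # upper 20 bits carry the label value
        if (entry >> 8) & 0x1:        # bottom-of-stack bit ends the stack
            break
    (dcw,) = struct.unpack_from("!I", data, pos)
    return {
        "f_labels": labels[:-1],                # labels preceding the S-Label
        "s_label": labels[-1],                  # last label identifies the flow
        "sequence_number": dcw & 0x0FFF_FFFF,   # lower 28 bits of the d-CW
    }


# Hypothetical packet: two F-Labels, one S-Label, d-CW with sequence number 42.
packet = (encode_lse(1000, False) + encode_lse(2000, False)
          + encode_lse(3000, True) + struct.pack("!I", 42))
```

Treating the bottom-of-stack label as the S-Label is a simplification that matches the ordering described in the text; a real node would look up the label in its configured label space.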
Standards exist that interconnect TSN networks using the DetNet MPLS data plane [ 15, 7 ]. For DetNet MPLS over TSN, DetNet flows are identified based on the S-Label at the DetNet / TSN domain border and are translated into TSN streams. For that purpose, IEEE Std 802.1CBdb [ 16 ] defines an MPLS DetNet flow identification which identifies the S-Label and pushes a new VLAN ID. Then, TSN stream identification is applied based on the new VLAN ID. With those interconnected data planes, TSN services such as the TAS and PSFP can be applied to DetNet flows.
Programming Protocol-independent Packet Processors (P4) is a domain-specific programming language to implement custom data planes in P4-programmable switches [ 17 ]. A P4 program can manipulate packets and make forwarding decisions to implement custom algorithms. In the following, we describe the concepts of the P4 pipeline, the packet generator, and a feature called advanced flow control (AFC). A survey by Hauser et al. provides more information on P4 [ 18 ].
P4-programmable switches are called targets and implement a specific architecture. The Intel Tofino™ 2 switching ASIC is a hardware-based P4 target. Typically, a P4 architecture follows a pipelined structure. The pipeline of the Tofino Native Architecture (TNA), the architecture used by the Intel Tofino™, is illustrated in Figure 2.
The TNA consists of an ingress block and an egress block, each with a programmable parser, control blocks, and a deparser. After processing frames in the ingress control block, frames are queued in the traffic manager component of the TNA. This component is configurable but not programmable [ 19 ].
Control blocks in a P4 program define the logic of the algorithm. They leverage metadata for packet processing. A P4 program defines two different types of metadata. First, user-defined metadata stores information during the pipeline processing. Second, intrinsic metadata contains information given by the architecture, e.g., the ingress timestamp of a frame and the ingress port. Control blocks are composed of match+action tables (MATs). The concept of a MAT is illustrated in Figure 3 [ 20 ] and explained in the following.
In a MAT, selected packet header fields and metadata form a composite key. Each packet is matched in the MAT according to the selected key fields. On a match in the table, an associated action is executed which can manipulate packet data or make a forwarding decision. The data plane defines the structure of a MAT, i.e., the key fields, and the actions. However, the content of these MATs is filled by the control plane. Further, registers are a commonly used feature in P4 that allow for stateful processing of packets.
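The split between table structure and table content can be mimicked in a few lines of Python. This is only an analogy, not P4 code: the class `MatchActionTable`, the actions, and the example entry are hypothetical, and only exact matching is modeled.

```python
class MatchActionTable:
    def __init__(self, key_fields, default_action):
        self.key_fields = key_fields        # structure: fixed by the data plane
        self.default_action = default_action
        self.entries = {}                   # content: filled by the control plane

    def add_entry(self, key, action, **params):
        """Control-plane side: install an entry for an exact-match key."""
        self.entries[key] = (action, params)

    def apply(self, packet):
        """Data-plane side: build the composite key and run the matched action."""
        key = tuple(packet[field] for field in self.key_fields)
        action, params = self.entries.get(key, (self.default_action, {}))
        return action(packet, **params)


def set_egress(packet, port):
    return {**packet, "egress_port": port}


def drop(packet):
    return None


table = MatchActionTable(("vlan_id", "dst_mac"), default_action=drop)
table.add_entry((100, "aa:bb:cc:dd:ee:ff"), set_egress, port=3)
```

A packet whose VLAN ID and destination MAC match the installed entry is sent to port 3; any other packet hits the default action and is dropped.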
P4 control blocks support logical and simple arithmetic expressions but, to maintain line-rate processing, do not support loops. To enable iterative algorithms, packets can be recirculated. Modified headers from the first pass are available in the second. Recirculation introduces delay and requires dedicated ports. Architectures like the TNA offer internal recirculation ports, or can provision physical ports for recirculation.
The Intel Tofino™ natively supports time synchronization using PTP [ 11, 19 ]. Further, Kannan et al. [ 21 ] propose a data plane implementation of PTP which can be leveraged to achieve high-precision time-synchronization.
The TNA provides an internal packet generator which can be configured to generate packets through a dedicated internal port. Generated packets are processed in the pipeline. Multiple applications with different triggers, such as a periodic trigger, can be configured to trigger packet generation. Further, the packet generator can be configured to generate $B$ batches with a batch size of $K$ packets each to enable packet bursts. A generated packet contains a packet generation header added by the traffic generator. This packet generation header identifies the application, the batch number, and the packet number in the batch [ 19 ].
A feature specific to the Intel Tofino™ 2 is advanced flow control (AFC) which enables control over the queues, i.e., dispatching or holding back frames, of an egress port in the traffic manager. The queue state is manipulated by writing an AFC value into a packet’s intrinsic metadata during pipeline processing. As this operation must be triggered by an incoming packet, each queue state change is initiated by packet arrival. A single packet can control exactly one queue. The AFC value is computed based on the egress port, queue ID, and the desired queue state. Importantly, the controlled queue does not need to correspond to the egress port or queue assigned to the processed packet itself.
In this section, we review related work on the combination of TSN and DetNet systems, and on simulations and hardware implementations of those technologies.
The integration of TSN and DetNet has received considerable attention in recent years due to their critical role in facilitating ultra-low latency communication in 5G networks. Nasrallah et al. [ 22 ] provide a comprehensive overview of TSN and DetNet technologies, emphasizing their importance for time-critical applications in 5G environments. Building on this foundation, Abuibaid et al. [ 23 ] conduct a case study that measures the performance of TSN and DetNet in a practical 5G setting. Furthermore, Wüsteney et al. [ 24 ] propose a latency model for time-sensitive communication traversing networks that integrate TSN and DetNet. Menendez et al. [ 25 ] present a software-based implementation of the TAS using XDP and eBPF. Further, they integrate TSN functionality into DetNet environments with an MPLS over UDP/IP data plane. While their open-source implementation represents a significant step towards TSN / DetNet integration, their evaluation does not consider internal timing behavior and only measures traffic rates up to 600 [unit missing in source].
Despite these advances, challenges remain in realizing efficient hardware implementations that integrate TSN and DetNet functionalities while also identifying and quantifying internal timing behavior. This work aims to address these challenges.
Numerous simulation frameworks have been developed to model TSN [ 26, 27 ], DetNet [ 28 ], or both [ 29 ]. In particular, Addanki et al. [ 29 ] offer a simulator that integrates building blocks for DetNet at the network layer and TSN at the link layer. Polverini et al. [ 30 ] describe a P4-based DetNet implementation for the BMv2 software target leveraging an SRv6 data plane for reliability. While such simulations are valuable for exploring the interaction between TSN and DetNet, they do not fully address the challenges of real-world deployment. Implementing time-sensitive mechanisms in hardware introduces additional complexity due to resource constraints and timing precision requirements. Ahmed et al. [ 31, 32 ] provide FPGA-based implementations of the CBS and ATS of TSN, while we presented a P4-based hardware implementation of the PSFP mechanism on an ASIC [ 6 ].
Several commercial hardware platforms support TAS and PSFP. NXP’s automotive-grade SJA1105TEL switch ASIC provides eight egress queues per port with a time granularity of 200 ns. This ASIC supports time-gated transmission of up to 1024 flows [ 33, 34 ]. Similarly, Microchip’s SparX-5i [ 35 ] and PD-IES008 [ 36, 37 ] families expose time interval configuration with nanosecond granularity and support up to 10,000 tGCL entries. These platforms demonstrate that TAS is available in hardware, but published information typically stops at high-level feature descriptions such as queue counts, GCL sizes, or time granularity. A summary of these capabilities is provided in Section V-D and compared against P4-TAS in Table II. Although these platforms claim nanosecond configuration granularity, reliable queue updates at this scale are not feasible in practice due to undocumented internal delays and hardware limitations. A recent work by Eppler et al. [ 9 ] quantified such undocumented timing behavior inside commercial TSN switches. Their measurements reveal internal scheduling and gate transition delays on the order of hundreds of nanoseconds to several microseconds, which is significant for schedule synthesis and can lead to missed transmission windows if not accounted for.
In this work, we present a hardware implementation of selected TSN and DetNet mechanisms on a programmable ASIC. Unlike prior academic or commercial platforms, our design enables a transparent evaluation of internal delays, thereby offering deeper insights into their behavior and integration in real systems.
In this section, we describe the implementation of the P4-TAS switch incorporating the PSFP and the TAS mechanism on the Intel Tofino™ 2 switching ASIC. First, we describe the system architecture and integration into DetNet domains. Then, we present the implementation of the TAS mechanism in P4. Finally, we explain improvements to the P4-PSFP implementation. The source code is publicly available on GitHub [ 38 ].
The P4-TAS implementation is designed as an Ethernet switch that provides TSN functionality. It performs TSN stream identification as defined in IEEE Std 802.1CB [ 4 ] and applies traffic shaping with the TAS as well as policing with PSFP. These mechanisms allow P4-TAS to operate natively inside a TSN domain and provide deterministic forwarding for TSN streams. In addition to its role within a pure TSN network, P4-TAS can also act as a border element between a DetNet domain and a TSN domain. In this case, it processes incoming MPLS-encapsulated DetNet flows and translates them into TSN streams based on IEEE Std 802.1CBdb [ 16 ]. This enables DetNet to leverage the TSN sub-layer for scheduling and shaping. The integration scenario is illustrated in Figure 4.
P4-TAS 实现被设计为一个提供 TSN 功能的以太网交换机。它执行 IEEE Std 802.1CB [4] 中定义的 TSN 流识别,并应用基于 TAS 的流量整形以及基于 PSFP 的监管。这些机制使 P4-TAS 能够在 TSN 域内部原生运行,并为 TSN 流提供确定性转发。除了在纯 TSN 网络中的角色之外,P4-TAS 还可以充当 DetNet 域与 TSN 域之间的边界元素。在这种情况下,它处理传入的 MPLS 封装 DetNet 流,并基于 IEEE Std 802.1CBdb [16] 将其转换为 TSN 流。这使 DetNet 能够利用 TSN 子层进行调度和整形。该集成场景如图 4 所示。
IEEE Std 802.1CB、IEEE Std 802.1CBdb、MPLS、DetNet、TSN、TAS、PSFP 均保留;“policing”译为“监管”符合网络 QoS 语境;“border element”译为“边界元素”可接受。未发现明显问题。
At the ingress to the TSN domain (step 1), the P4-TAS switch translates DetNet flows into TSN streams by pushing a VLAN tag based on the S-Label. Afterwards, TSN stream identification is applied using the destination MAC address and the pushed VLAN tag (step 2). Subsequently, the identified TSN stream is subjected to traffic shaping and policing with TAS and PSFP (step 3), and the frame is forwarded through the TSN domain. At the egress (step 4), the VLAN tag is removed to restore the original DetNet flow.
在进入 TSN 域的入口处(步骤 1),P4-TAS 交换机基于 S-Label 压入一个 VLAN 标签,从而将 DetNet 流转换为 TSN 流。随后,使用目的 MAC 地址和所压入的 VLAN 标签来应用 TSN 流识别(步骤 2)。接着,已识别的 TSN 流接受基于 TAS 和 PSFP 的流量整形与监管(步骤 3),并且该帧通过 TSN 域转发。在出口处(步骤 4),移除 VLAN 标签,以恢复原始 DetNet 流。
步骤 1-4 完整;S-Label、VLAN、MAC、TAS、PSFP 保留;“pushing a VLAN tag”译为“压入一个 VLAN 标签”符合封装处理语境;“restore the original DetNet flow”译为“恢复原始 DetNet 流”准确。未发现明显问题。
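The four steps above can be sketched in a few lines of Python. This is only an illustrative model, not the P4 code of P4-TAS: the table contents, the dictionary-based frame representation, and the names `S_LABEL_TO_VLAN`, `STREAM_TABLE`, `ingress_translate`, and `egress_restore` are all assumptions made for the example.

```python
# Hypothetical mapping tables; the values are made up for illustration.
S_LABEL_TO_VLAN = {1001: 10, 1002: 20}            # DetNet S-Label -> VLAN ID
STREAM_TABLE = {("01:00:5e:00:00:01", 10): 1,      # (dst MAC, VLAN ID) -> stream handle
                ("01:00:5e:00:00:02", 20): 2}


def ingress_translate(frame: dict) -> dict:
    """Steps 1 and 2: push a VLAN tag derived from the S-Label, then identify the stream."""
    out = dict(frame)
    out["vlan_id"] = S_LABEL_TO_VLAN[frame["s_label"]]
    # Stream identification uses the destination MAC and the pushed VLAN tag.
    out["stream_handle"] = STREAM_TABLE.get((out["dst_mac"], out["vlan_id"]))
    return out


def egress_restore(frame: dict) -> dict:
    """Step 4: remove the VLAN tag so the original DetNet flow is restored."""
    out = dict(frame)
    out.pop("vlan_id", None)
    out.pop("stream_handle", None)
    return out


detnet_frame = {"dst_mac": "01:00:5e:00:00:01", "s_label": 1001}
tsn_frame = ingress_translate(detnet_frame)   # now carries vlan_id and stream_handle
restored = egress_restore(tsn_frame)          # identical to detnet_frame again
```

Step 3 (shaping with the TAS and policing with PSFP) would operate on the stream handle between these two functions and is deliberately left out of the sketch.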
The TAS defined in IEEE Std 802.1Qbv periodically opens and closes multiple egress queues according to a tGCL. Periodic behavior, such as in a GCL, is not natively supported by P4. Further, queue states on the Intel Tofino™ 2 can be controlled with AFC, but such changes can only be triggered by the arrival of a frame, and each frame can update only a single queue of one egress port. To implement the TAS under these constraints, P4-TAS combines three building blocks: a periodic time model for the tGCL, a dedicated stream of continuous TAS control frames to trigger AFC updates, and a tGCL MAT that maps control frames to queue state changes. They are described in the following. Finally, an overview is given of how they operate together in the pipeline.
IEEE Std 802.1Qbv 中定义的 TAS 根据 tGCL 周期性地打开和关闭多个出口队列。周期性行为,例如 GCL 中的周期性行为,并非 P4 原生支持。此外,Intel Tofino™ 2 上的队列状态可以用 AFC 控制,但这类改变只能由帧的到达触发,并且每个帧只能更新一个出口端口的单个队列。为了在这些约束下实现 TAS,P4-TAS 结合了三个构建模块:用于 tGCL 的周期性时间模型、用于触发 AFC 更新的专用连续 TAS 控制帧流,以及一个将控制帧映射到队列状态变化的 tGCL MAT。下面将对它们进行描述。最后,将概述它们如何在流水线中协同运行。
IEEE Std 802.1Qbv、TAS、tGCL、GCL、P4、Intel Tofino™ 2、AFC、MAT 均保留;“a dedicated stream of continuous TAS control frames”译为“专用连续 TAS 控制帧流”准确;“one egress port”未误译为多个端口。未发现明显问题。
Modeling the periodicity of GCLs in programmable P4 hardware is challenging since periodic behavior is not natively supported. Timestamps in the Intel Tofino™ are absolute, i.e., their values continuously increase, whereas the time slices in GCLs are relative and follow a periodic pattern. Thus, each frame’s absolute timestamp must be mapped to its corresponding position within the current GCL period. While this could be achieved with a modulo operation, such operations are too complex to perform at line rate in the data plane.
在可编程 P4 硬件中建模 GCL 的周期性具有挑战性,因为周期性行为并非原生支持。Intel Tofino™ 中的时间戳是绝对时间戳,即其值会连续增加,而 GCL 中的时间片是相对的,并遵循一种周期性模式。因此,每个帧的绝对时间戳都必须映射到其在当前 GCL 周期内的相应位置。虽然这可以通过取模运算实现,但这类运算过于复杂,无法在数据平面中以线速执行。
“absolute/relative timestamps”译为“绝对/相对时间戳”准确;“modulo operation”译为“取模运算”准确;“line rate”译为“线速”符合术语;逻辑完整。未发现明显问题。
In our previous work [ 6 ], we described an approach to model the periodicity of sGCLs in a P4-based PSFP implementation. In this approach, we leveraged the internal packet generator of the Intel Tofino™ as a clock source. At the end of each GCL cycle, the internal packet generator generates a period-completion frame. The ingress timestamp of this frame is stored in a register and references the timestamp of the last completed period. For all other frames, i.e., non-period-completion frames, the ingress pipeline subtracts this stored value from the frame’s absolute timestamp to obtain a relative timestamp within the ongoing sGCL cycle. In this way, every frame is mapped into a relative time window of one cycle length. The absolute hardware clock of the switch can be used consistently while the sGCL is treated as a repeating list of entries. We leverage this mechanism to implement the periodicity of tGCLs for the TAS. However, unlike sGCLs in PSFP, where each sGCL entry opens or closes a single stream gate, i.e., admits or drops a frame, a tGCL entry must control multiple transmission gates by opening and closing queues. Thus, the periodicity mechanism serves as the basis for the TAS, but additional mechanisms are required to continuously update the gate states of all egress queues during each tGCL entry.
在我们先前的工作 [6] 中,我们描述了一种在基于 P4 的 PSFP 实现中建模 sGCL 周期性的方法。在该方法中,我们利用 Intel Tofino™ 的内部数据包生成器作为时钟源。在每个 GCL 周期结束时,内部数据包生成器会生成一个周期完成帧。该帧的入口时间戳被存储在一个寄存器中,并引用最后一个已完成周期的时间戳。对于所有其他帧,即非周期完成帧,入口流水线从该帧的绝对时间戳中减去这个已存储的值,以获得正在进行的 sGCL 周期内的相对时间戳。通过这种方式,每个帧都被映射到一个周期长度的相对时间窗口中。交换机的绝对硬件时钟可以被一致地使用,同时 sGCL 被视为一个重复的条目列表。我们利用这一机制来实现 TAS 中 tGCL 的周期性。然而,不同于 PSFP 中的 sGCL,在那里每个 sGCL 条目打开或关闭单个流门,即接纳或丢弃一个帧;tGCL 条目必须通过打开和关闭队列来控制多个传输门。因此,该周期性机制作为 TAS 的基础,但还需要额外机制,以便在每个 tGCL 条目期间连续更新所有出口队列的门状态。
sGCL、PSFP、Intel Tofino™、GCL、tGCL、TAS 均保留;“period-completion frame”译为“周期完成帧”一致;“stream gate/transmission gate”分别译为“流门/传输门”,术语区分清楚;因长句中“references the timestamp”译为“引用……时间戳”可接受但略偏直译。未发现明显问题。
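The register-based periodicity mechanism can be illustrated with a small Python model. It is a sketch under stated assumptions, not the data plane code: the class name `PeriodReference` and its methods are hypothetical, and timestamps are plain integers in nanoseconds.

```python
from dataclasses import dataclass


@dataclass
class PeriodReference:
    """Models the register holding the timestamp of the last completed period."""
    last_period_ts: int = 0  # absolute ingress timestamp in ns

    def on_period_completion(self, ingress_ts: int) -> None:
        # A period-completion frame only refreshes the stored reference.
        self.last_period_ts = ingress_ts

    def relative_timestamp(self, ingress_ts: int) -> int:
        # Any other frame: subtract the reference to obtain the offset
        # into the ongoing cycle. No modulo operation is involved.
        return ingress_ts - self.last_period_ts


ref = PeriodReference()
ref.on_period_completion(1_000_000)          # end of a cycle at t = 1,000,000 ns
offset = ref.relative_timestamp(1_000_250)   # frame arrives 250 ns into the next cycle
```

The point of the design is that a subtraction against a periodically refreshed register replaces the modulo operation that would otherwise be needed to fold absolute timestamps into one cycle.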
Queues on the Intel Tofino™ 2 can be opened or closed by processing intrinsic AFC metadata of a frame in the pipeline. The controlled queue does not need to correspond to the egress port or queue assigned to the frame itself. A single frame can control exactly one queue of one port of the switch. In this section, we explain the concept of timely control of all queues which implements the tGCL.
Intel Tofino™ 2 上的队列可以通过在流水线中处理一个帧的内在 AFC 元数据来打开或关闭。被控制的队列不需要对应于该帧自身被分配到的出口端口或队列。单个帧可以精确控制该交换机上一个端口的一个队列。在本节中,我们解释对所有队列进行及时控制的概念,该概念实现了 tGCL。
AFC、Intel Tofino™ 2、tGCL 保留;“intrinsic AFC metadata”译为“内在 AFC 元数据”可能需结合 Intel/P4 术语确认,也可译为“固有 AFC 元数据”;“exactly one queue of one port”译为“一个端口的一个队列”准确;“timely control”译为“及时控制”语义基本准确。未发现明显问题。
To implement tGCL queue state changes with AFC on the Intel Tofino™ 2, each queue state update must be triggered by the arrival of a frame. For this purpose, P4-TAS employs the internal packet generator to continuously produce TAS control frames. These frames are generated in back-to-back batches of eight so that each queue is assigned one frame. Each TAS control frame carries intrinsic metadata with the identifiers of the queue and egress port it controls. Upon arrival, its position in the tGCL cycle is calculated based on the frame’s arrival timestamp as described in Section IV-B1. Then, the position in the tGCL cycle and the intrinsic metadata are matched against the tGCL MAT which specifies whether the corresponding queue should be opened or closed at that point in time.
为了在 Intel Tofino™ 2 上使用 AFC 实现 tGCL 队列状态变更,每一次队列状态更新都必须由一个帧的到达来触发。为此,P4-TAS 采用内部数据包生成器来连续产生 TAS 控制帧。这些帧以连续的、背靠背的 8 帧批次生成,使得每个队列都被分配一个帧。每个 TAS 控制帧都携带内部元数据,其中包含它所控制的队列和出口端口的标识符。到达后,该帧在 tGCL 周期中的位置会基于该帧的到达时间戳计算,具体如第 IV-B1 节所述。然后,将 tGCL 周期中的位置以及内部元数据与 tGCL MAT 进行匹配;该 tGCL MAT 指定在该时间点对应队列应当打开还是关闭。
术语 tGCL、AFC、TAS、MAT、intrinsic metadata 已按技术语境保留或译为“内部元数据”;“back-to-back batches of eight”译为“背靠背的 8 帧批次”,数字无误;逻辑为控制帧到达触发状态更新,未发现明显问题。
The TAS control frames are minimally sized (64 B) and contain no payload beyond intrinsic metadata. They are continuously generated with a minimal inter-arrival time to ensure that queue states follow the configured tGCL precisely. This mechanism does not consume bandwidth for user traffic since the internal packet generator and a dedicated internal port are exclusively used for the TAS control traffic. In practice, there is a short delay between consecutive TAS control frames. Since a queue can only change state when its associated control frame arrives, delayed opening or closing may occur. We evaluate the impact of this behavior in Section V-A3.
TAS 控制帧采用最小尺寸(\qty 64B),并且除内部元数据之外不包含任何载荷。它们以最小到达间隔被连续生成,以确保队列状态精确遵循已配置的 tGCL。该机制不会消耗用户流量的带宽,因为内部数据包生成器和一个专用内部端口被专门用于 TAS 控制流量。在实践中,连续的 TAS 控制帧之间存在一个很短的延迟。由于队列只能在其关联的控制帧到达时改变状态,因此可能发生延迟打开或延迟关闭。我们在第 V-A3 节评估这一行为的影响。
“\qty 64B”疑似 LaTeX 数量宏识别结果,按原符号保留;若排版目标需要,可人工确认是否应呈现为“64 B”。“inter-arrival time”译为“到达间隔”,逻辑无误;状态更新延迟的因果关系保留完整。
The tGCL is encoded as a MAT in the egress pipeline and is shown in Figure 5. TAS control frames from Section IV-B2 are matched against it.
tGCL 在出口流水线中被编码为一个 MAT,并如图 5 所示。第 IV-B2 节中的 TAS 控制帧会与其进行匹配。
“egress pipeline”译为“出口流水线”;“matched against it”指与 tGCL MAT 匹配,指代关系清楚。未发现明显问题。
Each entry in the MAT corresponds to one of the eight queues of a tGCL entry, i.e., eight MAT entries per tGCL entry are required. The entry specifies whether a queue should be currently open or closed. The lookup key is composed of the relative timestamp in the tGCL which is calculated according to Section IV-B1, the queue identifier, and the egress port. The MAT action writes a precomputed AFC value which encodes the queue, the egress port, and the state, into the frame’s intrinsic metadata. This triggers the queue state update. The queue state update has a small delay which is evaluated in Section V-A2.
MAT 中的每个条目对应于一个 tGCL 条目中的 8 个队列之一,也就是说,每个 tGCL 条目需要 8 个 MAT 条目。该条目指定一个队列当前应当处于打开状态还是关闭状态。查找键由 tGCL 中的相对时间戳、队列标识符以及出口端口组成,其中相对时间戳按照第 IV-B1 节计算。MAT 动作会把一个预先计算好的 AFC 值写入该帧的内部元数据;该 AFC 值对队列、出口端口和状态进行编码。这会触发队列状态更新。队列状态更新存在一个很小的延迟,该延迟在第 V-A2 节中进行评估。
数字“8 个队列 / 8 个 MAT 条目”无误;lookup key、action、precomputed AFC value 的结构关系保留完整;“state”在上下文中为打开/关闭状态。未发现明显问题。
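A rough Python model of the tGCL MAT may help to see why eight table entries per tGCL entry are needed. The sketch assumes a tGCL given as `(duration_ns, open_mask)` pairs and returns a plain open/closed flag; the actual AFC value encoding is not modeled, and all names (`TgclMatEntry`, `build_mat`, `lookup`) are hypothetical.

```python
from typing import NamedTuple, Optional


class TgclMatEntry(NamedTuple):
    start_ns: int      # first relative timestamp covered (inclusive)
    end_ns: int        # last relative timestamp covered (inclusive)
    queue_id: int
    egress_port: int
    gate_open: bool    # stands in for the precomputed AFC value


def build_mat(tgcl, egress_port: int, num_queues: int = 8) -> list:
    """Expand each tGCL entry into one MAT entry per queue."""
    mat, start = [], 0
    for duration_ns, open_mask in tgcl:
        end = start + duration_ns - 1
        for q in range(num_queues):
            is_open = bool((open_mask >> q) & 1)
            mat.append(TgclMatEntry(start, end, q, egress_port, is_open))
        start = end + 1
    return mat


def lookup(mat, rel_ts: int, queue_id: int, egress_port: int) -> Optional[bool]:
    """Match (relative timestamp, queue, port) as a TAS control frame would."""
    for e in mat:
        if (e.queue_id == queue_id and e.egress_port == egress_port
                and e.start_ns <= rel_ts <= e.end_ns):
            return e.gate_open
    return None  # no entry covers this key


# Example: 100 us with only queue 7 open, then 300 us with queues 0-6 open.
mat = build_mat([(100_000, 0b1000_0000), (300_000, 0b0111_1111)], egress_port=1)
```

A control frame for queue 7 arriving at relative time 50,000 ns would find `gate_open=True`, while the control frame for queue 0 at the same time would find `gate_open=False`.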
The mechanisms in Section IV-B1 – IV-B3 operate together within the P4-TAS pipeline as illustrated in Figure 6.
第 IV-B1 节至第 IV-B3 节中的机制在 P4-TAS 流水线内协同运行,如图 6 所示。
章节范围 “IV-B1 – IV-B3” 翻译准确;“operate together”译为“协同运行”。未发现明显问题。
First, generated period-completion frames mark the boundaries of the tGCL and sGCL cycles and maintain the reference for relative timestamp calculation in step 1. Here, a single frame is generated at the end of each period with duration \(h\), and its timestamp for the \(j\)-th period, \(t^{h}_{j}\), is stored in a register for subsequent processing. Afterward, those frames are dropped.
首先,所生成的周期完成帧标记 tGCL 和 sGCL 周期的边界,并在步骤 1 中维护用于相对时间戳计算的参考。在这里,每个持续时间为 h h 的周期结束时生成一个单独的帧,并且第 j j 个周期的时间戳 t j h t^{h}_{j} 会被存储在一个寄存器中,以供后续处理。随后,这些帧会被丢弃。
“h h”“j j”“t j h t^{h}_{j}”明显像公式抽取或 OCR/LaTeX 解析残缺,已尽量按原形式保留;需人工结合论文公式确认其正确写法,可能应为周期长度 h、第 j 个周期时间戳 \(t^h_j\)。其余逻辑为周期完成帧用于更新相对时间参考。
Second, TAS control frames are continuously generated by the packet generator with a minimal inter-arrival time. For those frames, the timestamp relative to the last elapsed period of the tGCL is calculated in step 2. This timestamp is used to match the TAS control frame to the corresponding entry of the tGCL MAT. After queuing the TAS control frame in a dedicated queue of the traffic manager, the AFC mechanism is applied in the egress in step 3. Here, the corresponding queue is opened or closed based on the current tGCL entry using the MAT described in Section IV-B3. Afterward, the frames are dropped.
其次,TAS 控制帧由数据包生成器以最小到达间隔连续生成。对于这些帧,在步骤 2 中会计算其相对于 tGCL 上一个已流逝周期的时间戳。该时间戳用于将 TAS 控制帧匹配到 tGCL MAT 的对应条目。在将 TAS 控制帧排入流量管理器的专用队列之后,AFC 机制会在步骤 3 的出口侧被应用。在这里,会使用第 IV-B3 节所描述的 MAT,并基于当前 tGCL 条目打开或关闭对应队列。随后,这些帧会被丢弃。
“last elapsed period”译为“上一个已流逝周期”,保留相对时间语义;traffic manager 译为“流量管理器”;AFC 在出口侧应用的顺序与原文一致。未发现明显问题。
Third, TSN data frames are policed in the ingress pipeline by the PSFP mechanism to enforce conformance with admitted rates and transmission times. For those frames, the timestamp relative to the last elapsed period of the sGCL is calculated in step 4, and PSFP is applied in step 5. In this step, the frames are policed and either dropped or queued according to their priority. The queue states are either in an open or a closed state based on the tGCL. Frames are forwarded in a FIFO manner as soon as their queue opens.
第三,TSN 数据帧在入口流水线中由 PSFP 机制进行监管,以强制其符合已准入的速率和传输时间。对于这些帧,在步骤 4 中会计算其相对于 sGCL 上一个已流逝周期的时间戳,并在步骤 5 中应用 PSFP。在该步骤中,这些帧会受到监管,并根据其优先级被丢弃或排入队列。队列状态则基于 tGCL 处于打开状态或关闭状态。一旦队列打开,帧就会以 FIFO 方式转发。
“policed”译为“监管”,符合 PSFP/流量监管语境;“admitted rates and transmission times”译为“已准入的速率和传输时间”;FIFO 保留。未发现明显问题。
The Intel Tofino™ 2 ASIC used in this work provides hardware support for IEEE 1588 PTP [ 19 ]. PTP enables sub-microsecond synchronization accuracy by exchanging timestamped messages between network nodes to align their local clocks. This functionality can be implemented entirely with on-board resources of the ASIC [ 19 ]. However, the P4-TAS implementation does not include a PTP synchronization mechanism since integrating such functionality is beyond the scope of this work and not required for the evaluations presented in this paper. Prior work has demonstrated that precise PTP synchronization on Tofino-based switches can be achieved by combining hardware timestamping with control plane clock management [ 39, 19 ] or even entirely within the data plane [ 21 ]. Nevertheless, the TAS functionality and internal delay characteristics we evaluate are largely independent of network-wide time alignment. Future work will explore the integration of P4-TAS into a synchronized multi-hop TSN testbed to enable coordinated, time-aware scheduling across multiple devices.
本工作中使用的 Intel Tofino™ 2 ASIC 为 IEEE 1588 PTP [19] 提供硬件支持。PTP 通过在网络节点之间交换带时间戳的消息来对齐它们的本地时钟,从而实现亚微秒级同步精度。该功能可以完全使用 ASIC 的板载资源来实现 [19]。然而,P4-TAS 实现并不包含 PTP 同步机制,因为集成此类功能超出了本工作的范围,并且对于本文所给出的评估并非必需。已有工作表明,在基于 Tofino 的交换机上,可以通过将硬件时间戳与控制平面时钟管理相结合来实现精确的 PTP 同步 [39, 19],甚至也可以完全在数据平面内实现 [21]。不过,我们所评估的 TAS 功能和内部延迟特性在很大程度上独立于全网范围的时间对齐。未来工作将探索把 P4-TAS 集成到一个同步的多跳 TSN 测试平台中,以便在多个设备之间实现协调的、时间感知的调度。
IEEE 1588 PTP、ASIC、Tofino、TSN 等缩写保留;引用编号 [19]、[39, 19]、[21] 无误;“sub-microsecond”译为“亚微秒级”;“network-wide time alignment”译为“全网范围的时间对齐”。未发现明显问题。
P4-TAS incorporates the previous P4-PSFP implementation [ 6 ]. The PSFP components stream filter, stream gate, and flow meter are implemented according to IEEE Std 802.1Qci [ 3 ]. The functionality of P4-PSFP has been extensively evaluated in [ 6 ]. In this section, we describe improvements to P4-PSFP that eliminate recirculation and increase the time resolution of GCLs.
P4-TAS 纳入了先前的 P4-PSFP 实现 [6]。PSFP 组件中的流过滤器、流门控器和流量计按照 IEEE Std 802.1Qci [3] 实现。P4-PSFP 的功能已在 [6] 中得到广泛评估。在本节中,我们描述对 P4-PSFP 的改进,这些改进消除了再循环,并提高了 GCL 的时间分辨率。
“stream filter、stream gate、flow meter”分别译为“流过滤器、流门控器、流量计”,符合 TSN/PSFP 组件语义;IEEE Std 802.1Qci、引用 [3]、[6] 保留;“recirculation”译为“再循环”。未发现明显问题。
P4-PSFP recirculates TSN traffic for two reasons. First, calculating the relative position in an sGCL does not fit in a single pipeline iteration. Second, the optional maximum frame size filter defined in IEEE Std 802.1Qci [ 3 ] requires frame size information that is only available in the egress block, while drops must occur in the ingress block. Thus, recirculation is necessary, adding a known constant delay. For P4-TAS, we ported the implementation of P4-PSFP from Intel Tofino™ to Tofino™ 2, where the larger pipeline allows the GCL position to be computed in one pass. We also removed the optional maximum frame size filter, eliminating the need for recirculation. If required, the filter can be re-added at the cost of a recirculation.
P4-PSFP 出于两个原因对 TSN 流量进行再循环。第一,计算在一个 sGCL 中的相对位置无法放入单次流水线迭代中完成。第二,IEEE Std 802.1Qci [3] 中定义的可选最大帧大小过滤器需要帧大小信息,而该信息只有在出口块中才可用,但丢弃必须发生在入口块中。因此,再循环是必要的,并会增加一个已知的恒定延迟。对于 P4-TAS,我们将 P4-PSFP 的实现从 Intel Tofino™ 移植到 Tofino™ 2,在后者中,更大的流水线允许在一次通过中计算 GCL 位置。我们还移除了可选最大帧大小过滤器,从而消除了对再循环的需要。如果需要,可以重新加入该过滤器,代价是进行一次再循环。
术语 TSN、sGCL、GCL、入口块、出口块、再循环均已保留或准确翻译;IEEE 标准号与引用 [3] 未改动;“known constant delay”译为“已知的恒定延迟”准确。未发现明显问题。
sGCL entries in P4-PSFP are modeled as MAT entries with the range matching type. However, the range matching type is limited in the TNA, and only 20 bits can be matched. Timestamps in the TNA are 48 bits wide with nanosecond granularity. Therefore, in P4-PSFP, 20 bits are cut out of the middle of the timestamp to allow range matching at an appropriate time resolution. Thus, GCLs have a minimum resolution of \qty{2}{} and a maximum resolution of approximately \qty{4}{}. GCL entries with a finer resolution, or GCLs that last longer, cannot be defined in P4-PSFP. However, due to hardware limitations, P4-TAS requires small intervals between tGCL entries, for which a minimum resolution of \qty{2}{} is too coarse. This is further elaborated in Section V-B3. Therefore, we employ an algorithm called range-to-ternary conversion [ 40 ] to increase the resolution of time slices. This algorithm allows a single range entry to be modeled using multiple ternary entries.
P4-PSFP 中的 sGCL 条目被建模为采用范围匹配类型的 MAT 条目。然而,范围匹配类型在 TNA 中受到限制,并且只能匹配 \qty 20bits。TNA 中的时间戳为 \qty 48bits,具有纳秒粒度。因此,在 P4-PSFP 中,会从时间戳的中间截取 \qty 20bits,以启用范围匹配类型并实现适当的时间分辨率。因此,GCL 的最小分辨率为 \qty 2,最大分辨率约为 \qty 4。分辨率更低的 GCL 条目,或者持续时间更长的 GCL,无法在 P4-PSFP 中定义。然而,由于硬件限制,P4-TAS 要求 tGCL 条目之间具有很小的间隔,而此时 \qty 2 的最小分辨率过大。第 V-B3 节对此作了进一步阐述。因此,我们采用一种称为 range-to-ternary conversion(范围到三值转换)[40] 的算法来提高时间片的分辨率。该算法允许使用多个三值条目来建模单个范围条目。
MAT、TNA、sGCL、tGCL 等缩写已保留;引用 [40] 和章节 V-B3 未改动。原文中的 `\qty 2`、`\qty 4` 缺少单位或上标上下文,可能是抽取残缺,例如可能对应不同时间单位或数量级;需结合论文 PDF 或公式排版人工核对。
The algorithm takes an integer range \([L, R]\) representing a time slice and breaks it down into the smallest possible set of prefixes that collectively cover the entire range. It does this by repeatedly selecting the largest prefix starting at the current lower bound that remains fully within the range. These selected blocks together ensure complete coverage of the interval [ 40 ]. Some example conversions are given in Figure 7. Each block in Figure 7 denotes a ternary entry that covers parts of the range. The \(*\) denotes a “don’t care” bit, meaning the bit can take either value 0 or 1.
该算法接收一个表示时间片的整数范围 [L, R],并将其分解为能够共同覆盖整个范围的、数量尽可能少的前缀集合。它通过反复选择从当前下界开始且仍完全位于该范围内的最大前缀来完成这一过程。这些被选中的块共同确保对区间的完整覆盖 [40]。图 7 给出了一些转换示例。图 7 中的每个块表示一个覆盖该范围部分内容的三值条目。∗ 表示一个“无关”位,意味着该位可以取 0 或 1 中的任一值。
原文中 `[ L, R ] [L,R]` 存在重复抽取,译文合并为 `[L, R]`;`∗ *` 也疑似符号重复,译文保留 `∗`。算法逻辑、引用 [40]、图 7、0/1 数值均准确。因输入存在重复识别痕迹,需人工核对原文排版。
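The greedy procedure described above can be written down compactly. The following Python function is a minimal sketch of range-to-prefix expansion as described in the text, not the code used in P4-TAS; the function name `range_to_prefixes` and the string representation of ternary entries are assumptions for illustration.

```python
def range_to_prefixes(lo: int, hi: int, width: int) -> list[str]:
    """Cover the inclusive integer range [lo, hi] with ternary prefix patterns.

    Each pattern is a string of `width` characters, where '*' is a
    "don't care" bit.
    """
    if not 0 <= lo <= hi < (1 << width):
        raise ValueError("range must fit into the given bit width")
    patterns = []
    while lo <= hi:
        # Largest aligned block that may start at lo: bounded by the
        # lowest set bit of lo (or the full space when lo is 0) ...
        size = (lo & -lo) if lo else (1 << width)
        # ... and by the number of values left in the range.
        while size > hi - lo + 1:
            size >>= 1
        wildcard_bits = size.bit_length() - 1
        fixed_bits = width - wildcard_bits
        prefix = format(lo >> wildcard_bits, f"0{fixed_bits}b") if fixed_bits else ""
        patterns.append(prefix + "*" * wildcard_bits)
        lo += size  # continue right after the block just covered
    return patterns


print(range_to_prefixes(1, 6, 3))  # ['001', '01*', '10*', '110']
```

For the 3-bit range \([1, 6]\), four ternary entries are needed, whereas the aligned range \([2, 5]\) needs only two (`01*` and `10*`). This illustrates why the number of ternary entries depends on how the time slice boundaries align with powers of two.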
In a GCL, time slices are defined as consecutive, non-overlapping ranges. Under these constraints, Sun [ 41 ] has proven that the solution is both correct and unique.
在 GCL 中,时间片被定义为连续且不重叠的范围。在这些约束下,Sun [41] 已经证明该解既正确又唯一。
“consecutive, non-overlapping ranges”译为“连续且不重叠的范围”准确;Sun [41] 引用保留;逻辑关系清楚。未发现明显问题。
With this algorithm, GCLs have a resolution of 1 ns to 78 h. The upper bound of 78 h exceeds the requirements of GCL periods by far and is not necessary in TSN networks. However, the full 48-bit timestamp range is available for matching, and reducing the resolution has no benefit. The number of ternary table entries required by this conversion algorithm to model GCLs is evaluated in Section V-C.
通过该算法,GCL 具有从 \qty 1 \nano 到 \qty 78 的分辨率。\qty 78 的上界远远超过 GCL 周期的要求,并且在 TSN 网络中并不必要。然而,完整的 \qty 48bits 时间戳范围可用于匹配,而降低分辨率并没有好处。该转换算法为建模 GCL 所需的三值表条目数量将在第 V-C 节中评估。
GCL、TSN、三值表条目、章节 V-C 均处理准确;`\qty 1 \nano` 可理解为 1 ns,但原文 `\qty 78` 缺少单位或指数,可能为抽取残缺。需人工核对具体上界单位和数值。
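Assuming the upper bound corresponds to the full 48-bit timestamp counted in nanoseconds (an interpretation based on the timestamp width given earlier, not a derivation stated in the text), the arithmetic works out as follows:

\[ 2^{48}\,\text{ns} = 281\,474\,976\,710\,656\,\text{ns} \approx 2.81 \times 10^{5}\,\text{s} \approx 78.2\,\text{h} \]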
In this section, we evaluate the P4-TAS implementation. First, we identify and quantify internal delays including the traffic generator accuracy, the queue opening delay, and the TAS control frame delay. Next, we externally measure the duration of tGCL entries and introduce gate switching intervals (GSIs) to mitigate transitional behavior between tGCL entries resulting from the queue opening delay. Then, we assess the scalability of P4-TAS by analyzing the number of supported tGCL and sGCL entries, and the maximum number of streams for identification of DetNet and TSN flows. Finally, we compare P4-TAS to available TAS implementations.
在本节中,我们评估 P4-TAS 实现。首先,我们识别并量化内部延迟,包括流量生成器精度、队列打开延迟以及 TAS 控制帧延迟。接下来,我们从外部测量 tGCL 条目的持续时间,并引入门切换间隔(gate switching intervals,GSIs),以缓解由队列打开延迟导致的 tGCL 条目之间的过渡行为。然后,我们通过分析所支持的 tGCL 和 sGCL 条目数量,以及用于识别 DetNet 和 TSN 流的最大流数量,来评估 P4-TAS 的可扩展性。最后,我们将 P4-TAS 与现有 TAS 实现进行比较。
TAS、tGCL、sGCL、GSI、DetNet、TSN 等术语和缩写保留;评估流程顺序与原文一致;“transitional behavior”译为“过渡行为”准确。未发现明显问题。
Most TSN scheduling approaches assume ideal switch behavior and neglect implementation-specific effects such as internal delays or jitter. Stüber et al. [ 8 ] address this by proposing a scheduling algorithm that accounts for such inaccuracies. They emphasize the need to consider hardware-induced variability in TAS configurations. While their work focuses on scheduling-level robustness, we take a complementary approach by identifying and quantifying undocumented internal delay sources in a hardware implementation. These findings can support the design of more accurate and robust schedules.
大多数 TSN 调度方法假设交换机行为是理想的,并忽略内部延迟或抖动等实现特定效应。Stüber 等人 [8] 通过提出一种将此类不准确性纳入考虑的调度算法来处理这一问题。他们强调,在 TAS 配置中需要考虑硬件引起的可变性。虽然他们的工作侧重于调度层面的鲁棒性,但我们采取一种互补的方法,即在一个硬件实现中识别并量化未公开记录的内部延迟来源。这些发现可以支持设计更准确且更鲁棒的调度。
引用 [8] 保留;“implementation-specific effects”“hardware-induced variability”“scheduling-level robustness”等概念翻译准确;逻辑转折“While”已体现。未发现明显问题。
Franco et al. [ 42 ] profile the latency behavior of the Intel Tofino™ ASIC. They analyze factors such as parsing depth and MAT complexity. However, beyond processing delays, additional delay sources exist within TSN bridges that are not typically disclosed [ 9 ]. We quantify several of them in our P4-TAS implementation on the Intel Tofino™ 2 platform. While the measurement results are specific to the P4-TAS implementation on the Intel Tofino™ 2 ASIC, the sources of those delays are also present in other hardware [ 9, 35 ].
Franco 等人 [42] 对 Intel Tofino™ ASIC 的延迟行为进行了剖析。他们分析了解析深度和 MAT 复杂性等因素。然而,除处理延迟之外,TSN 网桥内部还存在通常不会被披露的其他延迟来源 [9]。我们在 Intel Tofino™ 2 平台上的 P4-TAS 实现中量化了其中若干延迟来源。虽然测量结果特定于 Intel Tofino™ 2 ASIC 上的 P4-TAS 实现,但这些延迟的来源也存在于其他硬件中 [9, 35]。
Intel Tofino™、ASIC、MAT、TSN 网桥等术语保留准确;引用 [42]、[9]、[9, 35] 未改动;“profile”译为“剖析”贴合技术语境。未发现明显问题。
First, we evaluate the accuracy of the internal traffic generator which affects the timing of period-completion frames. We then analyze the queue opening delay of the AFC mechanism. Finally, we measure a delay introduced by the packet generator used for TAS control frames, and give a summary of the measurements.
首先,我们评估内部流量生成器的精度,该精度会影响周期完成帧的时序。随后,我们分析 AFC 机制的队列打开延迟。最后,我们测量由用于 TAS 控制帧的数据包生成器引入的延迟,并给出这些测量结果的总结。
AFC、TAS 保留;“period-completion frames”译为“周期完成帧”与上下文一致;三步顺序准确。未发现明显问题。
P4-TAS uses the internal packet generator to signal the completion of each tGCL cycle with a configured period \(h\) as described in Section IV-B1. A period-completion frame is generated every \(h\) ns, and the timestamp of the \(j\)-th period, denoted as \(t^{h}_{j}\), is stored in a register. Due to limitations of the packet generator, small timing deviations may occur. To quantify this effect, we measure the difference between the timestamps of consecutive period-completion frames, i.e., \(t^{h}_{j+1}\) and \(t^{h}_{j}\), relative to the configured period \(h\). The deviation \(\hat{\delta}_{\text{TG}}\) is defined in Equation 1: \[ \hat{\delta}_{\text{TG}} = (t^{h}_{j+1} - t^{h}_{j}) - h. \tag{1} \]
如第 IV-B1 节所述,P4-TAS 使用内部数据包生成器,以配置的周期 h 来指示每个 tGCL 周期的完成。每隔 h ns 生成一个周期完成帧,并将第 j 个周期的时间戳(记为 \(t^{h}_{j}\))存储在一个寄存器中。由于数据包生成器的限制,可能会出现较小的时序偏差。为了量化这一影响,我们测量连续周期完成帧的时间戳之间的差值,即 \(t^{h}_{j+1}\) 和 \(t^{h}_{j}\) 之间的差值,并将其相对于配置的周期 h 进行比较。偏差 \(\hat{\delta}_{\text{TG}}\) 在公式 1 中定义为:\(\hat{\delta}_{\text{TG}} = (t^{h}_{j+1}-t^{h}_{j})-h\)。公式编号为 (1)。
第 IV-B1 节、tGCL、周期 h、j、\(t^{h}_{j}\)、\(\hat{\delta}_{\text{TG}}\) 与公式结构均已保留;原文中 `h h`、`j j`、公式重复片段明显为文本抽取重复,译文按数学含义去重。因输入公式和变量存在重复识别痕迹,需人工核对 PDF 中的公式排版。
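Equation 1 can be applied to a recorded series of timestamps as in the following Python sketch. The function name `period_deviations` and the sample timestamps are made up for illustration; they are not measurement data from the paper.

```python
def period_deviations(timestamps_ns: list[int], period_ns: int) -> list[int]:
    """Apply Equation 1 to every pair of consecutive period-completion timestamps."""
    return [(t_next - t_prev) - period_ns
            for t_prev, t_next in zip(timestamps_ns, timestamps_ns[1:])]


# Hypothetical timestamps for a configured period of 400,000 ns.
samples = [0, 400_002, 800_001, 1_200_001]
deviations = period_deviations(samples, 400_000)  # [2, -1, 0]
```

A positive entry means the observed period was longer than configured, a negative entry means it was shorter.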
This value is recorded as a time series in a register in the data plane. Based on use cases identified by Stüber et al. [ 43 ], we select representative periods \(h\): \qty{500}{} for factory automation, \qty{2}{} for industrial isochronous traffic, and \qty{128}{} for aerospace applications. Additionally, we include \qty{10}{}, \qty{499}{}, and \qty{501}{} to analyze edge cases and artifacts, and \qty{400}{} since this period is used for the evaluation in Section V-B. For each period, we record the timestamps of 16,000 period-completion frames. Figure 8 shows the results.
该值作为时间序列记录在数据平面的一个寄存器中。基于 Stüber 等人 [43] 所识别的用例,我们选择了具有代表性的周期 \(h\):用于工厂自动化的 \(\qty{500}{}\)、用于工业等时流量的 \(\qty{2}{}\),以及用于航空航天应用的 \(\qty{128}{}\)。此外,我们还纳入 \(\qty{10}{}\)、\(\qty{499}{}\) 和 \(\qty{501}{}\),以分析边缘情况和伪影;并纳入 \(\qty{400}{}\),因为该周期用于第 V-B 节中的评估。对于每个周期,我们记录 16,000 个周期完成帧的时间戳。图 8 展示了结果。
术语“period-completion frames”译为“周期完成帧”,需结合全文确认是否已有固定译名;\(\qty{}\) 原文未显示单位,已保留符号形式。数字 500、2、128、10、499、501、400 和 16,000 均已保留。逻辑关系完整。
The boxplot in Figure 8 shows the median as a red line, the first and third quartiles as the edges of the box, and whiskers that extend to 1.5 times the interquartile range. Values outside this range are plotted as outliers. A positive \(\hat{\delta}_{\text{TG}}\) indicates that the actual period exceeded the configured value by \(\hat{\delta}_{\text{TG}}\), while a negative value means it was shorter by that amount.
图 8 中的箱线图用红线表示中位数,用箱体边缘表示第一四分位数和第三四分位数,并用须线表示延伸到四分位距 1.5 倍的位置。该范围之外的值被绘制为离群值。正的 \(\hat{\delta}_{\text{TG}}\) 表示实际周期比配置值长 \(\hat{\delta}_{\text{TG}}\),而负值表示实际周期比配置值短相同的量。
“whiskers”译为“须线”符合箱线图术语;\(\hat{\delta}_{\text{TG}}\) 保留正确。正负偏差含义翻译完整。未发现明显问题。
Most periods show deviations below \(\hat{\delta}_{\text{TG}} = \qty{2}{}\), with all outliers staying within \(\pm\qty{11}{}\). Exceptions occur at periods of \qty{400}{} and \qty{500}{}, which show a wider spread with fewer outliers. We attribute this to internal scheduling behavior of the packet generator in the Intel Tofino™ switching ASIC. Shifting the period slightly, e.g., to \qty{499}{} or \qty{501}{}, results in deviations similar to the other configurations.
大多数周期表现出的偏差低于 \(\hat{\delta}_{\text{TG}}=\qty{2}{}\),所有离群值都保持在 \(\pm \qty{11}{}\) 以内。例外出现在周期为 \(\qty{400}{}\) 和 \(\qty{500}{}\) 时,其分布范围更宽,但离群值更少。我们将其归因于 Intel Tofino™ 交换 ASIC 中数据包生成器的内部调度行为。将周期略微移动,例如移动到 \(\qty{499}{}\) 或 \(\qty{501}{}\),会得到与其他配置类似的偏差。
“wider spread with less outliers”译为“分布范围更宽,但离群值更少”;严格语法应为 fewer outliers,原意明确。Intel Tofino™、ASIC、\(\hat{\delta}_{\text{TG}}\)、数值均保留。未显示单位,已保留 \(\qty{}\)。
Although these deviations are small, they can impact the periodicity computation. If a period-completion frame arrives late, the computed relative position within the current GCL cycle may exceed the period \(h\), which would index an out-of-period entry. To ensure that all frames are assigned to a valid tGCL entry, P4-TAS clamps any calculated position \(\geq h\) to the final entry of the cycle. Conversely, if a period-completion frame arrives early, the periodicity mechanism in Section IV-B1 semantically evaluates the position modulo \(h\), so the result always lies in \([0, h)\). Therefore, the deviation from the configured period is compensated and all frames are mapped to existing GCL entries.
尽管这些偏差很小,它们仍可能影响周期性计算。如果一个周期完成帧到达较晚,则当前 GCL 周期内计算得到的相对位置可能超过周期 \(h\),这会索引到一个超出周期范围的条目。为确保所有帧都被分配到有效的 tGCL 条目,P4-TAS 会将任何计算得到的、满足 \(\geq h\) 的位置钳制到该周期的最后一个条目。相反,如果一个周期完成帧提前到达,则第 IV-B1 节中的周期性机制会在语义上按模 \(h\) 对位置求值,因此结果始终位于 \([0,h)\) 中。因此,相对于配置周期的偏差得到了补偿,所有帧都会映射到既有的 GCL 条目。
“clamps”译为“钳制”较技术化,也可译为“截断/限制”;此处含义是将越界值限制到最后条目。GCL、tGCL、\(h\)、\([0,h)\) 保留正确。逻辑中“晚到导致位置超过周期、早到通过模运算落入范围”已完整表达。
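The handling of late and early period-completion frames can be sketched as follows. This is a simplified model, assuming that clamping maps any position at or beyond \(h\) to the last nanosecond of the cycle, which lies in the final entry; the function name `cycle_position` and the numbers are hypothetical.

```python
def cycle_position(frame_ts_ns: int, last_period_ts_ns: int, period_ns: int) -> int:
    """Relative position of a frame in the current cycle, clamped to [0, period_ns)."""
    position = frame_ts_ns - last_period_ts_ns
    # Late period-completion frame: the reference is stale, so the position
    # may reach or exceed the period. Clamp it into the final entry.
    return min(position, period_ns - 1)


h = 400_000
# The period-completion frame is 7 ns late, so the old reference is still in use.
late = cycle_position(1_400_007, 1_000_000, h)
# The period-completion frame arrived 10 ns early and already refreshed the
# reference, so the position simply restarts near zero.
early = cycle_position(1_399_995, 1_399_990, h)
```

In both cases the returned position falls inside the configured cycle, so a table lookup always hits an existing entry.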
In the TNA, there is a small but non-zero delay between writing the AFC value, i.e., initiating a queue state change, and the actual update of the queue state in the hardware [ 44 ]. To quantify internal delays in the AFC mechanism, we measure the time between issuing a queue state change and the actual release of TSN frames. We denote this queue opening delay as \(\hat{\delta}_{\text{queue}}\) (footnote 1: measurements showed that queue opening and closing delays are distributed in the same way in the TNA). This delay impacts TSN precision and is rarely documented for available hardware. The measurement procedure is implemented in the data plane of P4-TAS and is shown in Figure 9.
在 TNA 中,从写入 AFC 值,也即发起队列状态改变,到硬件中队列状态实际更新之间,存在一个很小但非零的延迟 [44]。为量化 AFC 机制中的内部延迟,我们测量发出队列状态改变与 TSN 帧实际释放之间的时间。我们将该队列开启延迟记为 \(\hat{\delta}_{\text{queue}}\)。测量显示,在 TNA 中,队列开启延迟和关闭延迟以相同方式分布。该延迟会影响 TSN 精度,并且在可获得的硬件文档中很少被记录。测量过程在 P4-TAS 的数据平面中实现,如图 9 所示。
原文脚注内容夹在句中:“Measurements showed...” 已作为独立句译出。AFC、TNA、TSN、\(\hat{\delta}_{\text{queue}}\) 保留正确。“queue opening delay”译为“队列开启延迟”。脚注编号“1 1 1”疑似抽取噪声,未直译。
In Figure 9, a closed queue is first filled with TSN frames (step 1). When a TAS control frame matches a tGCL entry that opens the queue, it triggers a queue opening via AFC and records the timestamp \(t_{\text{change}}\) (step 2). The dequeuing timestamp \(t_{\text{deq}}\) of the first TSN frame leaving the queue is then used to compute \(\hat{\delta}_{\text{queue}}\) as shown in Equation 2 (step 3):
在图 9 中,首先用 TSN 帧填充一个关闭的队列(步骤 1)。当一个 TAS 控制帧匹配到打开该队列的 tGCL 条目时,它会通过 AFC 触发队列开启,并记录时间戳 \(t_{\text{change}}\)(步骤 2)。随后,使用离开队列的第一个 TSN 帧的出队时间戳 \(t_{\text{deq}}\),按照公式 2 计算 \(\hat{\delta}_{\text{queue}}\)(步骤 3):
“dequeuing timestamp”译为“出队时间戳”;TAS、tGCL、AFC、TSN 均保留。原文公式符号中出现 \(\text{\text{queue}}\) 双重 text 抽取形式,译文规范化为 \(\text{queue}\)。未发现明显问题。
\[ \hat{\delta}_{\text{queue}} = t_{\text{deq}} - t_{\text{change}}. \tag{2} \]
\[ \hat{\delta}_{\text{queue}} = t_{\text{deq}} - t_{\text{change}}. \] (2)
公式含义为队列开启延迟等于第一个 TSN 帧出队时间减去状态改变发起时间;符号已从原文重复抽取形式规范化。未发现明显问题。
This value is stored as a time series in a register of the data plane for all observed transitions (step 4).
对于所有观测到的转换,该值都作为时间序列存储在数据平面的一个寄存器中(步骤 4)。
“observed transitions”译为“观测到的转换”,结合上下文指队列状态转换。未发现明显问题。
The tGCL for this measurement is configured with eight consecutive entries, one per priority. Each entry opens the corresponding priority queue for \qty{100}{}, so that the schedule cycles through all eight priorities in turn. TSN traffic is generated using P4TG [ 45, 46, 47 ] at \qty{400}{} with randomized priorities and \qty{64}{} frames. This ensures that the queues are saturated. The experiment is run for \qty{60}{}. Figure 10 shows the complementary cumulative distribution function (CCDF) of the measured queue opening delay \(\hat{\delta}_{\text{queue}}\).
用于该测量的 tGCL 被配置为八个连续条目,每个优先级对应一个条目。每个条目都会将相应的优先级队列打开 \(\qty{100}{}\),从而使调度依次循环经过全部八个优先级。TSN 流量使用 P4TG [45, 46, 47] 生成,速率为 \(\qty{400}{}\),优先级随机化,帧大小为 \(\qty{64}{}\)。这确保队列处于饱和状态。实验运行 \(\qty{60}{}\)。图 10 展示了测得的队列开启延迟 \(\hat{\delta}_{\text{queue}}\) 的互补累积分布函数(CCDF)。
\(\qty{100}{}\)、\(\qty{400}{}\)、\(\qty{64}{}\)、\(\qty{60}{}\) 原文未显示单位,已保留符号形式;“at \(\qty{400}{}\)”可能指速率或周期,需结合上下文确认单位和含义。“complementary cumulative distribution function”译为“互补累积分布函数”。引用 [45, 46, 47] 已保留。
Most delays are below \(\hat{\delta}_{\text{queue}} = \qty{11}{}\), with a tail extending up to \qty{63}{} and a mean of \(\mu(\hat{\delta}_{\text{queue}}) = \qty{14.63}{}\). These results reveal small but measurable internal delays. In particular, the queue opening delay can cause transitional behavior at tGCL boundaries where frames from the previous entry may still be transmitted briefly after the next entry has started. The impact of this effect and the role of gate switching intervals (GSIs) are evaluated in Section V-B3.
大多数延迟低于 \(\hat{\delta}_{\text{queue}}=\qty{11}{}\),尾部延伸到 \(\qty{63}{}\),均值为 \(\mu(\hat{\delta}_{\text{queue}})=\qty{14.63}{}\)。这些结果揭示了很小但可测量的内部延迟。特别是,队列开启延迟可能在 tGCL 边界处造成过渡行为,即在下一个条目已经开始之后,来自前一个条目的帧仍可能短暂地继续传输。该效应的影响以及门控切换间隔(gate switching intervals, GSIs)的作用在第 V-B3 节中评估。
\(\hat{\delta}_{\text{queue}}\)、\(\mu(\hat{\delta}_{\text{queue}})\)、11、63、14.63 均已保留;单位原文未显示,保留 \(\qty{}\)。GSI 首次在本段展开为“门控切换间隔”,需与全文术语表保持一致。逻辑完整。
For TAS control frames, the internal packet generator is configured to generate a frame every nanosecond. The frames are sequentially generated in batches of eight, with each frame controlling one of the eight priority queues. In practice, however, a frame cannot be generated every nanosecond. Instead, a small delay occurs between frame generation which limits the granularity at which queue state updates can be triggered. To quantify this phenomenon, we collect the timestamp of each TAS control frame in the data plane of P4-TAS and compute the delay \(\hat{\delta}_{\text{control}}\) between two consecutive frames \(i\) and \(i+1\):
对于 TAS 控制帧,内部数据包生成器被配置为每纳秒生成一个帧。这些帧按每批八个的方式顺序生成,其中每个帧控制八个优先级队列之一。然而,在实践中,无法每纳秒生成一个帧。相反,帧生成之间会出现一个小的延迟,这限制了能够触发队列状态更新的粒度。为了量化这一现象,我们在 P4-TAS 的数据平面中收集每个 TAS 控制帧的时间戳,并计算两个连续帧 i 和 i+1 之间的延迟 \(\hat{\delta}_{\text{control}}\):
术语“TAS 控制帧”“内部数据包生成器”“优先级队列”“数据平面”保持一致;数字“每纳秒”“八个”已保留;逻辑上说明理论配置与实践限制的转折已保留。公式符号 \(\hat{\delta}_{\text{control}}\)、i、i+1 已保留。未发现明显问题。
\[ \hat{\delta}_{\text{control}} = t_{i+1}-t_{i}. \tag{3} \]
\[ \hat{\delta}_{\text{control}} = t_{i+1}-t_i. \tag{3} \]
公式保持原意,即连续两个帧时间戳之差;编号 (3) 已保留。未发现明显问题。
We collect 100,000 values for \(\hat{\delta}_{\text{control}}\), all calculated in the data plane and stored in a time series register. The resulting histogram is shown in Figure 11.
我们收集了 \(\hat{\delta}_{\text{control}}\) 的 100,000 个取值,这些值全部在数据平面中计算,并存储在一个时间序列寄存器中。所得直方图如图 11 所示。
数字 100,000、图 11 均已保留;“data plane”译为“数据平面”,“time series register”译为“时间序列寄存器”。未发现明显问题。
The measured median is \(\hat{\delta}_{\text{control,M}}=\qty{9}{}\), with only a few frames showing a slightly higher delay of up to \(\qty{12}{}\). Thus, transmission gate states can be updated only every \(\qty{9}{}\). Because frames are generated sequentially in batches of eight, updates for different priority queues are offset sequentially by \(\qty{9}{}\) and cannot occur simultaneously. Further, this means that the transmission gate state update of the same priority can be triggered every \(8\cdot\hat{\delta}_{\text{control}}\approx\qty{72}{}\). This value should be seen as a worst-case upper bound. In practice, the effective delay can be close to zero if a control frame arrives just before a scheduled gate change. Such a short delay only matters if the tGCL entry resolution is on the order of \(\qty{72}{}\), which is much smaller than typical tGCL entry durations [8].
测得的中位数为 \(\hat{\delta}_{\text{control,M}}=\qty{9}{}\),只有少数帧表现出略高的延迟,最高可达 \(\qty{12}{}\)。因此,传输门状态只能每 \(\qty{9}{}\) 更新一次。由于帧按每批八个的方式顺序生成,不同优先级队列的更新会以 \(\qty{9}{}\) 为间隔依次偏移,不能同时发生。此外,这意味着同一优先级的传输门状态更新可以每 \(8\cdot\hat{\delta}_{\text{control}}\approx\qty{72}{}\) 触发一次。该值应被视为最坏情况下的上界。在实践中,如果一个控制帧恰好在计划的门状态变化之前到达,则有效延迟可以接近于零。只有当 tGCL 条目的分辨率处于 \(\qty{72}{}\) 量级时,这样的短延迟才重要,而这远小于典型的 tGCL 条目持续时间 [8]。
\(\qty{9}{}\)、\(\qty{12}{}\)、\(\qty{72}{}\) 的单位在输入中缺失,可能由 PDF 抽取造成,需结合论文原图或 LaTeX 源确认,推测语境可能为纳秒但未在译文中擅自补充;公式 \(8\cdot\hat{\delta}_{\text{control}}\approx\qty{72}{}\) 已保留;“worst-case upper bound”译为“最坏情况下的上界”。存在单位缺失风险。
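The computation behind Equation 3 and the factor of eight can be sketched in a few lines of Python. This is only an illustration: the timestamps below are hypothetical (not measured data), the unit is assumed to be nanoseconds, and the function names are invented for this sketch.

```python
from statistics import median


def control_frame_delays(timestamps):
    """Equation 3: t[i+1] - t[i] for consecutive control-frame timestamps."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def same_priority_update_interval(delays, batch_size=8):
    """Time between two updates of the same gate when control frames are
    generated round-robin in batches of `batch_size`."""
    return batch_size * median(delays)


# Hypothetical timestamps (assumed nanoseconds); one frame is slightly late.
timestamps = [0, 9, 18, 27, 39, 48, 57, 66, 75]
delays = control_frame_delays(timestamps)
print(delays)
print(same_priority_update_interval(delays))
```

With a median spacing of 9 between control frames, the same gate is revisited every 8 · 9 = 72 time units, which matches the bound quoted in the paragraph above.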
Table I gives an overview of the identified and measured internal delays in the best and in the worst case.
表 I 概述了在最佳情况和最坏情况下识别并测量得到的内部延迟。
表号 Table I 已译为“表 I”;“identified and measured internal delays”译为“识别并测量得到的内部延迟”。未发现明显问题。
Those internal delays accumulate to \(\Delta_{\text{internal}}\), shown in Equation 4:
这些内部延迟会累积为公式 4 中所示的 \(\Delta_{\text{internal}}\):
逻辑“内部延迟累积”已保留;符号 \(\Delta_{\text{internal}}\) 与公式编号 4 已保留。未发现明显问题。
\[ \Delta_{\text{internal}}=\delta_{\text{TG}}+\delta_{\text{queue}}+\delta_{\text{control}}. \tag{4} \]
\[ \Delta_{\text{internal}}=\delta_{\text{TG}}+\delta_{\text{queue}}+\delta_{\text{control}}. \tag{4} \]
公式中的三项 \(\delta_{\text{TG}}\)、\(\delta_{\text{queue}}\)、\(\delta_{\text{control}}\) 已完整保留;编号 (4) 已保留。未发现明显问题。
The internal delay \(\Delta_{\text{internal}}\) may reduce or extend the duration of a tGCL entry. Figure 12 illustrates this effect for three consecutive tGCL entries of configured duration \(d\).
内部延迟 \(\Delta_{\text{internal}}\) 可能会缩短或延长一个 tGCL 条目的持续时间。图 12 以三个连续的、配置持续时间为 d 的 tGCL 条目说明了这一影响。
“reduce or extend”译为“缩短或延长”;三个连续 tGCL 条目与配置持续时间 d 已保留。未发现明显问题。
If the preceding tGCL entry \(i-1\) experiences a negative internal delay, it is shortened while tGCL entry \(i\) is extended. In addition, tGCL entry \(i\) itself may experience a positive delay. In this case, the actual duration of tGCL entry \(i\) becomes
如果前一个 tGCL 条目 \(i-1\) 经历了负的内部延迟,则它会被缩短,而 tGCL 条目 i 会被延长。此外,tGCL 条目 i 本身也可能经历正延迟。在这种情况下,tGCL 条目 i 的实际持续时间变为
条目 \(i-1\) 与 i 的关系已保留;负内部延迟导致前一条目缩短、当前条目延长的逻辑已保留;段落以公式引出,语义完整依赖下一段公式。未发现明显问题。
\[ \hat{d}_{i}=d+|\Delta^{i-1}_{\text{internal}}|+\Delta^{i}_{\text{internal}}. \tag{5} \]
\[ \hat{d}_i=d+|\Delta^{i-1}_{\text{internal}}|+\Delta^i_{\text{internal}}. \tag{5} \]
公式中的实际持续时间 \(\hat{d}_i\)、配置持续时间 d、前一条目的内部延迟绝对值 \(|\Delta^{i-1}_{\text{internal}}|\)、当前条目的内部延迟 \(\Delta^i_{\text{internal}}\) 均已保留;编号 (5) 已保留。未发现明显问题。
In the worst case, \(\Delta^{i}_{\text{internal}}\) is composed of the maximum traffic generator deviation, queue opening delay, and control traffic delay: \(\Delta^{i}_{\text{internal},\max}=\qty{11}{}+\qty{63}{}+\qty{12}{}=\qty{86}{}\). Further, \(\Delta^{i-1}_{\text{internal}}\) can be negative if the traffic generator deviation is negative and all other delays are close to zero, yielding up to \(\qty{11}{}\) of shortening. This implies that a tGCL entry may be extended by up to \(\qty{86}{}\), or be shortened by up to \(\qty{11}{}\). Further, through correlation of consecutive tGCL entries, a tGCL entry may be extended by up to \(\qty{97}{}\), as shown in Figure 12.
在最坏情况下,\(\Delta^{i}_{\text{internal}}\) 由最大流量发生器偏差、队列开启延迟以及控制流量延迟组成:\(\Delta^{i}_{\text{internal},\max}=\qty{11}{}+\qty{63}{}+\qty{12}{}=\qty{86}{}\)。此外,如果流量发生器偏差为负且所有其他延迟都接近于零,则 \(\Delta^{i-1}_{\text{internal}}\) 可以为负,从而最多产生 \(\qty{11}{}\) 的缩短。这意味着,一个 tGCL 条目最多可能被延长 \(\qty{86}{}\),或者被缩短 \(\qty{11}{}\)。进一步地,通过连续 tGCL 条目之间的相关性,一个 tGCL 条目最多可能被延长 \(\qty{97}{}\),如图 12 所示。
数字 11、63、12、86、11、97 均已保留;公式与上下标含义已保留。`\qty{}` 的单位在输入中为空,可能因识别或源文件缺失导致单位不明,需结合论文上下文确认是否为 ns。
In the best case, a TAS control traffic frame arrives exactly at the switchover point to a new tGCL entry, resulting in a control traffic delay of \(\delta_{\text{control},\min}=\qty{0}{}\). Combined with the measured best-case queue delay \(\delta_{\text{queue},\min}=\qty{1}{}\) and traffic generator accuracy \(\delta_{\text{TG,min}}=0\), the best-case internal delay is \(\Delta_{\text{internal},\min}=\qty{1}{}\).
在最佳情况下,一个 TAS 控制流量帧正好在切换到新的 tGCL 条目的切换点到达,从而使控制流量延迟为 \(\delta_{\text{control},\min}=\qty{0}{}\)。结合测得的最佳情况队列延迟 \(\delta_{\text{queue},\min}=\qty{1}{}\),以及流量发生器精度 \(\delta_{\text{TG,min}}=0\),最佳情况内部延迟为 \(\Delta_{\text{internal},\min}=\qty{1}{}\)。
术语 TAS、tGCL、控制流量延迟、队列延迟、流量发生器精度均保持一致;数值 0、1、0、1 已保留。`\qty{}` 单位为空,需人工确认单位是否在原文排版中遗漏或被抽取丢失。
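The worst-case and best-case figures follow directly from Equations 4 and 5. The sketch below reproduces the arithmetic; the numbers are the ones quoted in the text, the unit is assumed to be nanoseconds, and the configured duration `d` is a hypothetical placeholder that cancels out of the result.

```python
def internal_delay(tg, queue, control):
    """Equation 4: sum of the three internal delay components."""
    return tg + queue + control


def actual_duration(d, delta_prev, delta_cur):
    """Equation 5: configured duration plus |previous delay| plus current delay."""
    return d + abs(delta_prev) + delta_cur


# Component values as quoted in the text (unit assumed to be nanoseconds).
delta_max = internal_delay(tg=11, queue=63, control=12)
delta_min = internal_delay(tg=0, queue=1, control=0)

# The previous entry can end early by at most the generator deviation.
max_shortening = 11

d = 50_000  # hypothetical configured entry duration; it cancels out below
worst_extension = actual_duration(d, -max_shortening, delta_max) - d

print(delta_max, delta_min, worst_extension)
```

The result shows how the bound of 97 arises: 86 from the entry's own worst-case delay plus 11 inherited from a shortened predecessor.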
These values therefore define a theoretical bound for deviations in tGCL entry duration. The following evaluation section examines how often and to what extent such deviations occur in practice.
因此,这些数值定义了 tGCL 条目持续时间偏差的理论界限。下面的评估章节将考察此类偏差在实践中出现的频率以及程度。
逻辑关系“therefore”已译出;“how often and to what extent”已分别译为“频率以及程度”。未发现明显问题。
P4-TAS enables the configuration of tGCLs and their periods with nanosecond granularity. However, the internal delays characterized in Section V-A may introduce deviations between the configured and the actual durations of tGCL entries. This section evaluates the accuracy of configured tGCL entries by comparing the expected duration with the measured duration observed in the data plane. First, we present the testbed and describe the measurement procedure. Then, we analyze the results and introduce gate switching intervals (GSIs) to improve timing accuracy.
P4-TAS 支持以纳秒粒度配置 tGCL 及其周期。然而,第 V-A 节中表征的内部延迟可能会在 tGCL 条目的配置持续时间与实际持续时间之间引入偏差。本节通过比较预期持续时间与在数据平面中观察到的测量持续时间,来评估所配置 tGCL 条目的准确性。首先,我们介绍测试床并描述测量流程。随后,我们分析结果,并引入门控切换间隔(gate switching intervals,GSIs)以提高定时准确性。
“nanosecond granularity”译为“纳秒粒度”;“data plane”译为“数据平面”;GSIs 缩写和全称均已保留。未发现明显问题。
The testbed for the external tGCL entry measurement is shown in Figure 13.
用于外部 tGCL 条目测量的测试床如图 13 所示。
“external tGCL entry measurement”译为“外部 tGCL 条目测量”;图号 13 已保留。未发现明显问题。
Traffic is generated with P4TG [45, 46, 47] at a rate of \(\qty{514}{Mpps}\) using minimum-size \(\qty{64}{}\) frames and a constant inter-arrival time, i.e., no bursts. Each frame is assigned a random priority sampled from a uniform distribution and is encapsulated with MPLS to validate the DetNet translation. A tGCL with a period of \(\qty{400}{}\) divided into eight \(\qty{50}{}\) entries is configured in P4-TAS. During each entry, only one of the eight queues is open, corresponding to one priority. Incoming MPLS traffic is translated into a TSN stream, after which the configured tGCL is applied based on the resulting TSN stream identifier. After shaping by the TAS, the traffic is forwarded to a third Tofino™ switch, which records frame arrival times per priority in a dedicated P4 program.
流量由 P4TG [45, 46, 47] 生成,速率为 \(\qty{514}{Mpps}\),使用最小尺寸的 \(\qty{64}{}\) 帧,并采用恒定的到达间隔时间,即没有突发。每个帧都会被分配一个从均匀分布中采样得到的随机优先级,并使用 MPLS 进行封装,以验证 DetNet 转换。在 P4-TAS 中配置了一个周期为 \(\qty{400}{}\) 的 tGCL,该周期被划分为八个 \(\qty{50}{}\) 条目。在每个条目期间,八个队列中只有一个队列打开,对应一个优先级。传入的 MPLS 流量被转换为 TSN 流,随后基于所得 TSN 流标识符应用所配置的 tGCL。经过 TAS 整形之后,流量被转发到第三台 Tofino™ 交换机,该交换机在一个专用 P4 程序中按优先级记录帧到达时间。
P4TG 引用 [45, 46, 47]、514 Mpps、64、400、八个 50 条目、MPLS、DetNet、TSN、TAS、Tofino™ 均已保留。`\qty{64}{}`、`\qty{400}{}`、`\qty{50}{}` 单位为空,尤其 64 可能指 64 字节帧,400/50 可能为时间单位,需结合原文确认。
The measurement procedure in the dedicated P4 program on the third switch is based on detecting changes in priority within the received stream. It assumes that frames of only one priority \(\pi\in\{0,\ldots,7\}\) arrive at the measurement switch during each tGCL entry as configured in P4-TAS. A series of timestamps of the first and last frame in a tGCL entry, i.e., of the same priority, is collected and stored in the data plane. This is illustrated in Figure 14.
第三台交换机上的专用 P4 程序中的测量流程,基于检测接收流中优先级的变化。它假设在每个 tGCL 条目期间,只有一种优先级 \(\pi\in\{0,\ldots,7\}\) 的帧会按照 P4-TAS 中的配置到达测量交换机。系统会收集一个 tGCL 条目内第一个帧和最后一个帧的时间戳序列,也就是同一优先级帧的时间戳序列,并将其存储在数据平面中。图 14 对此进行了说明。
优先级集合 \(\pi\in\{0,\ldots,7\}\) 已保留;“first and last frame in a tGCL entry, i.e., of the same priority”的限定关系已译出。未发现明显问题。
For priority \(\pi=0\), the arrival time of the first frame in the \(i\)-th tGCL entry is stored as \(t^{i,\pi=0}_{\text{first}}\). When the next priority \(\pi=1\) appears, the arrival time of the last frame of the previous priority \(\pi=0\) is stored as \(t^{i,\pi=0}_{\text{last}}\), and the new frame marks \(t^{i+1,\pi=1}_{\text{first}}\). This is shown in step 1 in Figure 14. The control plane calculates the duration of entry \(i\) for priority \(\pi\) as follows:
对于优先级 \(\pi=0\),第 \(i\) 个 tGCL 条目中第一个帧的到达时间被存储为 \(t^{i,\pi=0}_{\text{first}}\)。当下一个优先级 \(\pi=1\) 出现时,前一个优先级 \(\pi=0\) 的最后一个帧的到达时间被存储为 \(t^{i,\pi=0}_{\text{last}}\),而新的帧则标记 \(t^{i+1,\pi=1}_{\text{first}}\)。这如图 14 中的步骤 1 所示。控制平面按如下方式计算优先级 \(\pi\) 的条目 \(i\) 的持续时间:
\(\pi=0\)、\(\pi=1\)、\(t^{i,\pi=0}_{\text{first}}\)、\(t^{i,\pi=0}_{\text{last}}\)、\(t^{i+1,\pi=1}_{\text{first}}\) 均已保留;输入中 “i i -th” 属于抽取重复,已按“第 \(i\) 个”处理。未发现明显问题。
\[ \hat{d}_{i}^{\pi}=t^{i,\pi}_{\text{last}}-t^{i,\pi}_{\text{first}}. \tag{6} \]
\[ \hat{d}_{i}^{\pi}=t^{i,\pi}_{\text{last}}-t^{i,\pi}_{\text{first}}. \tag{6} \]
公式编号 (6)、估计持续时间 \(\hat{d}_{i}^{\pi}\)、最后帧与第一帧时间戳之差均已保留。未发现明显问题。
The measured tGCL entry duration is then compared with the configured tGCL entry duration of \(d=\qty{50}{}\), and the deviation \(\hat{\delta}^{i,\pi}_{\text{slice}}\) is obtained as
随后,将测得的 tGCL 条目持续时间与配置的 tGCL 条目持续时间 \(d=\qty{50}{}\) 进行比较,并按如下方式得到偏差 \(\hat{\delta}^{i,\pi}_{\text{slice}}\):
数值 \(d=\qty{50}{}\) 与偏差符号 \(\hat{\delta}^{i,\pi}_{\text{slice}}\) 已保留。该段以 “as” 引出后续公式,但当前输入未包含公式,存在表格或公式上下文缺失风险。
\[ \hat{\delta}^{i,\pi}_{\text{slice}}=\hat{d}_{i}^{\pi}-d. \tag{7} \]
\[ \hat{\delta}^{i,\pi}_{\text{slice}}=\hat{d}_{i}^{\pi}-d. \tag{7} \]
公式符号按原文保留;\(\hat{\delta}^{i,\pi}_{\text{slice}}\)、\(\hat{d}_{i}^{\pi}\)、\(d\) 的上下文定义不在本段内,需依赖前文。未发现明显问题。
A negative value for \(\hat{\delta}^{i,\pi}_{\text{slice}}\) thus means that the measured tGCL entry duration was shorter than the configured duration, while a positive value means that it was longer. A total of 32,764 values for \(\hat{\delta}^{i,\pi}_{\text{slice}}\) is collected.
因此,\(\hat{\delta}^{i,\pi}_{\text{slice}}\) 的负值表示测得的 tGCL 条目持续时间短于配置的持续时间,而正值表示其长于配置的持续时间。总共收集了 32,764 个 \(\hat{\delta}^{i,\pi}_{\text{slice}}\) 的取值。
正负号含义与公式 \(\hat{d}_{i}^{\pi}-d\) 一致;数字 32,764 已保留;tGCL 缩写保留。未发现明显问题。
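The priority-change detection and Equations 6 and 7 can be illustrated offline in Python. This is a sketch under assumptions, not the P4 program from the paper: the input is a hypothetical list of `(arrival_time, priority)` pairs in arrival order, and the function names are invented here.

```python
def entry_runs(frames):
    """Group frames into runs of equal priority.

    frames: iterable of (arrival_time, priority) in arrival order.
    Returns a list of (priority, t_first, t_last), one per run.
    """
    runs = []
    for t, prio in frames:
        if runs and runs[-1][0] == prio:
            runs[-1][2] = t           # same priority: update t_last
        else:
            runs.append([prio, t, t])  # priority changed: start a new run
    return [tuple(run) for run in runs]


def slice_deviations(frames, configured):
    """Equation 6 (t_last - t_first) followed by Equation 7 (minus d)."""
    return [(prio, (t_last - t_first) - configured)
            for prio, t_first, t_last in entry_runs(frames)]


# Hypothetical capture: priority 0 for one entry, then priority 1.
frames = [(0, 0), (20, 0), (48, 0), (52, 1), (75, 1), (104, 1)]
print(slice_deviations(frames, configured=50))
```

In a real capture the first and last runs are usually truncated by the start and end of the recording, so they would have to be discarded before computing statistics.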
The internal queue opening/closing delay identified in Section V-A2 causes queue state transitions to occur over a short interval instead of instantaneously. This may cause transitional behavior where queues of a tGCL entry are not yet closed while queues of the next tGCL entry have already begun forwarding. As a result, frames from two tGCL entries are transmitted simultaneously, violating the configured tGCL. This overlap is a phenomenon of P4-TAS, not an artifact of the measurement, and must be addressed. To mitigate this effect, we introduce gate switching intervals (GSIs), which are illustrated in Figure 15.
第 V-A2 节中识别出的内部队列开启/关闭延迟,会导致队列状态转换在一个很短的间隔内发生,而不是瞬时发生。这可能引起过渡行为,即某个 tGCL 条目的队列尚未关闭,而下一个 tGCL 条目的队列已经开始转发。因此,来自两个 tGCL 条目的帧会被同时传输,从而违反已配置的 tGCL。这种重叠是 P4-TAS 的一种现象,并非测量伪影,必须予以处理。为减轻这一影响,我们引入了门控切换间隔(gate switching intervals, GSIs),如图 15 所示。
“measurement artifact”译为“测量伪影”准确;“opening/closing delay”译为“开启/关闭延迟”;因果链条完整保留;GSI 缩写和图 15 保留。未发现明显问题。
Gate switching intervals (GSIs) are short, explicit tGCL entries in which all queues are closed. They are inserted between tGCL entries. These GSIs suppress transitional forwarding behavior and isolate each tGCL entry. We configured GSIs of \(\qty{30}{}\), which was sufficient to eliminate overlap without significantly impacting available transmission time. While the worst-case queue opening delay measured in Section V-A reaches \(\qty{63}{}\), a \(\qty{30}{}\) GSI provides sufficient isolation because the GSI itself is subject to the same internal delays. This effectively extends the GSI's duration and ensures that queue state transitions complete before the next scheduled entry begins. Larger GSIs did not improve the results.
门控切换间隔是短的、显式的 tGCL 条目,在这些条目中所有队列均关闭。它们被插入到 tGCL 条目之间。这些 GSI 抑制过渡性转发行为,并隔离每个 tGCL 条目。我们配置了 \(\qty{30}{}\) 的 GSI,这足以消除重叠,同时不会显著影响可用传输时间。虽然第 V-A 节中测得的最坏情况队列开启延迟达到 \(\qty{63}{}\),但 \(\qty{30}{}\) 的 GSI 能提供足够的隔离,因为 GSI 本身也受到相同内部延迟的影响。这实际上延长了 GSI 的持续时间,并确保队列状态转换在下一个调度条目开始之前完成。更大的 GSI 并未改善结果。
原文 `\qty 30` 和 `\qty 63` 缺少单位,按公式宏形式保留;由于单位缺失,可能需要结合图表或上下文确认是否为 ns。逻辑上“GSI 本身受相同延迟影响从而有效延长持续时间”已保留。
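As an illustration of how GSIs change a gate control list, the sketch below inserts an all-closed entry after every regular entry. Several things here are assumptions rather than facts from the paper: the list representation as `(gate_mask, duration)` pairs, the choice to append GSIs (lengthening the period) instead of carving them out of the existing entries, and the numeric durations, which are placeholders in a single unspecified time unit.

```python
ALL_CLOSED = 0b00000000  # gate mask with every queue closed


def with_gsis(entries, gsi):
    """Return a new list with an all-closed entry after each original entry.

    entries: list of (gate_mask, duration); bit p of gate_mask opens queue p.
    gsi: duration of each gate switching interval.
    """
    schedule = []
    for gate_mask, duration in entries:
        schedule.append((gate_mask, duration))
        schedule.append((ALL_CLOSED, gsi))
    return schedule


# Eight entries, each opening exactly one priority queue (placeholder durations).
entries = [(1 << prio, 50_000) for prio in range(8)]
schedule = with_gsis(entries, gsi=30)
period = sum(duration for _, duration in schedule)
print(len(schedule), period)
```

Doubling the number of list entries is the price of this isolation, which matters for the MAT capacity discussed in the scalability evaluation.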
First, we measured the deviation of the observed values from the configured duration of tGCL entries without introducing GSIs. The resulting statistics were inconsistent, with a mean deviation across all measurements of \(\mu(\hat{\delta}_{\text{slice}})=\qty{-22.8}{}\) and a median of \(\qty{450}{}\). These apparent deviations are not meaningful because consecutive entries frequently overlapped at their boundaries, as explained in Section V-B3.
首先,我们在未引入 GSI 的情况下,测量了观测值相对于 tGCL 条目配置持续时间的偏差。所得统计结果并不一致,所有测量的平均偏差为 \(\mu(\hat{\delta}_{\text{slice}})=\qty{-22.8}{}\),中位数为 \(\qty{450}{}\)。这些表面上的偏差并不具有意义,因为如第 V-B3 节所解释的,连续条目经常在其边界处发生重叠。
平均值 -22.8 与中位数 450 已保留;`\qty{-22.8}{}` 和 `\qty 450` 均缺少单位,需结合上下文确认单位;“apparent deviations”译为“表面上的偏差”符合语义。因存在单位缺失和统计量看似异常,建议人工复核。
The deviation of the measured values from the configured duration of tGCL entries using GSIs is presented in Figure 16. The metric is computed per priority, i.e., for each \(\pi\) and entry \(i\) as \(\hat{\delta}^{i,\pi}_{\text{slice}}\), and the histogram is shown aggregated across all priorities because the behavior is identical for all.
使用 GSI 时,测得值相对于 tGCL 条目配置持续时间的偏差见图 16。该指标按优先级计算,即对于每个 \(\pi\) 和条目 \(i\),计算为 \(\hat{\delta}^{i,\pi}_{\text{slice}}\);由于所有优先级的行为相同,直方图以聚合所有优先级的方式显示。
“per priority”译为“按优先级”;\(\pi\)、\(i\)、\(\hat{\delta}^{i,\pi}_{\text{slice}}\) 已保留;图 16 保留。未发现明显问题。
The measured distribution in Figure 16 shows two dominant modes: one around \(\qty{-60}{ns}\) and one around \(\qty{30}{ns}\), separated by a valley around the median of \(\qty{-19}{}\). The bimodal distribution results from how delays of consecutive entries interact. A large delay at a boundary makes the current tGCL entry longer than configured, creating the positive cluster. The following tGCL entry then starts late and becomes shorter, creating the negative cluster. A deviation of exactly zero is unlikely since it would require two consecutive delays to be almost identical, which is rare in practice. The overall median is slightly negative, reflecting that shortened entries occur somewhat more frequently. The minimum of \(\qty{-239}{ns}\) represents a rare worst case where one tGCL entry is extended by nearly the maximum possible delay and the neighboring tGCL entry experiences no delay and is consequently shortened by the same amount.
图 16 中的测量分布显示出两个主导模态:一个位于约 \(\qty{-60}{ns}\) 附近,另一个位于约 \(\qty{30}{ns}\) 附近,两者之间由约 \(\qty{-19}{}\) 的中位数附近的谷值分隔。双峰分布源于连续条目的延迟如何相互作用。边界处的较大延迟会使当前 tGCL 条目长于配置值,从而形成正值簇。随后的 tGCL 条目随后会延迟开始并变短,从而形成负值簇。恰好为零的偏差不太可能出现,因为这将要求两个连续延迟几乎相同,而这在实践中很少见。总体中位数略为负值,反映出缩短的条目出现得稍微更频繁。最小值 \(\qty{-239}{ns}\) 表示一种罕见的最坏情况:某个 tGCL 条目被延长了接近最大可能延迟的幅度,而相邻的 tGCL 条目没有经历延迟,并因此被缩短了相同的幅度。
-60 ns、30 ns、-239 ns 已保留;中位数 `\qty -19` 原文缺少单位,疑似 ns,译文保留为空单位形式;双峰分布的因果解释已完整保留。因中位数单位缺失,需人工复核。
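The coupling between neighboring entries can be made concrete with a deliberately simplified model, which is an assumption of this sketch and not taken from the paper: each boundary between entries happens some amount later than scheduled, and the deviation of an entry is the delay of its closing boundary minus the delay of its opening boundary. The delay values are hypothetical.

```python
def duration_deviations(boundary_delays):
    """Deviation of each entry under the simplified boundary model.

    boundary_delays[k] is how late boundary k occurs relative to the schedule.
    Entry i spans boundaries i and i+1, so a late closing boundary lengthens
    it and a late opening boundary shortens it.
    """
    return [end - start
            for start, end in zip(boundary_delays, boundary_delays[1:])]


# Hypothetical boundary delays with two unusually late boundaries.
delays = [5, 60, 5, 5, 40, 10]
deviations = duration_deviations(delays)
print(deviations)
```

Each late boundary produces one positive deviation immediately followed by a negative one, and a deviation of zero appears only where two consecutive boundaries are delayed by the same amount, which is the mechanism the paragraph above uses to explain the two modes.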
Scalability is a critical aspect for TSN and DetNet deployments which often involve large numbers of scheduled traffic streams. However, many scheduling algorithms overlook hardware resource constraints such as limited MAT capacity [ 8 ]. In this section, we evaluate the scalability of our P4-TAS implementation by analyzing the number of supported tGCL and sGCL entries, and the number of streams for DetNet and TSN stream identification.
可扩展性是 TSN 和 DetNet 部署的一个关键方面,这类部署通常涉及大量调度流量流。然而,许多调度算法忽视了硬件资源约束,例如有限的 MAT 容量 [8]。在本节中,我们通过分析所支持的 tGCL 和 sGCL 条目数量,以及用于 DetNet 和 TSN 流标识的流数量,来评估我们 P4-TAS 实现的可扩展性。
TSN、DetNet、MAT、tGCL、sGCL 均保留;引用 [8] 保留;“scheduled traffic streams”译为“调度流量流”。未发现明显问题。
Many TSN scheduling algorithms assume an unlimited number of GCL entries [ 8 ]. However, real hardware imposes strict limits due to finite memory resources which may make a schedule undeployable if exceeded. Therefore, we evaluate the number of tGCL and sGCL entries that can be stored in the proposed P4-TAS implementation. First, we analyze how many MAT entries are required per GCL entry. Then, we describe the available GCL sizes in P4-TAS.
许多 TSN 调度算法假设 GCL 条目数量不受限制 [8]。然而,真实硬件由于有限的存储资源而施加严格限制;如果超过这些限制,可能会使一个调度无法部署。因此,我们评估在所提出的 P4-TAS 实现中可以存储的 tGCL 和 sGCL 条目数量。首先,我们分析每个 GCL 条目需要多少个 MAT 条目。然后,我们描述 P4-TAS 中可用的 GCL 规模。
引用 [8] 保留;“undeployable”译为“无法部署”;“GCL sizes”译为“GCL 规模”,可能也可译为“GCL 大小/容量”,但语义无明显风险。未发现明显问题。
Internal delays
内部延迟
本段仅为标题或小节名;译为“内部延迟”准确。未发现明显问题。
No. streams
流的数量
该段像是表格列名或图表标签,缺少上下文;“No.”译为“数量”符合常见论文表头用法。
\(\Delta_{\text{internal, max}} = 86\ \text{ns}\)
\(\Delta_{\text{internal, max}} = 86\ \text{ns}\)
原文存在“Δ internal, max =”与公式重复表达,疑似从排版中抽取出的公式标签;数值 86 ns 已保留。
Predict6G Open Source TSN Platform [ 25 ]
Predict6G 开源 TSN 平台 [25]
专有项目名 Predict6G 和缩写 TSN 保留;引用编号 [25] 已保留。该段可能是表格或图注条目,缺少上下文。
10k entries
1 万个条目
“10k entries”译为“1 万个条目”;该段缺少表格上下文,无法确认条目类型。
In P4-TAS, the tGCL is modeled as a MAT in the egress P4 control block which matches on the relative timestamp and on one of the eight queues. Therefore, for each tGCL entry, eight range MAT entries are required, i.e., one for each gate. The sGCL is modeled as a MAT in the ingress P4 control block which matches on the relative timestamp and on the stream gate identifier. Thus, for an sGCL entry, only a single range MAT entry is required.
在 P4-TAS 中,tGCL 被建模为出口 P4 控制块中的一个 MAT,该 MAT 基于相对时间戳以及八个队列之一进行匹配。因此,对于每个 tGCL 条目,需要八个范围 MAT 条目,即每个门对应一个条目。sGCL 被建模为入口 P4 控制块中的一个 MAT,该 MAT 基于相对时间戳以及流门标识符进行匹配。因此,对于一个 sGCL 条目,只需要一个范围 MAT 条目。
tGCL、sGCL、MAT、egress/ingress P4 control block、relative timestamp、stream gate identifier 等术语已按技术含义翻译并保留缩写;“eight queues”“eight range MAT entries”“one for each gate”“single range MAT entry”等数量和对应关系未发现明显问题。
Because range matching in the TNA is limited, a range-to-ternary conversion is employed to enable matching on the relative timestamp. The conversion algorithm described in Section IV-C2 replaces a single range MAT entry with multiple ternary MAT entries to increase the resolution of matched timestamps. However, this approach also increases the number of required MAT entries.
由于 TNA 中的范围匹配受到限制,因此采用了范围到三元匹配的转换,以便能够基于相对时间戳进行匹配。第 IV-C2 节中描述的转换算法用多个三元 MAT 条目替代单个范围 MAT 条目,以提高被匹配时间戳的分辨率。然而,这种方法也会增加所需 MAT 条目的数量。
TNA、range matching、range-to-ternary conversion、ternary MAT entries 等术语已保留或准确翻译;“increase the resolution of matched timestamps”译为“提高被匹配时间戳的分辨率”合理;因果和转折关系未发现明显问题。
Let \(w\) denote the number of bits used to represent the range. Gupta et al. [40] showed that a range of width \(w\) bits can be transformed into at most \(2\cdot w-2\) ternary entries. Consequently, in the worst case, modeling a tGCL containing \(n\) tGCL entries results in up to \(8\cdot n\cdot(2\cdot w-2)\) ternary MAT entries, and modeling an sGCL containing \(n\) entries results in up to \(n\cdot(2\cdot w-2)\) ternary MAT entries. In practice, the actual number is often significantly lower due to favorable alignment. For example, if the range's width is a power of two and properly aligned, a single ternary entry suffices. The tGCL configuration from Section V-B, which used a period of \(\qty{400}{}\) divided into eight \(\qty{50}{}\) tGCL entries with additional \(\qty{30}{}\) GSIs, required 1512 ternary MAT entries. Numerous studies have proposed optimized range-to-ternary conversion algorithms aimed at reducing ternary entry counts [49, 50, 51, 52], and these may be explored in future work.
设 \(w\) 表示用于表示该范围的位数。Gupta 等人 [40] 表明,宽度为 \(w\) 位的范围最多可以转换为 \(2 \cdot w - 2\) 个三元条目。因此,在最坏情况下,对包含 \(n\) 个 tGCL 条目的 tGCL 进行建模会产生最多 \(8 \cdot n \cdot (2 \cdot w - 2)\) 个三元 MAT 条目,而对包含 \(n\) 个条目的 sGCL 进行建模会产生最多 \(n \cdot (2 \cdot w - 2)\) 个三元 MAT 条目。在实践中,由于有利的对齐,实际数量通常显著更低。例如,如果范围的宽度是 2 的幂并且正确对齐,则一个三元条目就足够。第 V-B 节中的 tGCL 配置的周期为 \(\qty{400}{}\),被划分为八个 \(\qty{50}{}\) 的 tGCL 条目,并带有额外的 \(\qty{30}{}\) GSI;该配置需要 1512 个三元 MAT 条目。许多研究已经提出了优化的范围到三元匹配转换算法,目标是减少三元条目数量 [49, 50, 51, 52],这些算法可在未来工作中探索。
原文中出现“w w”“2 ⋅ w − 2 2\\cdot w-2”“n n”等重复,疑似抽取或 OCR 重复,译文按数学含义去重;公式 \(2 \cdot w - 2\)、\(8 \cdot n \cdot (2 \cdot w - 2)\)、\(n \cdot (2 \cdot w - 2)\) 已保留。`\qty 400`、`\qty 50`、`\qty 30` 缺少单位,可能是 LaTeX 单位抽取残缺;“a sGCL entry in \(n \cdot ...\)”原文语法也可能应为 sGCL containing \(n\) entries,需结合原 PDF 确认。
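To make the entry counts tangible, the following sketch implements the classic prefix-expansion form of range-to-ternary conversion. It is not necessarily the algorithm of Section IV-C2, whose details are not reproduced here; the function names and the `(value, mask)` representation are assumptions of this example.

```python
def range_to_ternary(lo, hi, width):
    """Cover the inclusive range [lo, hi] with ternary (value, mask) pairs.

    A key k matches a pair when (k & mask) == value. Each pair describes an
    aligned power-of-two block, i.e. a prefix with trailing wildcard bits.
    """
    full = (1 << width) - 1
    entries = []
    while lo <= hi:
        # Largest aligned block that can start at lo ...
        size = (lo & -lo) if lo else (1 << width)
        # ... shrunk until it no longer runs past hi.
        while size > hi - lo + 1:
            size >>= 1
        entries.append((lo, full & ~(size - 1)))
        lo += size
    return entries


def covers(entries, key):
    """True if any ternary entry matches the key."""
    return any((key & mask) == value for value, mask in entries)


def worst_case_tgcl_entries(n, width):
    """Upper bound from the text: 8 gates * n entries * (2w - 2)."""
    return 8 * n * (2 * width - 2)


print(len(range_to_ternary(1, 14, 4)))   # unaligned range, 4-bit keys
print(range_to_ternary(8, 15, 4))        # aligned power-of-two range
print(worst_case_tgcl_entries(n=16, width=48))
```

The range [1, 14] on 4-bit keys hits the bound of 2 · 4 − 2 = 6 entries, while the aligned range [8, 15] collapses to a single entry, which is why well-aligned entry boundaries keep the real MAT usage far below the worst case.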
The Intel Tofino™ 2 can generate up to 16 different periodic streams. Since one of those is required for the continuous TAS control traffic, P4-TAS can configure 15 streams for period-completion frames. Therefore, 15 different GCL periods can be configured which can be shared between PSFP and TAS.
Intel Tofino™ 2 最多可以生成 16 个不同的周期性流。由于其中一个需要用于连续的 TAS 控制流量,因此 P4-TAS 可以为周期完成帧配置 15 个流。因此,可以配置 15 个不同的 GCL 周期,并且这些周期可以在 PSFP 和 TAS 之间共享。
Intel Tofino™ 2、TAS、P4-TAS、GCL、PSFP 等术语已保留;16、1、15、15 的数量关系一致;“period-completion frames”译为“周期完成帧”需与全文术语表保持一致。
The tGCL MAT in P4-TAS can hold 39,000 MAT entries which is a result of the available hardware resources. The MAT is therefore large enough to accommodate multiple tGCLs. We increased the size of the stream gate MAT of PSFP from 2048 in P4-PSFP [ 6 ] to 6000 MAT entries. This is possible because the implementation is ported to the Tofino™ 2 ASIC which has more resources available.
P4-TAS 中的 tGCL MAT 可以容纳 39,000 个 MAT 条目,这是可用硬件资源所决定的。因此,该 MAT 足够大,可以容纳多个 tGCL。我们将 PSFP 的流门 MAT 大小从 P4-PSFP [6] 中的 2048 个 MAT 条目增加到 6000 个 MAT 条目。这是可能的,因为该实现被移植到了 Tofino™ 2 ASIC,而该 ASIC 具有更多可用资源。
39,000、2048、6000 等数字已保留;tGCL MAT、stream gate MAT、P4-PSFP、Tofino™ 2 ASIC 等术语未发现明显问题;因果关系“more resources available”已准确表达。
The ternary match operates on a \(\qty{48}{bit}\) timestamp, which covers a time span of up to \(\qty{78}{h}\). While such a range exceeds practical requirements, reducing the number of matched bits has no impact due to internal hardware alignment.
三元匹配作用于一个 \(\qty{48}{bit}\) 时间戳,从而可覆盖最长 \(\qty{78}{h}\) 的时间范围。尽管这样的范围超过了实际需求,但由于内部硬件对齐,减少被匹配的位数并不会产生影响。
原文中的 `\qty 48bit` 和 `\qty 78h` 可能是 LaTeX 单位宏抽取形式,译文保留为公式宏;“resolutions of up to \(\qty 78h\)”更准确可能指可覆盖约 78 小时的时间范围,而非“分辨率”,需结合上下文确认。硬件对齐逻辑已保留。
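The 78 h figure can be checked with a short calculation, assuming the 48-bit timestamp counts nanoseconds (consistent with the nanosecond configuration granularity stated earlier, but an assumption of this sketch nonetheless).

```python
NS_PER_HOUR = 3_600 * 10**9   # nanoseconds in one hour

span_ns = 2**48               # number of distinct 48-bit timestamp values
span_hours = span_ns / NS_PER_HOUR

print(round(span_hours, 2))
```

Under that assumption a 48-bit counter wraps after a little more than 78 hours, so the quoted value describes the representable time span rather than a resolution.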
These resource limits show that P4-TAS can support multiple tGCLs and sGCLs, ensuring deployability of realistic TSN schedules.
这些资源限制表明,P4-TAS 能够支持多个 tGCL 和 sGCL,从而确保现实 TSN 调度的可部署性。
术语 tGCL、sGCL、TSN 已保留;“resource limits”译为“资源限制”符合上下文。未发现明显问题。
To evaluate the scalability of stream identification in our implementation, we analyze the structure and capacity of the MAT used for DetNet and TSN streams.
为了评估我们实现中流识别的可扩展性,我们分析了用于 DetNet 和 TSN 流的 MAT 的结构与容量。
MAT 缩写保留;DetNet、TSN 术语保留;逻辑为“评估可扩展性,因此分析结构与容量”。未发现明显问题。
A single MAT handles both DetNet and TSN stream identification. It uses ternary keys consisting of the S-Label for DetNet streams and Ethernet destination address, VLAN ID, and IPv4 source and destination address for TSN streams [ 4 ]. The use of ternary matches enables wildcarding and aggregation. For example, an entry matching only on the S-Label enables DetNet-to-TSN translation while another matching on MAC destination and VLAN ID supports TSN-to-DetNet translation or TSN stream identification.
单个 MAT 同时处理 DetNet 和 TSN 流识别。它使用三元键,这些三元键由 DetNet 流的 S-Label,以及 TSN 流的以太网目的地址、VLAN ID、IPv4 源地址和 IPv4 目的地址组成 [4]。使用三元匹配能够实现通配和聚合。例如,一个仅匹配 S-Label 的条目能够实现 DetNet 到 TSN 的转换,而另一个匹配 MAC 目的地址和 VLAN ID 的条目则支持 TSN 到 DetNet 的转换或 TSN 流识别。
“ternary keys/matches”译为“三元键/三元匹配”;S-Label、VLAN ID、IPv4、MAC 均保留;引用 [4] 保留;两个示例的方向 DetNet-to-TSN 与 TSN-to-DetNet 未颠倒。未发现明显问题。
The MAT supports 8196 entries which allows at least 8196 DetNet or TSN streams to be identified. In cases where IP-based identification is used, ternary aggregation can further increase the number of identifiable streams. A survey by Stüber et al. [ 10 ] reports deployments with up to 10,812 streams, indicating that our implementation can support realistic industrial-scale scenarios with appropriate use of wildcarding. These results show that the design is scalable and capable of supporting a number of streams typical in TSN/DetNet deployments.
该 MAT 支持 8196 个条目,这使得至少 8196 个 DetNet 或 TSN 流能够被识别。在使用基于 IP 的识别的情况下,三元聚合可以进一步增加可识别流的数量。Stüber 等人 [10] 的一项调查报告了多达 10,812 个流的部署,这表明,在适当使用通配的情况下,我们的实现能够支持现实的工业规模场景。这些结果表明,该设计具有可扩展性,并且能够支持 TSN/DetNet 部署中典型数量的流。
数字 8196、10,812 保留;“at least”译为“至少”;“with appropriate use of wildcarding”译为“适当使用通配”准确;引用 [10] 保留。未发现明显问题。
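How a single ternary table can serve both DetNet and TSN identification may be easier to see in a small software model. Everything concrete below is hypothetical: the field names, the label, MAC address, and VLAN values, the action names, and the first-match lookup order. A real P4 table would also use header validity rather than a missing dictionary key.

```python
def matches(entry, packet):
    """True if every keyed field equals the packet field under its mask.

    Fields absent from the entry key are wildcards; fields absent from the
    packet are treated as 0 in this simplified model.
    """
    return all((packet.get(field, 0) & mask) == (value & mask)
               for field, (value, mask) in entry["key"].items())


def identify(table, packet):
    """Return the action of the first matching entry, or None."""
    for entry in table:
        if matches(entry, packet):
            return entry["action"]
    return None


# Hypothetical table: one DetNet entry keyed on the S-Label only, and one
# TSN entry keyed on destination MAC and VLAN ID with IPv4 fields wildcarded.
table = [
    {"key": {"s_label": (1001, 0xFFFFF)}, "action": "detnet_to_tsn"},
    {"key": {"dst_mac": (0x0180C2000001, 0xFFFFFFFFFFFF),
             "vlan_id": (42, 0xFFF)}, "action": "tsn_stream"},
]

print(identify(table, {"s_label": 1001}))
print(identify(table, {"dst_mac": 0x0180C2000001, "vlan_id": 42,
                       "ipv4_src": 0x0A000001}))
print(identify(table, {"dst_mac": 0x0180C2000001, "vlan_id": 7}))
```

Because the TSN entry ignores the IPv4 fields, every flow sharing that MAC and VLAN pair is identified by one table entry, which is the aggregation effect the paragraph above relies on.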
In this section, we summarize and compare capabilities of several TAS-capable platforms, including our P4-TAS prototype on a P4-programmable ASIC. The overview is shown in Table II.
在本节中,我们总结并比较了若干具备 TAS 能力的平台的功能,包括我们在 P4 可编程 ASIC 上实现的 P4-TAS 原型。概览如表 II 所示。
“TAS-capable platforms”译为“具备 TAS 能力的平台”;P4-programmable ASIC 译为“P4 可编程 ASIC”;表 II 保留。未发现明显问题。
Similar to P4-TAS, the Predict6G open-source platform provides TAS and DetNet integration, but its documentation does not specify configurable time resolution, internal delay behavior, or scalability [25]. Commercial platforms such as NXP's SJA1105TEL [33, 34], Microchip's SparX-5i family [35] and PD-IES008 [36, 37], and Relyum's RELY-TSN12 [48] provide hardware support for TAS and PSFP. However, publicly available specifications typically stop at time granularity, queue counts, or GCL sizes while omitting internal delay sources that ultimately determine schedule precision. Although these devices advertise nanosecond-level configuration granularity, our evaluation with P4-TAS shows that practical gate updates are constrained by internal delays in the range of tens of nanoseconds. This phenomenon is further supported by Eppler et al. [9], who report internal TAS delays of approximately \(\qty{2.6}{\micro}\) in Relyum switches. This demonstrates that internal TAS timing effects are inherent to the mechanism itself and not specific to P4-based implementations, and can be significantly larger in proprietary devices. Since such delays are rarely documented, the effective precision of commercial solutions is difficult to assess from datasheets alone. Further, those delays are internal and cannot be measured in commercial black-box switches.
与 P4-TAS 类似,Predict6G 开源平台提供 TAS 和 DetNet 集成,但其文档未说明可配置的时间分辨率、内部延迟行为或可扩展性 [25]。NXP 的 SJA1105TEL [33, 34]、Microchip 的 SparX-5i 系列 [35] 和 PD-IES008 [36, 37],以及 Relyum 的 RELY-TSN12 [48] 等商业平台为 TAS 和 PSFP 提供硬件支持。然而,公开可获得的规格通常止步于时间粒度、队列数量或 GCL 大小,而省略了最终决定调度精度的内部延迟来源。尽管这些设备宣称具有纳秒级配置粒度,但我们对 P4-TAS 的评估表明,实际的门更新受到几十纳秒范围内内部延迟的约束。Eppler 等人 [9] 报告 Relyum 交换机中的内部 TAS 延迟约为 \(\qty{2.6}{\micro}\),这进一步支持了这一现象。这表明,内部 TAS 定时效应是该机制本身固有的,并非 P4 实现所特有,而且在专有设备中可能显著更大。由于此类延迟很少被记录在文档中,仅凭数据手册难以评估商业解决方案的有效精度。此外,这些延迟是内部延迟,无法在商业黑盒交换机中测量。
厂商、型号与引用均保留;TAS、DetNet、PSFP、GCL 等术语保留;“nanosecond-level configuration granularity”译为“纳秒级配置粒度”;“tens of nanoseconds”译为“几十纳秒”。原文中的 `\qty 2.6 \micro` 可能是 LaTeX 公式抽取残缺,单位疑似微秒,需要人工确认。
Our P4-TAS prototype achieves a comparable time configuration granularity of \(\qty{1}{}\) while also documenting measured internal delays. Specifically, we observed a worst-case internal delay for a tGCL entry of \(\Delta_{\text{internal,max}}=\qty{86}{}\) in the evaluation. In contrast, vendor platforms either do not document such values (e.g., NXP SJA1105TEL, Microchip PD-IES008) or only disclose partial information (e.g., SparX-5i, which specifies a queue opening delay of \(\delta_{\text{queue}}=\qty{512}{}\)). The transparency in P4-TAS allows a more realistic assessment of achievable schedule precision. In terms of scalability, P4-TAS supports a larger number of flows (\(\geq 8196\)) and larger GCLs (39k for TAS and 6k for PSFP) compared to the commercial platforms. For GCL entries, the range-to-ternary conversion overhead evaluated in Section V-C1 must be considered.
我们的 P4-TAS 原型实现了可比的时间配置粒度 \(\qty{1}{}\),同时还记录了实测的内部延迟。具体而言,在评估中,我们观察到 tGCL 条目的最坏情况内部延迟为 \(\Delta_{\text{internal,max}}=\qty{86}{}\)。相比之下,厂商平台要么不记录此类数值,例如 NXP SJA1105TEL、Microchip PD-IES008;要么只披露部分信息,例如 SparX-5i,其指定的队列开启延迟为 \(\delta_{\text{queue}}=\qty{512}{}\)。P4-TAS 的透明性允许更现实地评估可实现的调度精度。在可扩展性方面,与商业平台相比,P4-TAS 支持更多数量的流(\(\geq 8196\))和更大的 GCL(TAS 为 39k,PSFP 为 6k)。对于 GCL 条目,必须考虑范围到三元转换的开销,我们已在第 V-C1 节中对此进行了评估。
数字 1、86、512、≥8196、39k、6k 和章节 V-C1 保留;“range-to-ternary conversion overhead”译为“范围到三元转换的开销”。原文存在明显公式/单位抽取残缺,如 `\qty 1`、`\qty 86 ... \qty{86}{}`、`\qty 512 ... \qty{512}{}` 缺少单位或重复公式,需结合论文 PDF 人工确认。
A further distinction is line-rate throughput. While most commercial TSN-capable switch ASICs target 1–25 Gb/s per port in automotive and industrial domains, P4-TAS operates at up to 400 Gb/s per port. This enables its use not only in TSN deployments but also in high-speed data center environments where integration with DetNet becomes relevant. Hence, P4-TAS extends the design space beyond today’s embedded and industrial use cases.
另一个区别是线速吞吐量。虽然大多数具备 TSN 能力的商业交换机 ASIC 在汽车和工业领域以每端口 1–25 Gb/s 为目标,但 P4-TAS 每端口最高可运行在 400 Gb/s。这使其不仅能够用于 TSN 部署,也能够用于高速数据中心环境,在这些环境中,与 DetNet 的集成变得相关。因此,P4-TAS 将设计空间扩展到了当今嵌入式和工业用例之外。
速率 1–25 Gb/s 和 400 Gb/s 保留;“line-rate throughput”译为“线速吞吐量”;“beyond today’s embedded and industrial use cases”译为“扩展到了当今嵌入式和工业用例之外”。未发现明显问题。
Overall, the comparison shows that commodity hardware already supports TAS and PSFP functionality, but vendors disclose little about their internal timing behavior. This lack of transparency makes it difficult to design schedules with nanosecond accuracy. P4-TAS fills this gap by explicitly characterizing internal delays, enabling more predictable and transparent use of TSN.
总体而言,该比较表明,商用硬件已经支持 TAS 和 PSFP 功能,但厂商很少披露其内部定时行为。这种透明性不足使得以纳秒精度设计调度变得困难。P4-TAS 通过明确表征内部延迟填补了这一空白,从而使 TSN 的使用更加可预测且更加透明。
TAS、PSFP、TSN 保留;因果逻辑为“缺少披露导致难以纳秒级调度,P4-TAS 通过表征内部延迟弥补”。未发现明显问题。
We presented P4-TAS, a P4-based implementation of the Time-Aware Shaper (TAS) for TSN on the Intel Tofino™ 2. To achieve periodicity of tGCLs, we leveraged a mechanism for periodic behavior in P4 switches using the internal packet generator as a clock source. Building on this foundation, we introduced a mechanism for precise queue state control for the TAS using an internally generated, continuous stream of TAS control traffic. P4-TAS also incorporates PSFP, where we improved our earlier P4-PSFP design by eliminating recirculation and increasing the GCL time resolution to the nanosecond scale using a range-to-ternary algorithm. Additionally, P4-TAS includes an MPLS/TSN translation layer enabling TSN traffic shaping and policing to be applied to DetNet flows at line rate up to \(\qty{400}{}\). Beyond functional capabilities, our implementation provides transparent insights into internal timing behavior, which is rarely documented in commercial platforms.
我们提出了 P4-TAS,即一种在 Intel Tofino™ 2 上面向 TSN 的、基于 P4 的时间感知整形器(Time-Aware Shaper, TAS)实现。为了实现 tGCL 的周期性,我们利用了一种在 P4 交换机中实现周期性行为的机制,该机制使用内部包生成器作为时钟源。在此基础上,我们引入了一种用于 TAS 的精确队列状态控制机制,该机制使用内部生成的、连续的 TAS 控制流量。P4-TAS 还纳入了 PSFP,其中我们通过消除再循环,并使用范围到三元算法将 GCL 时间分辨率提高到纳秒尺度,改进了我们早期的 P4-PSFP 设计。此外,P4-TAS 包含一个 MPLS/TSN 转换层,使得 TSN 流量整形和监管能够以最高 \qty 400 的线速应用于 DetNet 流。除了功能能力之外,我们的实现还对内部定时行为提供了透明见解,而这在商业平台中很少被记录在文档中。
Intel Tofino™ 2、TAS、tGCL、PSFP、P4-PSFP、GCL、MPLS/TSN、DetNet 等术语保留;“recirculation”译为“再循环”;“traffic shaping and policing”译为“流量整形和监管”。原文 `up to \qty 400` 存在单位缺失,结合前文疑似 400 Gb/s,但本段输入缺单位,需人工确认。
Our evaluation covered three aspects. First, we identified and quantified undocumented internal delay sources, including traffic generator inaccuracy, queue opening delay, and TAS control frame delays. We identified a theoretical worst-case accumulated delay of about 86 ns for a tGCL entry, which is orders of magnitude smaller than the microsecond-scale gate transition delays reported for some commercial TSN switches [ 9 ]. Second, we externally measured the duration of tGCL entries and compared it to the configured duration. In this process, we identified that the internal queue opening delay leads to transitional behavior where queues of a tGCL entry are not yet closed while queues of the next tGCL entry have already begun forwarding. Therefore, we introduced gate switching intervals (GSIs), short explicit tGCL entries in which all queues are closed, to mitigate this effect. Third, we analyzed scalability, demonstrating support for 39,000 tGCL entries and more than 8,196 flows, covering the requirements of current industrial deployments.
我们的评估涵盖了三个方面。首先,我们识别并量化了未记录在文档中的内部延迟来源,包括流量生成器不准确性、队列开启延迟以及 TAS 控制帧延迟。我们识别出,对于一个 tGCL 条目,理论最坏情况下的累积延迟约为 86 ns,这比一些商用 TSN 交换机所报告的微秒级门转换延迟小数个数量级 [9]。其次,我们从外部测量了 tGCL 条目的持续时间,并将其与配置的持续时间进行了比较。在这一过程中,我们发现内部队列开启延迟会导致一种过渡行为,即某个 tGCL 条目的队列尚未关闭,而下一个 tGCL 条目的队列已经开始转发。因此,我们引入了门切换间隔(gate switching intervals,GSIs),即短的、显式的 tGCL 条目,在这些条目中所有队列均被关闭,以缓解这一影响。第三,我们分析了可扩展性,证明其支持 39,000 个 tGCL 条目以及超过 8,196 条流,覆盖了当前工业部署的需求。
术语 tGCL、TAS、TSN、GSI 均已保留并补充中文解释;数字 39,000、8,196、引用 [9] 已保留。抽取文本中的 `\qty 86` 缺少单位,已依据评估部分的最坏情况求和(11 + 63 + 12 = 86)及其纳秒尺度的测量语境补为 86 ns,仍建议对照 PDF 确认。
Compared with existing ASIC- and FPGA-based TSN platforms, P4-TAS offers similar configurability in terms of time granularity, but additionally exposes internal delays that directly affect scheduling precision. This transparency allows schedules to be designed with awareness of hardware-induced deviations, something not possible with today’s black-box hardware. Moreover, P4-TAS supports line rates up to 400 Gb/s per port and seamless DetNet/TSN translation, extending applicability from industrial and automotive networks to high-throughput environments such as data centers and carrier backbones.
与现有基于 ASIC 和 FPGA 的 TSN 平台相比,P4-TAS 在时间粒度方面提供了类似的可配置性,但还额外暴露了会直接影响调度精度的内部延迟。这种透明性使得调度表能够在知晓硬件所引入偏差的情况下进行设计,而这是当今黑盒硬件无法实现的。此外,P4-TAS 支持每端口最高 400 Gb/s 的线速,并支持无缝的 DetNet/TSN 转换,从而将适用范围从工业网络和车载网络扩展到数据中心和运营商骨干网等高吞吐量环境。
ASIC、FPGA、TSN、DetNet 等缩写已保留;逻辑上“类似可配置性”与“额外暴露内部延迟”的转折关系已保留。`\qty 400Gb/s` 原文 LaTeX 数量命令格式可能缺少花括号,但含义可明确理解为 400 Gb/s;段末原文缺少句号,不影响译文含义。
Future work will focus on improving scalability, for example by optimizing range-to-ternary usage, and on investigating how delay characterization can be incorporated into scheduling algorithms to increase robustness against hardware-level variability. Further, we will explore the integration of P4-TAS into a PTP-synchronized multi-hop TSN testbed to validate gate scheduling and latency guarantees under realistic TSN-specific traffic patterns.
未来工作将重点关注提升可扩展性,例如通过优化 range-to-ternary 的使用方式;同时还将研究如何把延迟表征纳入调度算法,以增强其对硬件层面可变性的鲁棒性。此外,我们将探索把 P4-TAS 集成到一个由 PTP 同步的多跳 TSN 测试床中,以在现实的 TSN 特定流量模式下验证门调度和时延保证。
PTP、TSN、P4-TAS 等缩写已保留;“range-to-ternary”可能是特定 P4/TCAM 规则映射术语,直译保留较稳妥。逻辑上“提升可扩展性”和“纳入延迟表征以增强鲁棒性”两项未来工作已区分;未发现明显问题。
中文逐段译稿
工业自动化和汽车系统中的时间关键型应用依赖于能够提供确定性保证的网络,例如低时延、最小抖动以及几乎为零的丢包。为满足这些严格要求,两种互补技术已经出现:时间敏感网络(Time-Sensitive Networking,TSN)和确定性网络(Deterministic Networking,DetNet)。TSN 是一套 IEEE 802.1 标准,通过引入用于流量整形 [ 1, 2, 3 ] 和可靠性 [ 4 ] 的机制来增强以太网,使其支持实时通信。相比之下,DetNet 是由 IETF 标准化的第 3 层技术,它通过在多个 IP 跳之间实现有界时延和高可靠性,将这些能力扩展到路由网络 [ 5 ]。
术语 TSN、DetNet、IEEE 802.1、IETF、Layer 3 均已保留并翻译;数字、引用编号 [ 1, 2, 3 ]、[ 4 ]、[ 5 ] 未遗漏;“virtually zero packet loss”译为“几乎为零的丢包”准确;逻辑对比关系“相比之下”已体现。未发现明显问题。
调度流量是 TSN 中的一个概念,其中 talker 的传输时间被协调,以避免在中间节点中排队,从而使帧以最小延迟穿越网络。这种协调称为调度,并产生一个网络范围的调度表,以确保确定性转发。TSN 中存在多种流量整形机制,例如基于信用的整形器(Credit-Based Shaper,CBS)、异步流量整形器(Asynchronous Traffic Shaper,ATS)以及时间感知整形器(Time-Aware Shaper,TAS)。其中,TAS 通过利用一种类似时分多址(Time Division Multiple Access,TDMA)的方法来保护调度流量免受干扰,例如来自尽力而为流的干扰,从而确保低时延和有界延迟。这是通过周期性打开和关闭队列的传输门来实现的。此外,按流过滤与监管(Per-Stream Filtering and Policing,PSFP)是 TSN 的一种机制,它结合基于速率和基于时间的监管,以丢弃不符合调度的帧。
CBS、ATS、TAS、TDMA、PSFP 缩写及全称均已保留;“talkers”按 TSN 术语保留为 talker;“best-effort flows”译为“尽力而为流”;“transmission gates”译为“传输门”;“out-of-schedule frames”译为“不符合调度的帧”。未发现明显问题。
DetNet 将 TSN 概念与 MPLS 或 IP 等技术结合加以利用。在 DetNet 部署中,TSN 可以作为一个子层使用,以在子网络中提供确定性转发。在实践中,TSN 实现通常是基于硬件的,并针对最高可达 \qty 1 的带宽进行了优化。然而,DetNet 面向更广泛的使用场景,包括高吞吐量应用和数据中心骨干网。这种层次化组合允许形成可扩展的设计,将 DetNet 的高速能力与 TSN 的精确定时机制结合起来。两种技术的集成使得复杂的流量整形和可靠性机制能够跨骨干基础设施应用。
术语 DetNet、TSN、MPLS、IP 已保留;逻辑上“However”转折已体现。“up to \qty 1”疑似从 LaTeX 单位命令解析不完整,缺少单位或数值上下文,可能原文应为某个带宽值。该处需人工核对原 PDF 或源文件。
本文的贡献是多方面的。我们提出 P4-TAS,这是在 Intel Tofino™ 2 交换 ASIC 上基于 P4 实现的 TAS,能够在数据平面中实现符合 TSN 的整形和监管。我们的设计引入了一种新的周期性队列控制机制,该机制使用连续的、内部生成的 TAS 控制帧流。它建立在我们此前 P4-PSFP 工作 [ 6 ] 中的一种机制之上,该机制使用内部数据包生成器作为时钟源。P4-TAS 还集成了 PSFP,并包含一个 MPLS/TSN 转换层 [ 7 ],从而能够以最高 400 Gb/s 的线速将 TSN 流量整形应用于 DetNet 流。本工作的一个关键贡献是识别并量化影响调度精度的内部处理延迟。此类延迟通常不会在商用 TSN 能力交换机中被公开记录,但对于精确的流量调度至关重要 [ 8, 9 ]。我们的实现揭示了数据平面中的多个延迟来源,并提供了纳秒尺度的相应测量结果。我们证明,我们的方法实现的内部延迟比某些商用平台所报告的延迟低数个数量级 [ 9 ],并提供了透明性。最后,我们评估了 P4-TAS 的可扩展性,并将其与现有 TAS 实现进行了比较。
P4-TAS、Intel Tofino™ 2、ASIC、P4、TAS、TSN、P4-PSFP、PSFP、MPLS/TSN、DetNet 均已保留;400 Gb/s、纳秒尺度、引用 [ 6 ]、[ 7 ]、[ 8, 9 ]、[ 9 ] 未遗漏;“orders of magnitude smaller”译为“低数个数量级”准确。未发现明显问题。
本文其余部分的结构如下。在第 II 节中,我们提供关于 TSN、DetNet 和 P4 编程语言的背景信息。在第 III 节中,我们回顾关于结合 TSN 与 DetNet 系统的相关工作,以及关于这些技术的仿真和硬件实现的相关工作。第 IV 节介绍 P4 实现,包括我们的系统架构和 P4-TAS 机制。在第 V 节中,我们评估内部延迟,从外部测量传输门精度,分析可扩展性,并将 P4-TAS 与其他实现进行比较。最后,在第 VI 节中,我们对本文作出总结。
章节编号 II、III、IV、V、VI 均已保留;TSN、DetNet、P4、P4-TAS 术语一致;“externally”已译为“从外部”。未发现明显问题。
在本节中,我们提供关于 TSN、DetNet 和 P4 编程语言的技术背景。
术语 TSN、DetNet、P4 已保留;句意完整。未发现明显问题。
我们首先简要概述 TSN,解释调度流量,然后总结 TAS 和 PSFP 的概念。
TSN、TAS、PSFP 缩写已保留;段落为章节引导句,逻辑顺序“首先、然后”已体现。未发现明显问题。
TSN 是一套 IEEE 802.1 标准,用于扩展传统以太网,以支持具有严格服务质量(Quality of Service,QoS)保证的确定性通信。TSN 网络由互连的网桥和端站构成。一个数据流被称为 TSN 流,它源自一个 talker(发送站),并被导向一个或多个 listener(接收站)。TSN 流基于其 VLAN 标签、第 2 层目的地址以及可选的其他头部字段来识别 [ 4 ]。在允许一个流进行传输之前,它必须经过准入控制 [ 4 ]。这一过程涉及 talker 通过流描述符通告其流量特征,例如时延要求。随后,网络通过评估资源可用性并相应进行预留,来决定是否接纳该流。
IEEE 802.1、QoS、TSN stream、talker、listener、VLAN、Layer 2、admission control 等术语已准确处理;引用 [ 4 ] 两处均保留;“optionally other header fields”中的可选限定词已保留。未发现明显问题。
在 TSN 中,流可以被调度,也就是说,它们在 talker 处的发送时间会被协调,使得帧在中间网桥处经历最小延迟。这种协调是离线计算的,并产生一个网络范围的调度表。此类调度表的计算不在本文工作的范围之内。关于 TSN 中调度的更多信息,可参见 Stüber 等人的综述 [ 10 ]。亚微秒尺度的时间同步对于 TSN 中的调度至关重要。为此,会采用精确时间协议(Precision Time Protocol,PTP)等协议 [ 11 ]。
“i.e.”解释关系已体现;“offline”译为“离线计算”;“sub-microsecond scale”译为“亚微秒尺度”;Stüber et al.、PTP、引用 [ 10 ]、[ 11 ] 已保留。未发现明显问题。
TSN 中的调度流通常被分配最高优先级,并且必须受到保护,以免受较低优先级流量的影响,例如尽力而为流量。这确保调度帧在其预定时间到达每个中间节点。TAS 和 PSFP 是使用门控机制来保护调度流量的机制。二者如图 1 所示,并在下文中解释。
TAS、PSFP、TSN 术语已保留;“highest priority”“lower-priority traffic”“best-effort traffic”逻辑关系清晰;Figure 1 译为“图 1”。未发现明显问题。
TAS 在 IEEE Std 802.1Qbv [1] 中被标准化,在出口侧提供基于时间的整形。每个出口端口提供八个 FIFO 队列,这些队列与来自 VLAN 标签的帧优先级相关联 [12]。这些队列由传输门控制,而传输门又由门控列表(gate control list,GCL)控制。GCL 是一个周期性的条目序列,每个条目指定一个时间片以及相应的门状态。在 TAS 中,我们将其称为传输 GCL(transmission GCL,tGCL)。每个 tGCL 条目指定一个持续时间以及一个八位向量,该向量指示八个传输门中的哪些处于打开或关闭状态。位于传输门打开的队列中的帧按 FIFO 顺序传输,而位于传输门关闭的队列中的帧则保持缓冲。所有条目处理完之后,该序列以周期长度 \(h\) 周期性重复。
术语 TAS、IEEE Std 802.1Qbv、FIFO、VLAN、GCL、tGCL 均已保留并译出;数字“八个”“八位”一致;逻辑上区分了打开门传输、关闭门缓冲。末尾“cycle length h h”疑似公式或 OCR/抽取重复,应人工核对原文公式表示。
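The tGCL semantics described above can be illustrated with a small host-side reference model. This is a minimal sketch, not the P4-TAS data-plane code: the names `TgclEntry`, `open_gates`, and `is_open` are hypothetical, the convention that bit q of the gate vector belongs to queue q is an assumption, and the modulo operation is used only because the sketch does not have to run at line rate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TgclEntry:
    duration_ns: int  # length of the time slice
    gate_mask: int    # 8-bit vector; bit q set means the gate of queue q is open (assumed bit order)


def open_gates(tgcl: List[TgclEntry], t_ns: int) -> int:
    """Return the gate vector in effect at time t_ns for a periodically repeating tGCL."""
    cycle_ns = sum(entry.duration_ns for entry in tgcl)
    offset = t_ns % cycle_ns  # position inside the current cycle
    for entry in tgcl:
        if offset < entry.duration_ns:
            return entry.gate_mask
        offset -= entry.duration_ns
    raise AssertionError("offset is always inside the cycle")


def is_open(tgcl: List[TgclEntry], t_ns: int, queue: int) -> bool:
    """True if frames of the given queue may be transmitted at t_ns, False if they stay buffered."""
    return bool((open_gates(tgcl, t_ns) >> queue) & 1)


# Illustrative schedule: queue 7 owns the first 50 us, queues 0-6 share the remaining 150 us.
example_tgcl = [TgclEntry(50_000, 0b10000000), TgclEntry(150_000, 0b01111111)]
```

With `example_tgcl`, `is_open(example_tgcl, 10_000, 7)` is true, while the same queue is closed at 60,000 ns and open again one full cycle later at 210,000 ns.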
PSFP 在 IEEE Std 802.1Qci [3] 中被标准化,通过将速率监管与基于时间的监管相结合,在入口侧强制实施逐流一致性。通过这种方式,PSFP 确保遵守由准入控制建立的资源边界。虽然速率监管是一种众所周知的机制,但基于时间的监管面向调度流量,并且是本文工作的重点。对于基于时间的监管,每个流都与一个由周期性 GCL 控制的流门相关联,我们将该 GCL 称为流 GCL(stream GCL,sGCL)。sGCL 定义门状态随时间的变化,并由此定义该流被允许的传输窗口。在其允许窗口之外到达的帧会被立即丢弃,即在排队之前丢弃,从而防止它们消耗预留资源。
PSFP、IEEE Std 802.1Qci、GCL、sGCL 等缩写处理一致;“per-stream conformance”译为“逐流一致性”合理;“admitted transmission windows”译为“被允许的传输窗口”保留了准入含义;逻辑和因果关系未发现明显问题。
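Time-based policing with an sGCL can be sketched in the same style. The snippet below is an illustrative model under stated assumptions: an sGCL is represented as a list of `(duration_ns, gate_open)` pairs, the function names are hypothetical, and rate policing is left out entirely.

```python
def stream_gate_open(sgcl, t_rel_ns):
    """Return the stream gate state at position t_rel_ns inside one sGCL cycle."""
    start = 0
    for duration_ns, gate_open in sgcl:
        if start <= t_rel_ns < start + duration_ns:
            return gate_open
        start += duration_ns
    return False  # outside the configured cycle: treated as closed in this sketch


def police(sgcl, arrivals_ns):
    """Keep only the arrivals that fall into an open window; all others are dropped before queuing."""
    cycle_ns = sum(duration_ns for duration_ns, _ in sgcl)
    return [t for t in arrivals_ns if stream_gate_open(sgcl, t % cycle_ns)]


# Illustrative sGCL: the stream may send during the first 100 ns of every 400 ns cycle.
example_sgcl = [(100, True), (300, False)]
```

Frames arriving at 10 ns and 410 ns pass the gate, whereas frames at 150 ns and 799 ns are dropped and never occupy a queue.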
在图 1 中,两个流进入 TSN 网桥。根据它们的 sGCL,第一个流门处于打开状态,而第二个流门处于关闭状态。因此,流 1 的帧被转发,而流 2 的帧被 PSFP 丢弃。随后,这两个流共享同一组出口端口队列,这些队列由 tGCL 控制。此处,只有第一个传输门处于打开状态,因此只有存储在队列 1 中的帧会被传输。
图号、流编号、门状态、PSFP 丢弃、tGCL 控制均保持一致;“queue 1”译为“队列 1”;未发现明显问题。
PSFP 中的流门与 TAS 中的传输门在三个方面不同。第一,流门按流应用,而传输门按出口端口和队列应用。第二,一个 sGCL 条目定义单个流门的状态,而一个 tGCL 条目定义全部八个队列的状态。第三,关闭的流门在排队之前丢弃帧,而关闭的传输门将帧缓冲在队列中。
三点差异完整保留;“per stream”“per egress port and queue”关系明确;“queueing/queue”前后语义一致;未发现明显问题。
DetNet 架构支持具有极低分组丢失率和有界时延的实时应用 [5]。它由 IETF DetNet 工作组标准化。DetNet 运行在网络层,例如 IP 层,并向较低层提供 QoS 和可靠性,例如向 MPLS 和 TSN 提供 QoS 和可靠性。DetNet 适用于处于单一管理控制之下的网络,例如私有 WAN,或覆盖整个园区的网络。
DetNet、IETF、QoS、MPLS、TSN、WAN 等术语保留;“bounded latency”译为“有界时延”;“to the lower layer”语义略不寻常,但按原文译为向较低层提供能力;未发现明显问题。
DetNet 中的有界时延是通过消除节点内部由队列拥塞导致的分组丢失来实现的。为此,在每个节点上预留带宽和缓冲区资源。资源预留可以使用资源预留协议(Resource reServation Protocol,RSVP)完成。对于 DetNet 内部的流量工程,可以应用 IEEE 802.1 工作组定义的机制,例如 TAS。
“packet loss resulting from queue congestion within a node”译为“节点内部由队列拥塞导致的分组丢失”,因果准确;RSVP 原文大小写“reServation”保留在英文全称中;未发现明显问题。
DetNet 架构将数据平面功能划分为两个子层。第一,服务子层提供 DetNet QoS 机制,例如有界时延和服务保护,例如通过向分组添加序列号信息来实现。第二,转发子层在 DetNet 服务子层处理节点之间提供连接性 [13]。DetNet 存在多种数据平面技术,例如基于 MPLS 的 DetNet [13],以及基于 IP 的 DetNet [14]。对于基于 MPLS 的 DetNet,转发子层和服务子层通过 MPLS 标签来标识,这些标签分别称为转发标签(Forward label,F-Label)和服务标签(Service label,S-Label)。一个或多个 F-Label 用于将分组转发通过 DetNet 域。S-Label 位于 F-Label 之后,用于标识 DetNet 流。基于所识别出的 DetNet 流,应用 QoS 机制。此外,DetNet 控制字(DetNet control word,d-CW)位于 MPLS 栈之后。该控制字包含一个用于 DetNet 保护机制的序列号。
两个子层、DetNet over MPLS/IP、F-Label、S-Label、d-CW 的结构关系完整;“follows after”均译为“位于……之后”;“Detnet”原文大小写不一致,译文统一为 DetNet;未发现明显问题。
已有一些标准使用 DetNet MPLS 数据平面来互连 TSN 网络 [15, 7]。对于基于 TSN 的 DetNet MPLS,DetNet 流在 DetNet / TSN 域边界处基于 S-Label 进行标识,并被转换为 TSN 流。为此,IEEE Std 802.1CBdb [16] 定义了一种 MPLS DetNet 流标识方法,该方法识别 S-Label 并压入一个新的 VLAN ID。随后,基于新的 VLAN ID 应用 TSN 流标识。借助这些互连的数据平面,TAS 和 PSFP 等 TSN 服务可以应用于 DetNet 流。
标准号 IEEE Std 802.1CBdb、引用 [15, 7]、S-Label、VLAN ID 均保留;“pushes a new VLAN ID”译为“压入一个新的 VLAN ID”符合网络封装语境;“DetNet MPLS over TSN”译为“基于 TSN 的 DetNet MPLS”可能需结合全文确认技术栈顺序。
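The S-Label based stream identification can be sketched as follows. The MPLS label stack entry layout (20-bit label, 3-bit traffic class, 1-bit bottom-of-stack flag, 8-bit TTL) is the standard one. Everything else is an assumption made for illustration: the helper names, the `vlan_by_s_label` table, and the reading that the S-Label is the bottom-of-stack entry because the d-CW follows the MPLS stack.

```python
import struct


def encode_label(label, tc, bottom_of_stack, ttl):
    """Pack one 32-bit MPLS label stack entry."""
    return struct.pack("!I", (label << 12) | (tc << 9) | (bottom_of_stack << 8) | ttl)


def parse_mpls_stack(data):
    """Return ([(label, tc, bottom_of_stack, ttl), ...], remaining_bytes)."""
    entries, offset = [], 0
    while True:
        (word,) = struct.unpack_from("!I", data, offset)
        offset += 4
        entry = (word >> 12, (word >> 9) & 0x7, (word >> 8) & 0x1, word & 0xFF)
        entries.append(entry)
        if entry[2]:  # bottom-of-stack flag ends the label stack
            return entries, data[offset:]


def s_label_to_vlan(data, vlan_by_s_label):
    """Look up the VLAN ID to push for the DetNet flow identified by the S-Label."""
    entries, _ = parse_mpls_stack(data)
    s_label = entries[-1][0]  # assumed: the S-Label is the last label before the d-CW
    return vlan_by_s_label.get(s_label)


# One F-Label (100), one S-Label (200), then a 4-byte d-CW carrying sequence number 1.
example_packet = encode_label(100, 0, 0, 64) + encode_label(200, 0, 1, 64) + b"\x00\x00\x00\x01"
```

A packet whose S-Label has no table entry yields `None`, which a real implementation would treat as an unidentified flow.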
Programming Protocol-independent Packet Processors(P4)是一种领域专用编程语言,用于在可编程 P4 的交换机中实现自定义数据平面 [17]。P4 程序可以操作分组并作出转发决策,以实现自定义算法。下文中,我们描述 P4 流水线、分组生成器,以及一种称为高级流控制(advanced flow control,AFC)的特性的概念。Hauser 等人的一篇综述提供了关于 P4 的更多信息 [18]。
P4 全称、P4-programmable switches、packet generator、AFC 均已保留;“manipulate packets”译为“操作分组”;作者引用 Hauser et al. 处理为“Hauser 等人”;未发现明显问题。
可编程 P4 的交换机被称为目标,并实现一种特定架构。Intel Tofino™ 2 交换 ASIC 是一种基于硬件的 P4 目标。通常,P4 架构遵循流水线式结构。Tofino Native Architecture(TNA)的流水线,即 Intel Tofino™ 所使用架构的流水线,如图 2 所示。
“targets”译为“目标”并保留 P4 语境;Intel Tofino™ 2、ASIC、TNA 均保留;最后一句中“the architecture used by the Intel Tofino™”指代 TNA,译文已体现;未发现明显问题。
TNA 由一个入口块和一个出口块组成,每个块都具有可编程解析器、控制块以及反解析器。在入口控制块中处理帧之后,帧会在 TNA 的流量管理器组件中排队。该组件是可配置的,但不是可编程的 [19]。
术语 TNA、ingress block、egress block、parser、control block、deparser、traffic manager 均已按 P4/交换芯片语境翻译;引用 [19] 保留;“configurable but not programmable”的对比关系已保留。未发现明显问题。
P4 程序中的控制块定义算法的逻辑。它们利用元数据进行数据包处理。一个 P4 程序定义两种不同类型的元数据。第一,用户定义元数据在流水线处理期间存储信息。第二,固有元数据包含由架构给出的信息,例如,一个帧的入口时间戳以及入口端口。控制块由 match+action 表组成。MAT 的概念如图 3 [20] 所示,并在下文中解释。
metadata 译为“元数据”,intrinsic metadata 译为“固有元数据”;match+action tables 保留关键英文表达并译为“表”;MAT 缩写保留;图号和引用 [20] 保留。未发现明显问题。
在一个 MAT 中,选定的数据包头字段和元数据形成一个组合键。每个数据包都会根据所选的键字段在 MAT 中进行匹配。当在表中发生匹配时,会执行一个关联的动作,该动作可以操作数据包数据,或者做出转发决策。数据平面定义 MAT 的结构,即键字段和动作。然而,这些 MAT 的内容由控制平面填充。此外,寄存器是 P4 中一种常用特性,允许对数据包进行有状态处理。
composite key 译为“组合键”;data plane/control plane 分别译为“数据平面/控制平面”;stateful processing 译为“有状态处理”;逻辑上保留了结构由数据平面定义、内容由控制平面填充的区分。未发现明显问题。
P4 控制块支持逻辑表达式和简单算术表达式,但不支持循环,以维持线速处理。为了支持迭代算法,可以对数据包进行再循环。第一次通过流水线后被修改的头部在第二次通过时可用。再循环会引入时延,并且需要专用端口。像 TNA 这样的架构提供内部再循环端口,或者可以为再循环配置物理端口。
loops 译为“循环”,line rate processing 译为“线速处理”,recirculation 译为“再循环”;“Modified headers from the first pass are available in the second”中的 first/second pass 语义已保留。未发现明显问题。
Intel Tofino™ 原生支持使用 PTP 进行时间同步 [11, 19]。此外,Kannan 等人 [21] 提出了一种 PTP 的数据平面实现,可以利用该实现来实现高精度时间同步。
Intel Tofino™、PTP、Kannan et al. 均保留;“natively supports”译为“原生支持”;引用 [11, 19] 和 [21] 保留。未发现明显问题。
TNA 提供一个内部数据包生成器,该生成器可以被配置为通过一个专用内部端口生成数据包。生成的数据包会在流水线中处理。可以配置多个具有不同触发器的应用来触发数据包生成,例如周期性触发器。此外,数据包生成器可以被配置为生成 \(B\) 个批次,每个批次的大小为 \(K\) 个数据包,以支持数据包突发。生成的数据包包含一个由流量生成器添加的数据包生成头部。这个数据包生成头部标识应用、批次编号以及该批次中的数据包编号 [19]。
“B B”和“K K”疑似由 PDF/公式抽取导致的重复或格式识别问题,可能原意为变量 B 和 K;packet generator 与 traffic generator 在原文中分别出现,已分别译为“数据包生成器”和“流量生成器”;batch number、packet number 语义保留。需人工确认 B B、K K 的公式格式。
Intel Tofino™ 2 特有的一项功能是高级流控制(AFC),它能够控制流量管理器中一个出口端口的队列,即分派帧或暂缓帧。在流水线处理期间,通过向数据包的固有元数据中写入一个 AFC 值来操纵队列状态。由于该操作必须由一个传入数据包触发,因此每一次队列状态变化都由数据包到达来发起。单个数据包可以精确控制一个队列。AFC 值基于出口端口、队列 ID 以及期望的队列状态来计算。重要的是,受控队列不需要对应于被处理数据包自身所分配的出口端口或队列。
advanced flow control 译为“高级流控制”,AFC 保留;dispatching or holding back frames 译为“分派帧或暂缓帧”;intrinsic metadata 译为“固有元数据”;最后一句的“不需要对应于被处理数据包自身所分配的出口端口或队列”准确保留了控制对象与当前包转发目标可分离的含义。未发现明显问题。
在本节中,我们回顾关于 TSN 与 DetNet 系统结合的相关工作,以及关于这些技术的仿真和硬件实现的相关工作。
TSN、DetNet 保留;simulations and hardware implementations 译为“仿真和硬件实现”;段落功能为章节引导,逻辑清楚。未发现明显问题。
由于 TSN 和 DetNet 在促进 5G 网络中的超低时延通信方面发挥关键作用,二者的集成近年来受到了相当多的关注。Nasrallah 等人 [22] 对 TSN 和 DetNet 技术进行了全面综述,强调了它们对于 5G 环境中时间关键型应用的重要性。在此基础上,Abuibaid 等人 [23] 开展了一项案例研究,测量了 TSN 和 DetNet 在实际 5G 场景中的性能。此外,Wüsteney 等人 [24] 提出了一个用于时间敏感通信的时延模型,该通信穿越集成了 TSN 和 DetNet 的网络。Menendez 等人 [25] 提出了一种使用 XDP 和 eBPF 的 TAS 软件实现。此外,他们通过一个基于 MPLS over UDP/IP 数据平面的方案,将 TSN 功能集成到 DetNet 环境中。尽管他们的开源实现代表了向 TSN/DetNet 集成迈出的重要一步,但他们的评估没有考虑内部定时行为,并且只测量了最高到 \qty 600 的流量速率。
作者名、引用编号、TSN、DetNet、5G、TAS、XDP、eBPF、MPLS over UDP/IP 均保留;“internal timing behavior”译为“内部定时行为”;末尾 “\qty 600” 明显缺少单位或 LaTeX 参数,可能为抽取残缺,无法确定是 Mbit/s、kpps 或其他单位。需人工结合原 PDF/上下文确认。
尽管取得了这些进展,在实现高效硬件实现方面仍然存在挑战:这些硬件实现需要集成 TSN 和 DetNet 功能,同时还要识别并量化内部定时行为,而这正是本工作旨在解决的问题。
“Despite these advances”的转折关系已保留;“realizing efficient hardware implementations”与“integrate TSN and DetNet functionalities”已完整表达;“identifying and quantifying internal timing behaviour”译为“识别并量化内部定时行为”;英式拼写 behaviour 不影响含义。未发现明显问题。
已经开发出许多仿真框架,用于对 TSN [26, 27]、DetNet [28],或二者同时进行建模 [29]。特别是,Addanki 等人 [29] 提供了一个仿真器,该仿真器集成了网络层 DetNet 的构建模块以及链路层 TSN 的构建模块。Polverini 等人 [30] 描述了一种面向 BMv2 软件目标的、基于 P4 的 DetNet 实现,该实现利用 SRv6 数据平面来实现可靠性。虽然这类仿真对于探索 TSN 与 DetNet 之间的交互很有价值,但它们并不能充分解决真实世界部署中的挑战。在硬件中实现时间敏感机制会因资源约束和定时精度要求而引入额外复杂性。Ahmed 等人 [31, 32] 提供了 TSN 中 CBS 和 ATS 的 FPGA 实现,而我们此前则展示了在 ASIC 上对 PSFP 机制进行的基于 P4 的硬件实现 [6]。
术语 TSN、DetNet、P4、BMv2、SRv6、CBS、ATS、PSFP、FPGA、ASIC 均已保留;引用编号完整;“network layer/link layer”译为“网络层/链路层”准确;“reliability”译为“可靠性”无明显风险。未发现明显问题。
若干商用硬件平台支持 TAS 和 PSFP。NXP 的车规级 SJA1105TEL 交换机 ASIC 为每个端口提供 8 个出口队列,时间粒度为 \qty 200。该 ASIC 支持最多 1024 条流的时间门控传输 [33, 34]。类似地,Microchip 的 SparX-5i [35] 和 PD-IES008 [36, 37] 系列提供纳秒级粒度的时间间隔配置,并支持最多 10,000 个 tGCL 条目。这些平台表明 TAS 已可在硬件中获得,但已发表的信息通常止步于高级别功能描述,例如队列数量、GCL 大小或时间粒度。第 V-D 节提供了这些能力的摘要,并在表 II 中将其与 P4-TAS 进行比较。尽管这些平台声称具有纳秒级配置粒度,但由于未公开说明的内部延迟和硬件限制,在实践中无法可靠地以这种尺度进行队列更新。Eppler 等人 [9] 最近的一项工作量化了商用 TSN 交换机内部这类未公开说明的定时行为。他们的测量揭示了数量级为数百纳秒到数微秒的内部调度和门转换延迟;这对调度综合具有重要意义,并且如果不加以考虑,可能导致错过传输窗口。
“\qty 200”疑似源文本中单位缺失或 LaTeX 识别不完整,无法确认是 200 ns 还是其他单位;数字 8、1024、10,000、数百纳秒至数微秒均保留;tGCL、GCL、TAS、PSFP 术语一致;“schedule synthesis”译为“调度综合”可能需结合论文领域确认,但可接受。因公式/单位残缺,需人工复核。
在本文中,我们提出了在可编程 ASIC 上对若干选定 TSN 和 DetNet 机制的硬件实现。不同于以往的学术平台或商用平台,我们的设计能够对内部延迟进行透明评估,从而为其在真实系统中的行为和集成提供更深入的认识。
“selected TSN and DetNet mechanisms”译为“若干选定 TSN 和 DetNet 机制”保留限定;“transparent evaluation of internal delays”译义准确;逻辑关系清晰。未发现明显问题。
在本节中,我们描述 P4-TAS 交换机的实现,该交换机在 Intel Tofino™ 2 交换 ASIC 上结合了 PSFP 和 TAS 机制。首先,我们描述系统架构以及其与 DetNet 域的集成。然后,我们介绍 TAS 机制在 P4 中的实现。最后,我们解释对 P4-PSFP 实现所作的改进。源代码已在 GitHub 上公开提供 [38]。
P4-TAS、Intel Tofino™ 2、PSFP、TAS、P4-PSFP、GitHub 均保留;“incorporating”译为“结合了”准确;章节结构顺序完整。未发现明显问题。
P4-TAS 实现被设计为一个提供 TSN 功能的以太网交换机。它执行 IEEE Std 802.1CB [4] 中定义的 TSN 流识别,并应用基于 TAS 的流量整形以及基于 PSFP 的监管。这些机制使 P4-TAS 能够在 TSN 域内部原生运行,并为 TSN 流提供确定性转发。除了在纯 TSN 网络中的角色之外,P4-TAS 还可以充当 DetNet 域与 TSN 域之间的边界元素。在这种情况下,它处理传入的 MPLS 封装 DetNet 流,并基于 IEEE Std 802.1CBdb [16] 将其转换为 TSN 流。这使 DetNet 能够利用 TSN 子层进行调度和整形。该集成场景如图 4 所示。
IEEE Std 802.1CB、IEEE Std 802.1CBdb、MPLS、DetNet、TSN、TAS、PSFP 均保留;“policing”译为“监管”符合网络 QoS 语境;“border element”译为“边界元素”可接受。未发现明显问题。
在进入 TSN 域的入口处(步骤 1),P4-TAS 交换机基于 S-Label 压入一个 VLAN 标签,从而将 DetNet 流转换为 TSN 流。随后,使用目的 MAC 地址和所压入的 VLAN 标签来应用 TSN 流识别(步骤 2)。接着,已识别的 TSN 流接受基于 TAS 和 PSFP 的流量整形与监管(步骤 3),并且该帧通过 TSN 域转发。在出口处(步骤 4),移除 VLAN 标签,以恢复原始 DetNet 流。
步骤 1-4 完整;S-Label、VLAN、MAC、TAS、PSFP 保留;“pushing a VLAN tag”译为“压入一个 VLAN 标签”符合封装处理语境;“restore the original DetNet flow”译为“恢复原始 DetNet 流”准确。未发现明显问题。
IEEE Std 802.1Qbv 中定义的 TAS 根据 tGCL 周期性地打开和关闭多个出口队列。周期性行为,例如 GCL 中的周期性行为,并非 P4 原生支持。此外,Intel Tofino™ 2 上的队列状态可以用 AFC 控制,但这类改变只能由帧的到达触发,并且每个帧只能更新一个出口端口的单个队列。为了在这些约束下实现 TAS,P4-TAS 结合了三个构建模块:用于 tGCL 的周期性时间模型、用于触发 AFC 更新的专用连续 TAS 控制帧流,以及一个将控制帧映射到队列状态变化的 tGCL MAT。下面将对它们进行描述。最后,将概述它们如何在流水线中协同运行。
IEEE Std 802.1Qbv、TAS、tGCL、GCL、P4、Intel Tofino™ 2、AFC、MAT 均保留;“a dedicated stream of continuous TAS control frames”译为“专用连续 TAS 控制帧流”准确;“one egress port”未误译为多个端口。未发现明显问题。
在可编程 P4 硬件中建模 GCL 的周期性具有挑战性,因为周期性行为并非原生支持。Intel Tofino™ 中的时间戳是绝对时间戳,即其值会连续增加,而 GCL 中的时间片是相对的,并遵循一种周期性模式。因此,每个帧的绝对时间戳都必须映射到其在当前 GCL 周期内的相应位置。虽然这可以通过取模运算实现,但这类运算过于复杂,无法在数据平面中以线速执行。
“absolute/relative timestamps”译为“绝对/相对时间戳”准确;“modulo operation”译为“取模运算”准确;“line rate”译为“线速”符合术语;逻辑完整。未发现明显问题。
在我们先前的工作 [6] 中,我们描述了一种在基于 P4 的 PSFP 实现中建模 sGCL 周期性的方法。在该方法中,我们利用 Intel Tofino™ 的内部数据包生成器作为时钟源。在每个 GCL 周期结束时,内部数据包生成器会生成一个周期完成帧。该帧的入口时间戳被存储在一个寄存器中,并引用最后一个已完成周期的时间戳。对于所有其他帧,即非周期完成帧,入口流水线从该帧的绝对时间戳中减去这个已存储的值,以获得正在进行的 sGCL 周期内的相对时间戳。通过这种方式,每个帧都被映射到一个周期长度的相对时间窗口中。交换机的绝对硬件时钟可以被一致地使用,同时 sGCL 被视为一个重复的条目列表。我们利用这一机制来实现 TAS 中 tGCL 的周期性。然而,不同于 PSFP 中的 sGCL,在那里每个 sGCL 条目打开或关闭单个流门,即接纳或丢弃一个帧;tGCL 条目必须通过打开和关闭队列来控制多个传输门。因此,该周期性机制作为 TAS 的基础,但还需要额外机制,以便在每个 tGCL 条目期间连续更新所有出口队列的门状态。
sGCL、PSFP、Intel Tofino™、GCL、tGCL、TAS 均保留;“period-completion frame”译为“周期完成帧”一致;“stream gate/transmission gate”分别译为“流门/传输门”,术语区分清楚;因长句中“references the timestamp”译为“引用……时间戳”可接受但略偏直译。未发现明显问题。
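The register-based periodicity mechanism can be modelled in a few lines. This is a sketch of the idea only: the class `PeriodClock` and its method names are hypothetical, and a Python attribute stands in for the data-plane register.

```python
class PeriodClock:
    """Maps absolute timestamps to positions inside the current GCL cycle without a modulo."""

    def __init__(self):
        self.last_period_start_ns = 0  # models the register written by period-completion frames

    def on_period_completion(self, ingress_ts_ns):
        """Called for every period-completion frame emitted by the packet generator."""
        self.last_period_start_ns = ingress_ts_ns

    def relative_ts(self, ingress_ts_ns):
        """Called for every other frame: a single subtraction replaces the modulo operation."""
        return ingress_ts_ns - self.last_period_start_ns


clock = PeriodClock()
clock.on_period_completion(1_200_000)         # a cycle ended at t = 1.2 ms
first_position = clock.relative_ts(1_250_000)  # frame arrives 50 us into the new cycle
clock.on_period_completion(1_600_000)         # next cycle boundary (400 us period)
second_position = clock.relative_ts(1_600_010)
```

The subtraction is cheap enough for a match-action pipeline, which is the reason for replacing the modulo operation with a stored reference timestamp.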
Intel Tofino™ 2 上的队列可以通过在流水线中处理一个帧的内在 AFC 元数据来打开或关闭。被控制的队列不需要对应于该帧自身被分配到的出口端口或队列。单个帧可以精确控制该交换机上一个端口的一个队列。在本节中,我们解释对所有队列进行及时控制的概念,该概念实现了 tGCL。
AFC、Intel Tofino™ 2、tGCL 保留;“intrinsic AFC metadata”译为“内在 AFC 元数据”可能需结合 Intel/P4 术语确认,也可译为“固有 AFC 元数据”;“exactly one queue of one port”译为“一个端口的一个队列”准确;“timely control”译为“及时控制”语义基本准确。未发现明显问题。
为了在 Intel Tofino™ 2 上使用 AFC 实现 tGCL 队列状态变更,每一次队列状态更新都必须由一个帧的到达来触发。为此,P4-TAS 采用内部数据包生成器来连续产生 TAS 控制帧。这些帧以连续的、背靠背的 8 帧批次生成,使得每个队列都被分配一个帧。每个 TAS 控制帧都携带内部元数据,其中包含它所控制的队列和出口端口的标识符。到达后,该帧在 tGCL 周期中的位置会基于该帧的到达时间戳计算,具体如第 IV-B1 节所述。然后,将 tGCL 周期中的位置以及内部元数据与 tGCL MAT 进行匹配;该 tGCL MAT 指定在该时间点对应队列应当打开还是关闭。
术语 tGCL、AFC、TAS、MAT、intrinsic metadata 已按技术语境保留或译为“内部元数据”;“back-to-back batches of eight”译为“背靠背的 8 帧批次”,数字无误;逻辑为控制帧到达触发状态更新,未发现明显问题。
TAS 控制帧采用最小尺寸(64 B),并且除内部元数据之外不包含任何载荷。它们以最小到达间隔被连续生成,以确保队列状态精确遵循已配置的 tGCL。该机制不会消耗用户流量的带宽,因为内部数据包生成器和一个专用内部端口被专门用于 TAS 控制流量。在实践中,连续的 TAS 控制帧之间存在一个很短的延迟。由于队列只能在其关联的控制帧到达时改变状态,因此可能发生延迟打开或延迟关闭。我们在第 V-A3 节评估这一行为的影响。
抽取文本中的“\qty 64B”为 LaTeX 数量宏残留,已规范为“64 B”(64 字节,即以太网最小帧长)。“inter-arrival time”译为“到达间隔”,逻辑无误;状态更新延迟的因果关系保留完整。
tGCL 在出口流水线中被编码为一个 MAT,并如图 5 所示。第 IV-B2 节中的 TAS 控制帧会与其进行匹配。
“egress pipeline”译为“出口流水线”;“matched against it”指与 tGCL MAT 匹配,指代关系清楚。未发现明显问题。
MAT 中的每个条目对应于一个 tGCL 条目中的 8 个队列之一,也就是说,每个 tGCL 条目需要 8 个 MAT 条目。该条目指定一个队列当前应当处于打开状态还是关闭状态。查找键由 tGCL 中的相对时间戳、队列标识符以及出口端口组成,其中相对时间戳按照第 IV-B1 节计算。MAT 动作会把一个预先计算好的 AFC 值写入该帧的内部元数据;该 AFC 值对队列、出口端口和状态进行编码。这会触发队列状态更新。队列状态更新存在一个很小的延迟,该延迟在第 V-A2 节中进行评估。
数字“8 个队列 / 8 个 MAT 条目”无误;lookup key、action、precomputed AFC value 的结构关系保留完整;“state”在上下文中为打开/关闭状态。未发现明显问题。
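The structure of the tGCL MAT can be illustrated with a table of dictionaries. This is a hedged sketch: the row layout, the function names, and the returned tuple (a stand-in for the precomputed AFC value, whose real encoding is not described here) are all assumptions.

```python
def build_tgcl_mat(port, tgcl):
    """Expand a tGCL, given as (duration_ns, gate_mask) pairs, into eight rows per entry."""
    rows, start = [], 0
    for duration_ns, gate_mask in tgcl:
        for queue in range(8):
            rows.append({
                "port": port,
                "queue": queue,
                "lo": start,                    # inclusive start of the time slice
                "hi": start + duration_ns - 1,  # inclusive end of the time slice
                "open": bool((gate_mask >> queue) & 1),
            })
        start += duration_ns
    return rows


def match_control_frame(rows, port, queue, rel_ts_ns):
    """Match one TAS control frame on (relative timestamp, queue, egress port)."""
    for row in rows:
        if row["port"] == port and row["queue"] == queue and row["lo"] <= rel_ts_ns <= row["hi"]:
            return (port, queue, row["open"])  # placeholder for the AFC value written to metadata
    return None  # table miss: no queue state update


# Two 50 ns slices on port 1: queue 0 is open first, then queue 1.
example_rows = build_tgcl_mat(1, [(50, 0b00000001), (50, 0b00000010)])
```

Because each control frame addresses exactly one queue, a tGCL with n entries needs 8n rows per port in this model, which matches the eight-rows-per-entry rule stated above.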
第 IV-B1 节至第 IV-B3 节中的机制在 P4-TAS 流水线内协同运行,如图 6 所示。
章节范围 “IV-B1 – IV-B3” 翻译准确;“operate together”译为“协同运行”。未发现明显问题。
首先,所生成的周期完成帧标记 tGCL 和 sGCL 周期的边界,并在步骤 1 中维护用于相对时间戳计算的参考。在这里,每个持续时间为 \(h\) 的周期结束时生成一个单独的帧,并且第 \(j\) 个周期的时间戳 \(t^{h}_{j}\) 会被存储在一个寄存器中,以供后续处理。随后,这些帧会被丢弃。
“h h”“j j”“t j h t^{h}_{j}”明显像公式抽取或 OCR/LaTeX 解析残缺,已尽量按原形式保留;需人工结合论文公式确认其正确写法,可能应为周期长度 h、第 j 个周期时间戳 \(t^h_j\)。其余逻辑为周期完成帧用于更新相对时间参考。
其次,TAS 控制帧由数据包生成器以最小到达间隔连续生成。对于这些帧,在步骤 2 中会计算其相对于 tGCL 上一个已流逝周期的时间戳。该时间戳用于将 TAS 控制帧匹配到 tGCL MAT 的对应条目。在将 TAS 控制帧排入流量管理器的专用队列之后,AFC 机制会在步骤 3 的出口侧被应用。在这里,会使用第 IV-B3 节所描述的 MAT,并基于当前 tGCL 条目打开或关闭对应队列。随后,这些帧会被丢弃。
“last elapsed period”译为“上一个已流逝周期”,保留相对时间语义;traffic manager 译为“流量管理器”;AFC 在出口侧应用的顺序与原文一致。未发现明显问题。
第三,TSN 数据帧在入口流水线中由 PSFP 机制进行监管,以强制其符合已准入的速率和传输时间。对于这些帧,在步骤 4 中会计算其相对于 sGCL 上一个已流逝周期的时间戳,并在步骤 5 中应用 PSFP。在该步骤中,这些帧会受到监管,并根据其优先级被丢弃或排入队列。队列状态则基于 tGCL 处于打开状态或关闭状态。一旦队列打开,帧就会以 FIFO 方式转发。
“policed”译为“监管”,符合 PSFP/流量监管语境;“admitted rates and transmission times”译为“已准入的速率和传输时间”;FIFO 保留。未发现明显问题。
本工作中使用的 Intel Tofino™ 2 ASIC 为 IEEE 1588 PTP [19] 提供硬件支持。PTP 通过在网络节点之间交换带时间戳的消息来对齐它们的本地时钟,从而实现亚微秒级同步精度。该功能可以完全使用 ASIC 的板载资源来实现 [19]。然而,P4-TAS 实现并不包含 PTP 同步机制,因为集成此类功能超出了本工作的范围,并且对于本文所给出的评估并非必需。已有工作表明,在基于 Tofino 的交换机上,可以通过将硬件时间戳与控制平面时钟管理相结合来实现精确的 PTP 同步 [39, 19],甚至也可以完全在数据平面内实现 [21]。不过,我们所评估的 TAS 功能和内部延迟特性在很大程度上独立于全网范围的时间对齐。未来工作将探索把 P4-TAS 集成到一个同步的多跳 TSN 测试平台中,以便在多个设备之间实现协调的、时间感知的调度。
IEEE 1588 PTP、ASIC、Tofino、TSN 等缩写保留;引用编号 [19]、[39, 19]、[21] 无误;“sub-microsecond”译为“亚微秒级”;“network-wide time alignment”译为“全网范围的时间对齐”。未发现明显问题。
P4-TAS 纳入了先前的 P4-PSFP 实现 [6]。PSFP 组件中的流过滤器、流门控器和流量计按照 IEEE Std 802.1Qci [3] 实现。P4-PSFP 的功能已在 [6] 中得到广泛评估。在本节中,我们描述对 P4-PSFP 的改进,这些改进消除了再循环,并提高了 GCL 的时间分辨率。
“stream filter、stream gate、flow meter”分别译为“流过滤器、流门控器、流量计”,符合 TSN/PSFP 组件语义;IEEE Std 802.1Qci、引用 [3]、[6] 保留;“recirculation”译为“再循环”。未发现明显问题。
P4-PSFP 出于两个原因对 TSN 流量进行再循环。第一,计算在一个 sGCL 中的相对位置无法放入单次流水线迭代中完成。第二,IEEE Std 802.1Qci [3] 中定义的可选最大帧大小过滤器需要帧大小信息,而该信息只有在出口块中才可用,但丢弃必须发生在入口块中。因此,再循环是必要的,并会增加一个已知的恒定延迟。对于 P4-TAS,我们将 P4-PSFP 的实现从 Intel Tofino™ 移植到 Tofino™ 2,在后者中,更大的流水线允许在一次通过中计算 GCL 位置。我们还移除了可选最大帧大小过滤器,从而消除了对再循环的需要。如果需要,可以重新加入该过滤器,代价是进行一次再循环。
术语 TSN、sGCL、GCL、入口块、出口块、再循环均已保留或准确翻译;IEEE 标准号与引用 [3] 未改动;“known constant delay”译为“已知的恒定延迟”准确。未发现明显问题。
P4-PSFP 中的 sGCL 条目被建模为采用范围匹配类型的 MAT 条目。然而,范围匹配类型在 TNA 中受到限制,并且只能匹配 \qty 20bits。TNA 中的时间戳为 \qty 48bits,具有纳秒粒度。因此,在 P4-PSFP 中,会从时间戳的中间截取 \qty 20bits,以启用范围匹配类型并实现适当的时间分辨率。因此,GCL 的最小分辨率为 \qty 2,最大分辨率约为 \qty 4。分辨率更低的 GCL 条目,或者持续时间更长的 GCL,无法在 P4-PSFP 中定义。然而,由于硬件限制,P4-TAS 要求 tGCL 条目之间具有很小的间隔,而此时 \qty 2 的最小分辨率过大。第 V-B3 节对此作了进一步阐述。因此,我们采用一种称为 range-to-ternary conversion(范围到三值转换)[40] 的算法来提高时间片的分辨率。该算法允许使用多个三值条目来建模单个范围条目。
MAT、TNA、sGCL、tGCL 等缩写已保留;引用 [40] 和章节 V-B3 未改动。原文中的 `\qty 2`、`\qty 4` 缺少单位或上标上下文,可能是抽取残缺,例如可能对应不同时间单位或数量级;需结合论文 PDF 或公式排版人工核对。
该算法接收一个表示时间片的整数范围 [L, R],并将其分解为能够共同覆盖整个范围的、数量尽可能少的前缀集合。它通过反复选择从当前下界开始且仍完全位于该范围内的最大前缀来完成这一过程。这些被选中的块共同确保对区间的完整覆盖 [40]。图 7 给出了一些转换示例。图 7 中的每个块表示一个覆盖该范围部分内容的三值条目。∗ 表示一个“无关”位,意味着该位可以取 0 或 1 中的任一值。
原文中 `[ L, R ] [L,R]` 存在重复抽取,译文合并为 `[L, R]`;`∗ *` 也疑似符号重复,译文保留 `∗`。算法逻辑、引用 [40]、图 7、0/1 数值均准确。因输入存在重复识别痕迹,需人工核对原文排版。
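The greedy prefix decomposition described above can be sketched as follows. This is one common way to write range-to-ternary expansion and is not taken from the P4-TAS source code. The function names and the string representation of ternary entries ('0', '1', '*') are assumptions for illustration.

```python
def range_to_ternary(lo, hi, width):
    """Cover the inclusive range [lo, hi] of width-bit integers with ternary patterns."""
    if not 0 <= lo <= hi < (1 << width):
        raise ValueError("range must fit into the given bit width")
    patterns = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at the current lower bound ...
        size = (lo & -lo) if lo else (1 << width)
        # ... shrunk until it lies fully inside the remaining range.
        while size > hi - lo + 1:
            size >>= 1
        wildcard_bits = size.bit_length() - 1
        fixed_bits = width - wildcard_bits
        prefix = format(lo >> wildcard_bits, "0{}b".format(fixed_bits)) if fixed_bits else ""
        patterns.append(prefix + "*" * wildcard_bits)
        lo += size
    return patterns


def matches(pattern, value):
    """True if value matches the ternary pattern; '*' is a don't-care bit."""
    bits = format(value, "0{}b".format(len(pattern)))
    return all(p in ("*", b) for p, b in zip(pattern, bits))
```

For example, `range_to_ternary(1, 6, 3)` yields `['001', '01*', '10*', '110']`, so a single range entry costs four ternary entries, while a fully aligned range such as `[0, 15]` on 4 bits collapses to the single pattern `'****'`.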
在 GCL 中,时间片被定义为连续且不重叠的范围。在这些约束下,Sun [41] 已经证明该解既正确又唯一。
“consecutive, non-overlapping ranges”译为“连续且不重叠的范围”准确;Sun [41] 引用保留;逻辑关系清楚。未发现明显问题。
通过该算法,GCL 具有从 1 ns 到约 78 h 的分辨率。约 78 h 的上界(即 48 bit 纳秒时间戳的完整取值范围 \(2^{48}\) ns)远远超过 GCL 周期的要求,并且在 TSN 网络中并不必要。然而,完整的 48 bit 时间戳范围可用于匹配,而降低分辨率并没有好处。该转换算法为建模 GCL 所需的三值表条目数量将在第 V-C 节中评估。
GCL、TSN、三值表条目、章节 V-C 均处理准确;`\qty 1 \nano` 已规范为 1 ns。抽取文本中的 `\qty 78` 缺少单位,已依据 48 bit 纳秒时间戳的取值范围(\(2^{48}\) ns ≈ 78.2 h)补为 78 h,仍建议对照 PDF 核对。
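The upper bound can be checked with a short calculation. The only assumption is the one stated in the text: a 48-bit timestamp with nanosecond granularity that is matched in full.

```python
TIMESTAMP_BITS = 48                       # width of the TNA timestamp
span_ns = 2 ** TIMESTAMP_BITS             # number of distinct nanosecond values
span_hours = span_ns / 1e9 / 3600         # nanoseconds -> seconds -> hours
```

The result is roughly 78.2 hours, which is consistent with an upper bound of 78 when that figure is read in hours.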
在本节中,我们评估 P4-TAS 实现。首先,我们识别并量化内部延迟,包括流量生成器精度、队列打开延迟以及 TAS 控制帧延迟。接下来,我们从外部测量 tGCL 条目的持续时间,并引入门切换间隔(gate switching intervals,GSIs),以缓解由队列打开延迟导致的 tGCL 条目之间的过渡行为。然后,我们通过分析所支持的 tGCL 和 sGCL 条目数量,以及用于识别 DetNet 和 TSN 流的最大流数量,来评估 P4-TAS 的可扩展性。最后,我们将 P4-TAS 与现有 TAS 实现进行比较。
TAS、tGCL、sGCL、GSI、DetNet、TSN 等术语和缩写保留;评估流程顺序与原文一致;“transitional behavior”译为“过渡行为”准确。未发现明显问题。
大多数 TSN 调度方法假设交换机行为是理想的,并忽略内部延迟或抖动等实现特定效应。Stüber 等人 [8] 通过提出一种将此类不准确性纳入考虑的调度算法来处理这一问题。他们强调,在 TAS 配置中需要考虑硬件引起的可变性。虽然他们的工作侧重于调度层面的鲁棒性,但我们采取一种互补的方法,即在一个硬件实现中识别并量化未公开记录的内部延迟来源。这些发现可以支持设计更准确且更鲁棒的调度。
引用 [8] 保留;“implementation-specific effects”“hardware-induced variability”“scheduling-level robustness”等概念翻译准确;逻辑转折“While”已体现。未发现明显问题。
Franco 等人 [42] 对 Intel Tofino™ ASIC 的延迟行为进行了剖析。他们分析了解析深度和 MAT 复杂性等因素。然而,除处理延迟之外,TSN 网桥内部还存在通常不会被披露的其他延迟来源 [9]。我们在 Intel Tofino™ 2 平台上的 P4-TAS 实现中量化了其中若干延迟来源。虽然测量结果特定于 Intel Tofino™ 2 ASIC 上的 P4-TAS 实现,但这些延迟的来源也存在于其他硬件中 [9, 35]。
Intel Tofino™、ASIC、MAT、TSN 网桥等术语保留准确;引用 [42]、[9]、[9, 35] 未改动;“profile”译为“剖析”贴合技术语境。未发现明显问题。
首先,我们评估内部流量生成器的精度,该精度会影响周期完成帧的时序。随后,我们分析 AFC 机制的队列打开延迟。最后,我们测量由用于 TAS 控制帧的数据包生成器引入的延迟,并给出这些测量结果的总结。
AFC、TAS 保留;“period-completion frames”译为“周期完成帧”与上下文一致;三步顺序准确。未发现明显问题。
如第 IV-B1 节所述,P4-TAS 使用内部数据包生成器,以配置的周期 \(h\) 来指示每个 tGCL 周期的完成。每隔 \(h\) ns 生成一个周期完成帧,并将第 \(j\) 个周期的时间戳(记为 \(t^{h}_{j}\))存储在一个寄存器中。由于数据包生成器的限制,可能会出现较小的时序偏差。为了量化这一影响,我们测量连续周期完成帧的时间戳之间的差值,即 \(t^{h}_{j+1}\) 和 \(t^{h}_{j}\) 之间的差值,并将其相对于配置的周期 \(h\) 进行比较。偏差 \(\hat{\delta}_{\text{TG}}\) 在公式 1 中定义为:\[ \hat{\delta}_{\text{TG}} = (t^{h}_{j+1}-t^{h}_{j})-h. \tag{1} \]
第 IV-B1 节、tGCL、周期 h、j、\(t^{h}_{j}\)、\(\hat{\delta}_{\text{TG}}\) 与公式结构均已保留;原文中 `h h`、`j j`、公式重复片段明显为文本抽取重复,译文按数学含义去重。因输入公式和变量存在重复识别痕迹,需人工核对 PDF 中的公式排版。
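Equation 1 translates directly into code. The sketch below assumes a list of recorded period-completion timestamps in nanoseconds. The function name and the sample values are illustrative and are not measurement data.

```python
def tg_deviations(timestamps_ns, period_ns):
    """Deviation of each observed period from the configured period h (Equation 1)."""
    return [
        (t_next - t_cur) - period_ns
        for t_cur, t_next in zip(timestamps_ns, timestamps_ns[1:])
    ]


# Made-up timestamps for a configured period of 500,000 ns.
example_deviations = tg_deviations([0, 500_002, 999_999, 1_500_000], 500_000)
```

A positive entry means the observed period was longer than configured, a negative entry means it was shorter.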
该值作为时间序列记录在数据平面的一个寄存器中。基于 Stüber 等人 [43] 所识别的用例,我们选择了具有代表性的周期 \(h\):用于工厂自动化的 \(\qty{500}{}\)、用于工业等时流量的 \(\qty{2}{}\),以及用于航空航天应用的 \(\qty{128}{}\)。此外,我们还纳入 \(\qty{10}{}\)、\(\qty{499}{}\) 和 \(\qty{501}{}\),以分析边缘情况和伪影;并纳入 \(\qty{400}{}\),因为该周期用于第 V-B 节中的评估。对于每个周期,我们记录 16,000 个周期完成帧的时间戳。图 8 展示了结果。
术语“period-completion frames”译为“周期完成帧”,需结合全文确认是否已有固定译名;\(\qty{}\) 原文未显示单位,已保留符号形式。数字 500、2、128、10、499、501、400 和 16,000 均已保留。逻辑关系完整。
图 8 中的箱线图用红线表示中位数,用箱体边缘表示第一四分位数和第三四分位数,并用须线表示延伸到四分位距 1.5 倍的位置。该范围之外的值被绘制为离群值。正的 \(\hat{\delta}_{\text{TG}}\) 表示实际周期比配置值长 \(\hat{\delta}_{\text{TG}}\),而负值表示实际周期比配置值短相同的量。
“whiskers”译为“须线”符合箱线图术语;\(\hat{\delta}_{\text{TG}}\) 保留正确。正负偏差含义翻译完整。未发现明显问题。
大多数周期表现出的偏差低于 \(\hat{\delta}_{\text{TG}}=\qty{2}{}\),所有离群值都保持在 \(\pm \qty{11}{}\) 以内。例外出现在周期为 \(\qty{400}{}\) 和 \(\qty{500}{}\) 时,其分布范围更宽,但离群值更少。我们将其归因于 Intel Tofino™ 交换 ASIC 中数据包生成器的内部调度行为。将周期略微移动,例如移动到 \(\qty{499}{}\) 或 \(\qty{501}{}\),会得到与其他配置类似的偏差。
“wider spread with less outliers”译为“分布范围更宽,但离群值更少”;严格语法应为 fewer outliers,原意明确。Intel Tofino™、ASIC、\(\hat{\delta}_{\text{TG}}\)、数值均保留。未显示单位,已保留 \(\qty{}\)。
尽管这些偏差很小,它们仍可能影响周期性计算。如果一个周期完成帧到达较晚,则当前 GCL 周期内计算得到的相对位置可能超过周期 \(h\),这会索引到一个超出周期范围的条目。为确保所有帧都被分配到有效的 tGCL 条目,P4-TAS 会将任何计算得到的、满足 \(\geq h\) 的位置钳制到该周期的最后一个条目。相反,如果一个周期完成帧提前到达,则第 IV-B1 节中的周期性机制会在语义上按模 \(h\) 对位置求值,因此结果始终位于 \([0,h)\) 中。因此,相对于配置周期的偏差得到了补偿,所有帧都会映射到既有的 GCL 条目。
“clamps”译为“钳制”较技术化,也可译为“截断/限制”;此处含义是将越界值限制到最后条目。GCL、tGCL、\(h\)、\([0,h)\) 保留正确。逻辑中“晚到导致位置超过周期、早到通过模运算落入范围”已完整表达。
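The handling of late and early period-completion frames can be sketched as follows. The function name is hypothetical, and clamping to `period_ns - 1` is an assumption chosen so that the clamped position falls into the last entry of the cycle.

```python
def relative_position(ingress_ts_ns, last_period_start_ns, period_ns):
    """Position inside the cycle, clamped so it always indexes an existing GCL entry."""
    rel = ingress_ts_ns - last_period_start_ns
    # A late period-completion frame leaves the register stale, so rel can reach or exceed h.
    return min(rel, period_ns - 1)


# Late case: the completion frame for t = 400 has not arrived yet when a frame comes in at t = 402.
late = relative_position(402, 0, 400)
# Early case: the completion frame arrived at t = 395, so a frame at t = 397 is already
# counted as position 2 of the next cycle.
early = relative_position(397, 395, 400)
```

In both cases the result lies in the half-open interval from 0 to h, so the subsequent table lookup always hits a configured entry.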
在 TNA 中,从写入 AFC 值,也即发起队列状态改变,到硬件中队列状态实际更新之间,存在一个很小但非零的延迟 [44]。为量化 AFC 机制中的内部延迟,我们测量发出队列状态改变与 TSN 帧实际释放之间的时间。我们将该队列开启延迟记为 \(\hat{\delta}_{\text{queue}}\)。测量显示,在 TNA 中,队列开启延迟和关闭延迟以相同方式分布。该延迟会影响 TSN 精度,并且在可获得的硬件文档中很少被记录。测量过程在 P4-TAS 的数据平面中实现,如图 9 所示。
原文脚注内容夹在句中:“Measurements showed...” 已作为独立句译出。AFC、TNA、TSN、\(\hat{\delta}_{\text{queue}}\) 保留正确。“queue opening delay”译为“队列开启延迟”。脚注编号“1 1 1”疑似抽取噪声,未直译。
在图 9 中,首先用 TSN 帧填充一个关闭的队列(步骤 1)。当一个 TAS 控制帧匹配到打开该队列的 tGCL 条目时,它会通过 AFC 触发队列开启,并记录时间戳 \(t_{\text{change}}\)(步骤 2)。随后,使用离开队列的第一个 TSN 帧的出队时间戳 \(t_{\text{deq}}\),按照公式 2 计算 \(\hat{\delta}_{\text{queue}}\)(步骤 3):
“dequeuing timestamp”译为“出队时间戳”;TAS、tGCL、AFC、TSN 均保留。原文公式符号中出现 \(\text{\text{queue}}\) 双重 text 抽取形式,译文规范化为 \(\text{queue}\)。未发现明显问题。
\[ \hat{\delta}_{\text{queue}} = t_{\text{deq}} - t_{\text{change}}. \tag{2} \]
公式含义为队列开启延迟等于第一个 TSN 帧出队时间减去状态改变发起时间;符号已从原文重复抽取形式规范化。未发现明显问题。
对于所有观测到的转换,该值都作为时间序列存储在数据平面的一个寄存器中(步骤 4)。
“observed transitions”译为“观测到的转换”,结合上下文指队列状态转换。未发现明显问题。
用于该测量的 tGCL 被配置为八个连续条目,每个优先级对应一个条目。每个条目都会将相应的优先级队列打开 \(\qty{100}{}\),从而使调度依次循环经过全部八个优先级。TSN 流量使用 P4TG [45, 46, 47] 生成,速率为 \(\qty{400}{}\),优先级随机化,帧大小为 \(\qty{64}{}\)。这确保队列处于饱和状态。实验运行 \(\qty{60}{}\)。图 10 展示了测得的队列开启延迟 \(\hat{\delta}_{\text{queue}}\) 的互补累积分布函数(CCDF)。
\(\qty{100}{}\)、\(\qty{400}{}\)、\(\qty{64}{}\)、\(\qty{60}{}\) 原文未显示单位,已保留符号形式;“at \(\qty{400}{}\)”可能指速率或周期,需结合上下文确认单位和含义。“complementary cumulative distribution function”译为“互补累积分布函数”。引用 [45, 46, 47] 已保留。
大多数延迟低于 \(\hat{\delta}_{\text{queue}}=\qty{11}{ns}\),尾部延伸到 \(\qty{63}{ns}\),均值为 \(\mu(\hat{\delta}_{\text{queue}})=\qty{14.63}{ns}\)。这些结果揭示了很小但可测量的内部延迟。特别是,队列开启延迟可能在 tGCL 边界处造成过渡行为,即在下一个条目已经开始之后,来自前一个条目的帧仍可能短暂地继续传输。该效应的影响以及门切换间隔(gate switching intervals,GSIs)的作用在第 V-B3 节中评估。
\(\hat{\delta}_{\text{queue}}\)、\(\mu(\hat{\delta}_{\text{queue}})\)、11、63、14.63 均已保留;抽取文本中 \(\qty{}\) 的单位为空,已依据纳秒尺度测量的上下文补为 ns,仍建议对照原文确认。GSI 译为“门切换间隔”,与前文一致。逻辑完整。
对于 TAS 控制帧,内部数据包生成器被配置为每纳秒生成一个帧。这些帧按每批八个的方式顺序生成,其中每个帧控制八个优先级队列之一。然而,在实践中,无法每纳秒生成一个帧。相反,帧生成之间会出现一个小的延迟,这限制了能够触发队列状态更新的粒度。为了量化这一现象,我们在 P4-TAS 的数据平面中收集每个 TAS 控制帧的时间戳,并计算两个连续帧 i 和 i+1 之间的延迟 \(\hat{\delta}_{\text{control}}\):
术语“TAS 控制帧”“内部数据包生成器”“优先级队列”“数据平面”保持一致;数字“每纳秒”“八个”已保留;逻辑上说明理论配置与实践限制的转折已保留。公式符号 \(\hat{\delta}_{\text{control}}\)、i、i+1 已保留。未发现明显问题。
\[ \hat{\delta}_{\text{control}} = t_{i+1}-t_i. \tag{3} \]
公式保持原意,即连续两个帧时间戳之差;编号 (3) 已保留。未发现明显问题。
我们收集了 \(\hat{\delta}_{\text{control}}\) 的 100,000 个取值,这些值全部在数据平面中计算,并存储在一个时间序列寄存器中。所得直方图如图 11 所示。
数字 100,000、图 11 均已保留;“data plane”译为“数据平面”,“time series register”译为“时间序列寄存器”。未发现明显问题。
测得的中位数为 \(\hat{\delta}_{\text{control,M}}=\qty{9}{ns}\),只有少数帧表现出略高的延迟,最高可达 \(\qty{12}{ns}\)。因此,传输门状态只能每 \(\qty{9}{ns}\) 更新一次。由于帧按每批八个的方式顺序生成,不同优先级队列的更新会以 \(\qty{9}{ns}\) 为间隔依次偏移,不能同时发生。此外,这意味着同一优先级的传输门状态更新可以每 \(8\cdot\hat{\delta}_{\text{control}}\approx\qty{72}{ns}\) 触发一次。该值应被视为最坏情况下的上界。在实践中,如果一个控制帧恰好在计划的门状态变化之前到达,则有效延迟可以接近于零。只有当 tGCL 条目的分辨率处于 \(\qty{72}{ns}\) 量级时,这样的短延迟才重要,而这远小于典型的 tGCL 条目持续时间 [8]。
\(\qty{9}{}\)、\(\qty{12}{}\)、\(\qty{72}{}\) 的单位在抽取文本中缺失;由于数据包生成器被配置为每纳秒生成一个帧,且这些数值描述的是相邻控制帧之间的实际间隔,此处补为 ns,仍建议结合论文原图或 LaTeX 源确认。公式 \(8\cdot\hat{\delta}_{\text{control}}\approx\qty{72}{ns}\) 已保留;“worst-case upper bound”译为“最坏情况下的上界”。
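The relation between the control frame gap and the per-queue update interval can be made concrete with a small model. It assumes an idealised generator: a constant gap of 9 (taken to be nanoseconds), batches of eight frames, and the frame for queue 0 emitted at time 0. The function name is hypothetical.

```python
def next_update_lag_ns(change_ns, queue, gap_ns=9, queues=8):
    """Time from a scheduled gate change until the next control frame for that queue arrives."""
    cycle_ns = queues * gap_ns   # one queue is served once per batch: 8 * 9 = 72
    phase_ns = queue * gap_ns    # offset of this queue's frame inside the batch
    return (phase_ns - change_ns) % cycle_ns


# Lag for every possible position of a gate change relative to the batch, for queue 0.
worst_lag = max(next_update_lag_ns(c, 0) for c in range(72))
```

A change scheduled exactly when the matching control frame arrives is applied without lag, while a change scheduled one time unit later has to wait for almost a full batch, which is why the per-queue interval acts as an upper bound.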
表 I 概述了在最佳情况和最坏情况下识别并测量得到的内部延迟。
表号 Table I 已译为“表 I”;“identified and measured internal delays”译为“识别并测量得到的内部延迟”。未发现明显问题。
这些内部延迟会累积为公式 4 中所示的 \(\Delta_{\text{internal}}\):
逻辑“内部延迟累积”已保留;符号 \(\Delta_{\text{internal}}\) 与公式编号 4 已保留。未发现明显问题。
\[ \Delta_{\text{internal}}=\delta_{\text{TG}}+\delta_{\text{queue}}+\delta_{\text{control}}. \tag{4} \]
公式中的三项 \(\delta_{\text{TG}}\)、\(\delta_{\text{queue}}\)、\(\delta_{\text{control}}\) 已完整保留;编号 (4) 已保留。未发现明显问题。
内部延迟 \(\Delta_{\text{internal}}\) 可能会缩短或延长一个 tGCL 条目的持续时间。图 12 以三个连续的、配置持续时间为 d 的 tGCL 条目说明了这一影响。
“reduce or extend”译为“缩短或延长”;三个连续 tGCL 条目与配置持续时间 d 已保留。未发现明显问题。
如果前一个 tGCL 条目 \(i-1\) 经历了负的内部延迟,则它会被缩短,而 tGCL 条目 i 会被延长。此外,tGCL 条目 i 本身也可能经历正延迟。在这种情况下,tGCL 条目 i 的实际持续时间变为
条目 \(i-1\) 与 i 的关系已保留;负内部延迟导致前一条目缩短、当前条目延长的逻辑已保留;段落以公式引出,语义完整依赖下一段公式。未发现明显问题。
\[ \hat{d}_i=d+|\Delta^{i-1}_{\text{internal}}|+\Delta^i_{\text{internal}}. \tag{5} \]
公式中的实际持续时间 \(\hat{d}_i\)、配置持续时间 d、前一条目的内部延迟绝对值 \(|\Delta^{i-1}_{\text{internal}}|\)、当前条目的内部延迟 \(\Delta^i_{\text{internal}}\) 均已保留;编号 (5) 已保留。未发现明显问题。
在最坏情况下,\(\Delta^{i}_{\text{internal}}\) 由最大流量发生器偏差、队列开启延迟以及控制流量延迟组成:\(\Delta^{i}_{\text{internal},\max}=\qty{11}{ns}+\qty{63}{ns}+\qty{12}{ns}=\qty{86}{ns}\)。此外,如果流量发生器偏差为负且所有其他延迟都接近于零,则 \(\Delta^{i-1}_{\text{internal}}\) 可以为负,从而最多产生 \(\qty{11}{ns}\) 的缩短。这意味着,一个 tGCL 条目最多可能被延长 \(\qty{86}{ns}\),或者被缩短 \(\qty{11}{ns}\)。进一步地,通过连续 tGCL 条目之间的相关性,一个 tGCL 条目最多可能被延长 \(\qty{97}{ns}\),如图 12 所示。
数字 11、63、12、86、11、97 均已保留;公式与上下标含义已保留。抽取文本中 `\qty{}` 的单位为空,已依据纳秒尺度测量的上下文补为 ns,仍建议结合论文原文确认。
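The worst-case arithmetic can be replayed in code. The three maxima are the values quoted in the text, taken here to be nanoseconds. The names and the configured duration of 50,000 are illustrative assumptions.

```python
# Worst-case contributions quoted in the text (assumed unit: ns).
WORST_CASE_NS = {"traffic_generator": 11, "queue_opening": 63, "control_frame": 12}


def actual_duration_ns(configured_ns, delta_prev_ns, delta_cur_ns):
    """Equation 5: a negative delay of entry i-1 and a positive delay of entry i both lengthen entry i."""
    return configured_ns + abs(delta_prev_ns) + delta_cur_ns


delta_internal_max = sum(WORST_CASE_NS.values())                      # 11 + 63 + 12
max_extension = actual_duration_ns(50_000, -11, delta_internal_max) - 50_000
```

The sum reproduces the bound of 86, and adding the largest shortening of the previous entry gives the maximum extension of 97.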
在最佳情况下,一个 TAS 控制流量帧正好在切换到新的 tGCL 条目的切换点到达,从而使控制流量延迟为 \(\delta_{\text{control},\min}=\qty{0}{ns}\)。结合测得的最佳情况队列延迟 \(\delta_{\text{queue},\min}=\qty{1}{ns}\),以及流量发生器精度 \(\delta_{\text{TG,min}}=0\),最佳情况内部延迟为 \(\Delta_{\text{internal},\min}=\qty{1}{ns}\)。
术语 TAS、tGCL、控制流量延迟、队列延迟、流量发生器精度均保持一致;数值 0、1、0、1 已保留。抽取文本中 `\qty{}` 的单位为空,已依据纳秒尺度测量的上下文补为 ns,仍建议人工对照原文排版确认。
因此,这些数值定义了 tGCL 条目持续时间偏差的理论界限。下面的评估章节将考察此类偏差在实践中出现的频率以及程度。
逻辑关系“therefore”已译出;“how often and to what extent”已分别译为“频率以及程度”。未发现明显问题。
P4-TAS 支持以纳秒粒度配置 tGCL 及其周期。然而,第 V-A 节中表征的内部延迟可能会在 tGCL 条目的配置持续时间与实际持续时间之间引入偏差。本节通过比较预期持续时间与在数据平面中观察到的测量持续时间,来评估所配置 tGCL 条目的准确性。首先,我们介绍测试床并描述测量流程。随后,我们分析结果,并引入门切换间隔(gate switching intervals,GSIs)以提高定时准确性。
“nanosecond granularity”译为“纳秒粒度”;“data plane”译为“数据平面”;GSIs 缩写和全称均已保留。未发现明显问题。
用于外部 tGCL 条目测量的测试床如图 13 所示。
“external tGCL entry measurement”译为“外部 tGCL 条目测量”;图号 13 已保留。未发现明显问题。
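Traffic is generated by P4TG [45, 46, 47] at a rate of \(\qty{514}{Mpps}\) using minimum-size \(\qty{64}{B}\) frames with constant inter-arrival times, i.e., without bursts. Each frame is assigned a random priority sampled from a uniform distribution and is encapsulated with MPLS to validate the DetNet translation. A tGCL with a period of \(\qty{400}{}\) is configured in P4-TAS and divided into eight \(\qty{50}{}\) entries. During each entry, only one of the eight queues is open, corresponding to one priority. Incoming MPLS traffic is translated into TSN streams, and the configured tGCL is then applied based on the resulting TSN stream identifier. After shaping by the TAS, traffic is forwarded to a third Tofino™ switch, which records frame arrival times per priority in a dedicated P4 program.
The P4TG citations [45, 46, 47], 514 Mpps, 64, 400, the eight entries of 50, MPLS, DetNet, TSN, TAS, and Tofino™ are preserved. The 64 is given in bytes, the minimum Ethernet frame size. The units of `\qty{400}{}` and `\qty{50}{}` are empty in the input and are presumably time units; confirm against the original.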
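The measurement procedure in the dedicated P4 program on the third switch is based on detecting changes of priority in the received stream. It assumes that, during each tGCL entry, only frames of a single priority \(\pi\in\{0,\ldots,7\}\) arrive at the measurement switch, as configured in P4-TAS. The timestamps of the first and the last frame within a tGCL entry, i.e., of a run of frames with the same priority, are collected as a series and stored in the data plane. Figure 14 illustrates this.
The priority set \(\pi\in\{0,\ldots,7\}\) is preserved, as is the qualification "first and last frame in a tGCL entry, i.e., of the same priority". No obvious issues found.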
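For priority \(\pi=0\), the arrival time of the first frame in the \(i\)-th tGCL entry is stored as \(t^{i,\pi=0}_{\text{first}}\). When the next priority \(\pi=1\) appears, the arrival time of the last frame of the previous priority \(\pi=0\) is stored as \(t^{i,\pi=0}_{\text{last}}\), and the new frame marks \(t^{i+1,\pi=1}_{\text{first}}\). This is shown as step 1 in Figure 14. The control plane computes the duration of entry \(i\) of priority \(\pi\) as
The symbols \(\pi=0\), \(\pi=1\), \(t^{i,\pi=0}_{\text{first}}\), \(t^{i,\pi=0}_{\text{last}}\), and \(t^{i+1,\pi=1}_{\text{first}}\) are preserved; the duplicated "i i -th" in the input is an extraction artifact and is rendered as "the \(i\)-th". No obvious issues found.
\[ \hat{d}_{i}^{\pi}=t^{i,\pi}_{\text{last}}-t^{i,\pi}_{\text{first}}. \tag{6} \]
The equation number (6), the estimated duration \(\hat{d}_{i}^{\pi}\), and the difference between the last and first frame timestamps are preserved. No obvious issues found.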
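Then, the measured tGCL entry duration is compared to the configured tGCL entry duration \(d=\qty{50}{}\), which yields the deviation \(\hat{\delta}^{i,\pi}_{\text{slice}}\) as
The value \(d=\qty{50}{}\) and the deviation symbol \(\hat{\delta}^{i,\pi}_{\text{slice}}\) are preserved; the unit of \(d\) is empty in the input. The paragraph leads into Equation (7), which follows.
\[ \hat{\delta}^{i,\pi}_{\text{slice}}=\hat{d}_{i}^{\pi}-d. \tag{7} \]
The equation symbols are preserved as in the original; \(\hat{d}_{i}^{\pi}\) is defined in Equation (6) and \(d\) in the preceding paragraph. No obvious issues found.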
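Thus, negative values of \(\hat{\delta}^{i,\pi}_{\text{slice}}\) indicate that the measured tGCL entry duration is shorter than configured, while positive values indicate that it is longer. In total, 32,764 values of \(\hat{\delta}^{i,\pi}_{\text{slice}}\) were collected.
The meaning of the sign is consistent with \(\hat{d}_{i}^{\pi}-d\); the number 32,764 and the abbreviation tGCL are preserved. No obvious issues found.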
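The measurement procedure and Equations (6) and (7) can be sketched in a few lines. The example below is a software illustration only, not the P4 program running on the measurement switch; the function name and the input format, a list of `(arrival_time, priority)` pairs in arrival order, are assumptions made for this sketch.

```python
def entry_deviations(frames, d):
    """Return (priority, deviation) for every completed run of same-priority frames.

    frames: iterable of (arrival_time, priority) in arrival order.
    d: configured tGCL entry duration, in the same time unit as arrival_time.
    """
    deviations = []
    current = first = last = None
    for t, prio in frames:
        if prio != current:
            # A priority change closes the previous entry: Equations (6) and (7).
            if current is not None:
                deviations.append((current, (last - first) - d))
            current, first = prio, t
        last = t
    # The final run is not reported because no priority change has closed it yet.
    return deviations


frames = [(0, 0), (10, 0), (48, 0), (52, 1), (60, 1), (101, 1), (105, 2)]
print(entry_deviations(frames, d=50))
```

As in the text, a negative deviation means the measured entry was shorter than configured, and a positive one means it was longer.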
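The internal queue opening and closing delays identified in Section V-A2 cause queue state transitions to occur over a short interval rather than instantaneously. This can lead to transitional behavior in which the queue of one tGCL entry has not yet closed while the queue of the next tGCL entry has already started forwarding. As a result, frames from two tGCL entries are transmitted at the same time, which violates the configured tGCL. This overlap is a phenomenon of P4-TAS, not a measurement artifact, and must be addressed. To mitigate this effect, we introduce gate switching intervals (GSIs), as illustrated in Figure 15.
The terms "measurement artifact" and "opening/closing delay" are preserved, the causal chain is complete, and the abbreviation GSI and Figure 15 are retained. No obvious issues found.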
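GSIs are short, explicit tGCL entries during which all queues are closed. They are inserted between tGCL entries. These GSIs suppress transitional forwarding behavior and isolate each tGCL entry. We configured GSIs of \(\qty{30}{ns}\), which is sufficient to eliminate overlaps without significantly reducing the available transmission time. Although the worst-case queue opening delay measured in Section V-A reaches \(\qty{63}{ns}\), a GSI of \(\qty{30}{ns}\) provides sufficient isolation because the GSI itself is subject to the same internal delays. This effectively extends the duration of the GSI and ensures that queue state transitions complete before the next scheduled entry begins. Larger GSIs did not improve the results.
The original `\qty 30` and `\qty 63` lacked units; they are given as ns here because 63 is the queue opening delay component of \(\Delta_{\text{internal, max}} = 86\ \text{ns}\) and the GSI is compared directly against it. Confirm against the figures. The reasoning that the GSI is subject to the same delays and is thereby effectively extended is preserved.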
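To make the role of GSIs concrete, the sketch below builds a tGCL in which each entry opens exactly one queue and is followed by an all-closed GSI. It rests on two assumptions that the text does not confirm: the GSI is appended after each entry, which lengthens the cycle, and the gate vector is encoded as an integer bit mask. The function name and the duration values are placeholders.

```python
def build_tgcl(num_queues, entry_duration, gsi_duration):
    """Return a list of (duration, gate_mask); bit q of gate_mask set means queue q is open."""
    tgcl = []
    for q in range(num_queues):
        tgcl.append((entry_duration, 1 << q))   # only queue q is open
        if gsi_duration > 0:
            tgcl.append((gsi_duration, 0))      # GSI: all queues closed
    return tgcl


# Placeholder durations in arbitrary time units (hypothetical values).
for duration, mask in build_tgcl(num_queues=8, entry_duration=1000, gsi_duration=30):
    print(f"{duration:>5}  {mask:08b}")
```

With this layout no two entries with open gates are adjacent, so a late queue close always falls into a window in which no other queue is scheduled to open.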
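First, we measured the deviation of the observed values from the configured duration of tGCL entries without GSIs. The resulting statistics are inconsistent, with a mean deviation of \(\mu(\hat{\delta}_{\text{slice}})=\qty{-22.8}{}\) over all measurements and a median of \(\qty{450}{}\). These apparent deviations are not meaningful because consecutive entries frequently overlap at their boundaries, as explained in Section V-B3.
The mean of -22.8 and the median of 450 are preserved; both `\qty` values lack units in the input, and the unit must be confirmed from context. Because the units are missing and the statistics look anomalous, manual review is recommended.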
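With GSIs, the deviation of the measured values from the configured duration of tGCL entries is shown in Figure 16. The metric is computed per priority, i.e., as \(\hat{\delta}^{i,\pi}_{\text{slice}}\) for each \(\pi\) and entry \(i\); since all priorities behave identically, the histogram aggregates all priorities.
The phrase "per priority", the symbols \(\pi\), \(i\), and \(\hat{\delta}^{i,\pi}_{\text{slice}}\), and Figure 16 are preserved. No obvious issues found.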
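The measured distribution in Figure 16 shows two dominant modes, one around \(\qty{-60}{ns}\) and one around \(\qty{30}{ns}\), separated by a valley near the median of about \(\qty{-19}{ns}\). The bimodal distribution results from how the delays of consecutive entries interact. A larger delay at a boundary makes the current tGCL entry longer than configured, which forms the positive cluster. The subsequent tGCL entry then starts late and becomes shorter, which forms the negative cluster. A deviation of exactly zero is unlikely because it would require two consecutive delays to be nearly identical, which is rare in practice. The overall median is slightly negative, reflecting that shortened entries occur slightly more often. The minimum of \(\qty{-239}{ns}\) represents a rare worst case in which one tGCL entry is extended by close to the maximum possible delay while the adjacent tGCL entry experiences no delay and is therefore shortened by the same amount.
The values -60 ns, 30 ns, and -239 ns are preserved. The median `\qty -19` lacked a unit in the input and is given as ns because it lies between two modes stated in ns; confirm against the original. The causal explanation of the bimodal distribution is preserved in full.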
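Scalability is a key aspect of TSN and DetNet deployments, which typically involve a large number of scheduled traffic streams. However, many scheduling algorithms neglect hardware resource constraints such as limited MAT capacity [8]. In this section, we evaluate the scalability of our P4-TAS implementation by analyzing the number of supported tGCL and sGCL entries and the number of streams for DetNet and TSN stream identification.
TSN, DetNet, MAT, tGCL, and sGCL are retained, as is citation [8]. No obvious issues found.
Many TSN scheduling algorithms assume an unlimited number of GCL entries [8]. Real hardware, however, imposes strict limits due to finite memory resources, and exceeding these limits can render a schedule undeployable. We therefore evaluate how many tGCL and sGCL entries can be stored in the proposed P4-TAS implementation. First, we analyze how many MAT entries each GCL entry requires. Then, we describe the GCL sizes available in P4-TAS.
Citation [8] and the term "undeployable" are preserved; "GCL sizes" could also be read as GCL capacity, without any apparent risk to the meaning. No obvious issues found.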
[Table II residue: column headers "Internal delay" and "No. of streams"; recovered cells \(\Delta_{\text{internal, max}} = 86\ \text{ns}\), "Predict6G open-source TSN platform [25]", and "10k entries". The assignment of these cells to rows and columns is not recoverable from the extracted text.]
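In P4-TAS, the tGCL is modeled as a MAT in the egress P4 control block that matches on the relative timestamp and on one of the eight queues. Each tGCL entry therefore requires eight range MAT entries, one per gate. The sGCL is modeled as a MAT in the ingress P4 control block that matches on the relative timestamp and on the stream gate identifier. An sGCL entry therefore requires only a single range MAT entry.
The terms tGCL, sGCL, MAT, egress/ingress P4 control block, relative timestamp, and stream gate identifier are preserved; the quantities "eight queues", "eight range MAT entries", "one for each gate", and "single range MAT entry" show no obvious issues.
Because range matching is limited in the TNA, a range-to-ternary conversion is used to match on relative timestamps. The conversion algorithm described in Section IV-C2 replaces a single range MAT entry with multiple ternary MAT entries in order to increase the resolution of matched timestamps. However, this approach also increases the number of required MAT entries.
The terms TNA, range matching, range-to-ternary conversion, and ternary MAT entries are preserved; the causal and contrastive relations show no obvious issues.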
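Let \(w\) denote the number of bits used to represent the range. Gupta et al. [40] show that a range of width \(w\) bits can be converted into at most \(2 \cdot w - 2\) ternary entries. In the worst case, modeling a tGCL with \(n\) tGCL entries therefore yields up to \(8 \cdot n \cdot (2 \cdot w - 2)\) ternary MAT entries, while modeling an sGCL with \(n\) entries yields up to \(n \cdot (2 \cdot w - 2)\) ternary MAT entries. In practice, the actual number is usually significantly lower due to favorable alignment. For example, a single ternary entry suffices if the width of the range is a power of two and the range is properly aligned. The tGCL configuration from Section V-B, with a period of \(\qty{400}{}\) divided into eight tGCL entries of \(\qty{50}{}\) and additional GSIs of \(\qty{30}{ns}\), requires 1512 ternary MAT entries. Many works have proposed optimized range-to-ternary conversion algorithms that aim to reduce the number of ternary entries [49, 50, 51, 52]; these can be explored in future work.
The duplicated "w w", "2 ⋅ w − 2 2\cdot w-2", and "n n" in the input are extraction artifacts and are de-duplicated according to their mathematical meaning; the expressions \(2 \cdot w - 2\), \(8 \cdot n \cdot (2 \cdot w - 2)\), and \(n \cdot (2 \cdot w - 2)\) are preserved. The units of `\qty 400` and `\qty 50` are missing in the input; the GSI is given as ns by analogy with the queue opening delay. The original phrase "a sGCL entry in \(n \cdot ...\)" is read as an sGCL containing \(n\) entries; confirm against the PDF.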
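The \(2 \cdot w - 2\) bound and the single-entry case for aligned power-of-two ranges can be reproduced with the textbook prefix-expansion technique. The sketch below uses that technique for illustration; it is not necessarily the conversion algorithm of Section IV-C2, and the function name and the `(value, mask)` encoding, in which a mask bit of 1 means the bit must match, are assumptions.

```python
def range_to_ternary(lo, hi, w):
    """Cover the inclusive range [lo, hi] of w-bit values with ternary (value, mask) entries."""
    full = (1 << w) - 1
    entries = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo ...
        size = (lo & -lo) if lo else (1 << w)
        # ... and still fits into the remaining range.
        while size > hi - lo + 1:
            size >>= 1
        entries.append((lo, full & ~(size - 1)))
        lo += size
    return entries


def matches(entries, x):
    """Ternary lookup: x hits if any entry agrees with it on all masked bits."""
    return any((x & mask) == value for value, mask in entries)


w = 4
print(len(range_to_ternary(1, 14, w)))   # worst case for w = 4
print(range_to_ternary(8, 15, w))        # aligned power-of-two range
```

For a tGCL, each of the \(n\) entries would need such a cover once per queue, which is where the factor of eight in the worst-case bound comes from.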
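The Intel Tofino™ 2 can generate up to 16 different periodic streams. Since one of them is required for the continuous TAS control traffic, P4-TAS can configure 15 streams for period-completion frames. Consequently, 15 different GCL periods can be configured, and they can be shared between PSFP and TAS.
Intel Tofino™ 2, TAS, P4-TAS, GCL, and PSFP are retained; the quantities 16, 1, 15, and 15 are consistent; the term "period-completion frames" should be kept consistent with the glossary used throughout.
The tGCL MAT in P4-TAS holds 39,000 MAT entries, as determined by the available hardware resources. The MAT is therefore large enough to accommodate multiple tGCLs. We increased the size of the stream gate MAT for PSFP from 2048 MAT entries in P4-PSFP [6] to 6000 MAT entries. This is possible because the implementation was ported to the Tofino™ 2 ASIC, which has more resources available.
The numbers 39,000, 2048, and 6000 are preserved; the terms tGCL MAT, stream gate MAT, P4-PSFP, and Tofino™ 2 ASIC show no obvious issues; the causal relation "more resources available" is rendered accurately.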
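The ternary match operates on a \(\qty{48}{bit}\) timestamp, which covers time ranges of up to \(\qty{78}{h}\). Although such a range exceeds practical needs, reducing the number of matched bits has no effect due to internal hardware alignment.
The original `\qty 48bit` and `\qty 78h` are extraction remnants of LaTeX unit macros and are restored as 48 bit and 78 h. The original wording "resolutions of up to \(\qty 78h\)" is read as the time range that can be covered, about 78 hours, rather than a resolution; confirm against the context. The hardware alignment argument is preserved.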
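The 78 h figure is consistent with a 48-bit counter that advances once per nanosecond. The short check below makes that arithmetic explicit; the one-tick-per-nanosecond resolution is an assumption based on the nanosecond granularity stated for P4-TAS.

```python
TIMESTAMP_BITS = 48
NS_PER_HOUR = 3_600 * 10**9          # 3600 s per hour, 1e9 ns per second

# Number of distinct timestamp values, interpreted as nanoseconds (assumed tick length).
range_ns = 1 << TIMESTAMP_BITS
range_hours = range_ns / NS_PER_HOUR

print(f"{range_hours:.2f} h")
```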
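These resource limits show that P4-TAS can support multiple tGCLs and sGCLs, which ensures that realistic TSN schedules are deployable.
The terms tGCL, sGCL, and TSN are retained; "resource limits" fits the context. No obvious issues found.
To evaluate the scalability of stream identification in our implementation, we analyze the structure and capacity of the MAT used for DetNet and TSN streams.
The abbreviation MAT and the terms DetNet and TSN are retained; the logic is that scalability is evaluated by analyzing structure and capacity. No obvious issues found.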
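A single MAT handles both DetNet and TSN stream identification. It uses ternary keys composed of the S-Label for DetNet flows and of the Ethernet destination address, the VLAN ID, the IPv4 source address, and the IPv4 destination address for TSN streams [4]. Ternary matching enables wildcarding and aggregation. For example, an entry that matches only on the S-Label enables DetNet-to-TSN translation, while another entry that matches on the MAC destination address and the VLAN ID supports TSN-to-DetNet translation or TSN stream identification.
The terms "ternary keys/matches", S-Label, VLAN ID, IPv4, and MAC are retained, as is citation [4]; the directions DetNet-to-TSN and TSN-to-DetNet in the two examples are not swapped. No obvious issues found.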
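The effect of wildcarding in a shared identification table can be illustrated with a small software model. The sketch below simplifies bit-level ternary matching to field-level wildcards, where `None` stands for "don't care"; the class, field names, and example values are hypothetical and do not reflect the table layout of the P4-TAS source code.

```python
from dataclasses import dataclass
from typing import Optional

KEY_FIELDS = ("s_label", "dst_mac", "vlan_id", "ipv4_src", "ipv4_dst")


@dataclass(frozen=True)
class StreamRule:
    stream_id: int
    s_label: Optional[int] = None     # None means wildcard
    dst_mac: Optional[str] = None
    vlan_id: Optional[int] = None
    ipv4_src: Optional[str] = None
    ipv4_dst: Optional[str] = None

    def matches(self, pkt: dict) -> bool:
        """A rule matches if every non-wildcard field equals the packet's field."""
        return all(
            getattr(self, f) is None or getattr(self, f) == pkt.get(f)
            for f in KEY_FIELDS
        )


def identify(rules, pkt):
    """Return the stream_id of the first matching rule, or None."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.stream_id
    return None


rules = [
    StreamRule(stream_id=1, s_label=1001),                             # DetNet flow by S-Label only
    StreamRule(stream_id=2, dst_mac="00:11:22:33:44:55", vlan_id=42),  # TSN stream by MAC and VLAN
]
print(identify(rules, {"s_label": 1001}))
print(identify(rules, {"dst_mac": "00:11:22:33:44:55", "vlan_id": 42}))
```

Because unused fields are wildcarded, both kinds of rule can share one table, which is the property the paragraph above relies on.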
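The MAT supports 8196 entries, which allows at least 8196 DetNet or TSN streams to be identified. With IP-based identification, ternary aggregation can further increase the number of identifiable streams. A survey by Stüber et al. [10] reports deployments with up to 10,812 streams, which indicates that our implementation can support realistic industrial-scale scenarios when wildcarding is used appropriately. These results show that the design is scalable and supports the number of streams typical for TSN/DetNet deployments.
The numbers 8196 and 10,812 are preserved, as are "at least", "with appropriate use of wildcarding", and citation [10]. No obvious issues found.
In this section, we summarize and compare the capabilities of several TAS-capable platforms, including our P4-TAS prototype on a P4-programmable ASIC. An overview is given in Table II.
The terms "TAS-capable platforms" and "P4-programmable ASIC" and the reference to Table II are preserved. No obvious issues found.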
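Similar to P4-TAS, the Predict6G open-source platform provides TAS and DetNet integration, but its documentation does not specify the configurable time resolution, the internal delay behavior, or the scalability [25]. Commercial platforms such as NXP's SJA1105TEL [33, 34], Microchip's SparX-5i family [35] and PD-IES008 [36, 37], and Relyum's RELY-TSN12 [48] provide hardware support for TAS and PSFP. However, publicly available specifications typically stop at time granularity, number of queues, or GCL size, and omit the sources of internal delay that ultimately determine scheduling precision. Although these devices advertise nanosecond-level configuration granularity, our evaluation of P4-TAS shows that actual gate updates are constrained by internal delays in the range of tens of nanoseconds. Eppler et al. [9] report internal TAS delays of about \(\qty{2.6}{\micro\second}\) in a Relyum switch, which further supports this observation. This indicates that internal TAS timing effects are inherent to the mechanism itself, are not specific to the P4 implementation, and may be significantly larger in proprietary devices. Since such delays are rarely documented, the effective precision of commercial solutions is difficult to assess from datasheets alone. Moreover, these delays are internal and cannot be measured in commercial black-box switches.
Vendors, model names, and citations are preserved, as are the terms TAS, DetNet, PSFP, and GCL, "nanosecond-level configuration granularity", and "tens of nanoseconds". The original `\qty 2.6 \micro` is an incomplete extraction of a LaTeX quantity and is restored as 2.6 µs, consistent with the microsecond-scale delays attributed to [9] in the conclusion; confirm manually.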
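Our P4-TAS prototype achieves a comparable time configuration granularity of \(\qty{1}{ns}\) and additionally documents measured internal delays. Specifically, we observed a worst-case internal delay of \(\Delta_{\text{internal,max}}=\qty{86}{ns}\) for tGCL entries in our evaluation. In contrast, vendor platforms either do not document such values, e.g., NXP SJA1105TEL and Microchip PD-IES008, or disclose only partial information, e.g., SparX-5i, which specifies a queue opening delay of \(\delta_{\text{queue}}=\qty{512}{}\). The transparency of P4-TAS allows for a more realistic assessment of the achievable scheduling precision. In terms of scalability, P4-TAS supports a larger number of streams (≥ 8196) and larger GCLs (39k for TAS, 6k for PSFP) than commercial platforms. For GCL entries, the overhead of the range-to-ternary conversion must be taken into account, which we evaluated in Section V-C1.
The numbers 1, 86, 512, ≥ 8196, 39k, and 6k and the reference to Section V-C1 are preserved. The input contained duplicated and unit-less quantities (`\qty 1`, `\qty 86 ... \qty{86}{}`, `\qty 512 ... \qty{512}{}`); the duplicates are merged, 1 and 86 are given as ns based on the stated nanosecond granularity and on \(\Delta_{\text{internal, max}} = 86\ \text{ns}\), and the unit of 512 remains open. Confirm against the paper PDF.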
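Another difference is line-rate throughput. While most commercial TSN-capable switch ASICs target 1–25 Gb/s per port for automotive and industrial domains, P4-TAS operates at up to 400 Gb/s per port. This makes it applicable not only to TSN deployments but also to high-speed data center environments, where integration with DetNet becomes relevant. P4-TAS thus extends the design space beyond today's embedded and industrial use cases.
The rates 1–25 Gb/s and 400 Gb/s and the term "line-rate throughput" are preserved. No obvious issues found.
Overall, the comparison shows that commercial hardware already supports TAS and PSFP functionality, but vendors rarely disclose its internal timing behavior. This lack of transparency makes it difficult to design schedules with nanosecond precision. P4-TAS closes this gap by explicitly characterizing internal delays, which makes the use of TSN more predictable and more transparent.
TAS, PSFP, and TSN are retained; the causal logic is that missing disclosure hinders nanosecond-level scheduling and that P4-TAS compensates by characterizing internal delays. No obvious issues found.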
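We presented P4-TAS, a P4-based implementation of the Time-Aware Shaper (TAS) for TSN on the Intel Tofino™ 2. To implement the periodicity of tGCLs, we leveraged a mechanism for periodic behavior in P4 switches that uses the internal packet generator as a clock source. Building on this, we introduced a precise queue state control mechanism for the TAS that uses internally generated, continuous TAS control traffic. P4-TAS also incorporates PSFP, where we improved our earlier P4-PSFP design by eliminating recirculation and by increasing the GCL time resolution to the nanosecond scale with a range-to-ternary algorithm. In addition, P4-TAS includes an MPLS/TSN translation layer that allows TSN traffic shaping and policing to be applied to DetNet flows at line rates of up to \(\qty{400}{Gb/s}\). Beyond functional capabilities, our implementation provides transparent insight into internal timing behavior, which is rarely documented for commercial platforms.
Intel Tofino™ 2, TAS, tGCL, PSFP, P4-PSFP, GCL, MPLS/TSN, and DetNet are retained, as are "recirculation" and "traffic shaping and policing". The original `up to \qty 400` lacked a unit and is given as 400 Gb/s, consistent with the line rate stated in the introduction and in the comparison section.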
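Our evaluation covered three aspects. First, we identified and quantified undocumented sources of internal delay, including traffic generator inaccuracy, queue opening delay, and TAS control frame delay. We identified a theoretical worst-case cumulative delay of about \(\qty{86}{ns}\) for a tGCL entry, which is orders of magnitude smaller than the microsecond-scale gate transition delays reported for some commercial TSN switches [9]. Second, we measured the duration of tGCL entries externally and compared it to the configured duration. In doing so, we found that internal queue opening delays cause transitional behavior in which the queue of one tGCL entry has not yet closed while the queue of the next tGCL entry has already started forwarding. We therefore introduced gate switching intervals (GSIs), i.e., short, explicit tGCL entries during which all queues are closed, to mitigate this effect. Third, we analyzed scalability and demonstrated support for 39,000 tGCL entries and more than 8,196 streams, which covers the requirements of current industrial deployments.
The terms tGCL, TAS, TSN, and GSI, the numbers 39,000 and 8,196, and citation [9] are preserved. The original `\qty 86` lacked a unit and is given as ns, consistent with \(\Delta_{\text{internal, max}} = 86\ \text{ns}\) in the platform comparison.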
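Compared to existing ASIC-based and FPGA-based TSN platforms, P4-TAS offers similar configurability in terms of time granularity but additionally exposes the internal delays that directly affect scheduling precision. This transparency allows schedules to be designed with knowledge of the deviations introduced by the hardware, which is not possible with today's black-box hardware. Moreover, P4-TAS supports line rates of up to \(\qty{400}{Gb/s}\) per port and seamless DetNet/TSN translation, which extends its applicability from industrial and in-vehicle networks to high-throughput environments such as data centers and carrier backbones.
The abbreviations ASIC, FPGA, TSN, and DetNet are retained; the contrast between "similar configurability" and "additionally exposes internal delays" is preserved. The original `\qty 400Gb/s` lacked braces but clearly denotes 400 Gb/s; the missing full stop at the end of the original paragraph does not affect the meaning.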
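Future work will focus on improving scalability, e.g., by optimizing the use of range-to-ternary conversion, and on incorporating the delay characterization into scheduling algorithms to make them more robust against hardware-level variability. In addition, we will explore integrating P4-TAS into a PTP-synchronized multi-hop TSN testbed to validate gate scheduling and latency guarantees under realistic TSN-specific traffic patterns.
The abbreviations PTP, TSN, and P4-TAS are retained; "range-to-ternary" is a specific term for mapping ranges to TCAM rules and is kept as is. The two future work items, improving scalability and incorporating delay characterization for robustness, are kept distinct. No obvious issues found.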
Time-critical applications in industrial automation and automotive systems rely on networks that provide deterministic guarantees such as low latency, minimal jitter, and virtually zero packet loss. To meet these stringent requirements, two complementary technologies have emerged: Time-Sensitive Networking (TSN) and Deterministic Networking (DetNet). TSN is a suite of IEEE 802.1 standards that enhances Ethernet to support real-time communication by introducing mechanisms for traffic shaping [ 1, 2, 3 ] and reliability [ 4 ]. In contrast, DetNet is a Layer 3 technology standardized by the IETF that extends these capabilities to routed networks by enabling bounded latency and high reliability across multiple IP hops [ 5 ].
Scheduled traffic is a concept in TSN where transmission times of talkers are coordinated to avoid queuing in intermediate nodes so that frames traverse the network with minimal delay. This coordination is called scheduling and yields a network-wide schedule that ensures deterministic forwarding. Various traffic shaping mechanisms, such as Credit-Based Shaper (CBS), Asynchronous Traffic Shaper (ATS), and the Time-Aware Shaper (TAS) exist in TSN. Among them, the TAS stands out by leveraging a Time Division Multiple Access (TDMA)-like approach to protect scheduled traffic from interference, e.g., by best-effort flows, thereby ensuring low latency and bounded delay. This is achieved through transmission gates which periodically open and close queues. Further, Per-Stream Filtering and Policing (PSFP) is a TSN mechanism that combines rate and time-based policing to drop out-of-schedule frames.
DetNet leverages TSN concepts in combination with technologies such as MPLS or IP. In DetNet deployments, TSN can be employed as a sub layer to provide deterministic forwarding in sub networks. In practice, TSN implementations are typically hardware-based and optimized for bandwidths up to \qty 1. However, DetNet targets broader use cases, including high-throughput applications and data center backbones. This hierarchical composition allows for scalable designs that combine the high-speed capabilities of DetNet with the precise timing mechanisms of TSN. The integration of both technologies enables sophisticated traffic shaping and reliability mechanisms across backbone infrastructure.
The contribution of this paper is manifold. We present P4-TAS, a P4-based implementation of the TAS on the Intel Tofino™ 2 switching ASIC that enables TSN-compliant shaping and policing in the data plane. Our design introduces a novel mechanism for periodic queue control using a continuous stream of internally generated TAS control frames. It builds upon a mechanism from our prior P4-PSFP work [ 6 ] which uses the internal packet generator as a clock source. P4-TAS also incorporates PSFP and includes an MPLS/TSN translation layer [ 7 ], enabling TSN traffic shaping to be applied to DetNet flows at line rates up to 400 Gb/s. A key contribution of this work is the identification and quantification of internal processing delays that affect scheduling precision. Such delays are typically undocumented in commercial TSN-capable switches but are crucial for accurate traffic scheduling [ 8, 9 ]. Our implementation reveals multiple delay sources within the data plane and provides corresponding measurements on a nanosecond scale. We demonstrate that our approach achieves internal delays orders of magnitude smaller than those reported for some commercial platforms [ 9 ], offering transparency. Finally, we evaluate the scalability of P4-TAS and compare it to available TAS implementations.
The rest of the paper is structured as follows. In Section II, we provide background information on TSN, DetNet, and the P4 programming language. In Section III, we review related work on combining TSN and DetNet systems, and on simulations and hardware implementations of those technologies. Section IV introduces the P4 implementation, including our system architecture and the P4-TAS mechanism. In Section V, we evaluate internal delays, measure the transmission gate accuracy externally, analyze the scalability, and compare P4-TAS to other implementations. Finally, in Section VI, we conclude the paper.
In this section, we provide technical background on TSN, DetNet, and the P4 programming language.
We first give a brief overview of TSN, explain scheduled traffic, and then summarize the concepts of the TAS and PSFP.
TSN is a suite of IEEE 802.1 standards that augment traditional Ethernet to support deterministic communication with strict Quality of Service (QoS) guarantees. TSN networks are built from interconnected bridges and end stations. A data flow, referred to as a TSN stream, originates from a talker (sending station) and is directed to one or more listeners (receiving stations). A TSN stream is identified based on its VLAN tag, its Layer 2 destination address, and optionally other header fields [ 4 ]. Before a stream is allowed to transmit, it has to undergo admission control [ 4 ]. This process involves the talker advertising its traffic characteristics, e.g., latency requirements, through a stream descriptor. The network then decides whether to admit the stream by evaluating resource availability and making reservations accordingly.
In TSN, streams can be scheduled, i.e., their sending times at talkers are coordinated such that frames experience minimal delay at intermediate bridges. This coordination is computed offline and yields a network-wide schedule. The calculation of such schedules is outside the scope of this work. More information on scheduling in TSN can be found in a survey by Stüber et al. [ 10 ]. Time synchronization on a sub-microsecond scale is critical for scheduling in TSN. For that purpose, protocols like the Precision Time Protocol (PTP) are employed [ 11 ].
Scheduled streams in TSN are typically assigned the highest priority and must be protected from lower-priority traffic, e.g., best-effort traffic. This ensures that scheduled frames reach each intermediate node at their scheduled times. TAS and PSFP are mechanisms to protect scheduled traffic using gating mechanisms. Both are illustrated in Figure 1 and explained in the following.
The TAS, standardized in IEEE Std 802.1Qbv [ 1 ], provides time-based shaping at the egress. Each egress port provides eight FIFO queues associated with frame priorities from the VLAN tag [ 12 ]. These queues are controlled by transmission gates which are controlled by a gate control list (GCL). A GCL is a periodic sequence of entries, each specifying a time slice and a corresponding gate state. In the TAS, we call this the transmission GCL (tGCL). Each tGCL entry specifies a duration and an eight-bit vector indicating which of the eight transmission gates are open or closed. Frames in queues with an open transmission gate are transmitted in FIFO order while frames in queues with a closed transmission gate remain buffered. After all entries have been processed, the sequence repeats periodically with a cycle length \(h\).
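The tGCL semantics described above can be modeled in a few lines of software. The sketch below looks up the eight-bit gate vector that is active at a given time; the list-of-tuples representation and the function names are assumptions made for illustration. It uses a modulo operation for brevity, which is convenient in software but is not how a line-rate data plane would obtain the position within the cycle.

```python
def gate_vector(tgcl, t):
    """Return the gate vector active at time t.

    tgcl: non-empty list of (duration, gates); bit q of gates set means gate q is open.
    """
    cycle = sum(duration for duration, _ in tgcl)   # cycle length h
    offset = t % cycle                              # position within the current cycle
    for duration, gates in tgcl:
        if offset < duration:
            return gates
        offset -= duration
    raise AssertionError("unreachable: offset is always smaller than the cycle length")


def is_open(tgcl, t, queue):
    """True if the transmission gate of the given queue is open at time t."""
    return bool(gate_vector(tgcl, t) >> queue & 1)


# Two entries of 50 time units each: first only queue 0 is open, then only queue 1.
tgcl = [(50, 0b00000001), (50, 0b00000010)]
print(gate_vector(tgcl, 49), gate_vector(tgcl, 50), gate_vector(tgcl, 175))
```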
PSFP, standardized in IEEE Std 802.1Qci [ 3 ], enforces per-stream conformance at the ingress by combining rate policing with time-based policing. In this way, PSFP ensures adherence to the resource bounds established by admission control. While rate policing is a well-known mechanism, time-based policing targets scheduled traffic and is the focus of this work. For time-based policing, each stream is associated with a stream gate controlled by a periodic GCL which we call the stream GCL (sGCL). The sGCL defines the gate state over time and thereby the admitted transmission windows of the stream. Frames arriving outside their admitted window are dropped immediately, i.e., before queuing, preventing them from consuming reserved resources.
In Figure 1, two streams enter the TSN bridge. Based on their sGCLs, the first stream gate is open while the second is closed. Accordingly, frames of stream 1 are forwarded while frames of stream 2 are dropped by PSFP. Both streams then share the same egress port queues which are controlled by the tGCL. Here, only the first transmission gate is open, so only frames stored in queue 1 are transmitted.
Stream gates in PSFP differ from transmission gates in TAS in three ways. First, stream gates apply per stream whereas transmission gates apply per egress port and queue. Second, an sGCL entry defines the state of a single stream gate while a tGCL entry defines the states of all eight queues. Third, closed stream gates drop frames before queueing while closed transmission gates buffer frames in the queue.
The DetNet architecture enables real-time applications with extremely low packet loss rates and a bounded latency [ 5 ]. It is standardized by the IETF DetNet working group. DetNet operates on the networking, e.g., IP, layer and provides QoS and reliability to the lower layer, e.g., to MPLS and TSN. DetNet is applicable to networks under a single administrative control, e.g., to private WANs, or campus-wide networks.
The bounded latency in DetNet is achieved by eliminating packet loss resulting from queue congestion within a node. For that purpose, bandwidth and buffer resources are reserved at each node. Resource reservations can be made using the Resource reServation Protocol (RSVP). For traffic engineering within DetNet, mechanisms defined by the IEEE 802.1 working group such as the TAS are applicable.
The DetNet architecture separates the data plane functions into two sub-layers. First, the service sub-layer provides DetNet QoS mechanisms such as bounded latency and service protection, e.g., by adding sequence number information to packets. Second, the forwarding sub-layer provides connectivity between DetNet service sub-layer processing nodes [ 13 ]. Various data plane technologies for DetNet exist, e.g., DetNet over MPLS [ 13 ], and DetNet over IP [ 14 ]. With DetNet over MPLS, the forwarding and service sub-layers are identified by MPLS labels, called the Forward label (F-Label), and the Service label (S-Label). One or more F-Labels are used to forward the packet through the DetNet domain. The S-Label follows after the F-Labels and is used to identify the DetNet flow. Based on the identified DetNet flow, QoS mechanisms are applied. Further, a DetNet control word (d-CW) follows after the MPLS stack. This control word contains a sequence number for protection mechanisms of DetNet.
Standards exist that interconnect TSN networks using the DetNet MPLS data plane [ 15, 7 ]. For DetNet MPLS over TSN, DetNet flows are identified based on the S-Label at the DetNet / TSN domain border and are translated into TSN streams. For that purpose, IEEE Std 802.1CBdb [ 16 ] defines an MPLS DetNet flow identification which identifies the S-Label and pushes a new VLAN ID. Then, TSN stream identification is applied based on the new VLAN ID. With those interconnected data planes, TSN services such as the TAS and PSFP can be applied to DetNet flows.
Programming Protocol-independent Packet Processors (P4) is a domain-specific programming language to implement custom data planes in P4-programmable switches [ 17 ]. A P4 program can manipulate packets and make forwarding decisions to implement custom algorithms. In the following, we describe the concepts of the P4 pipeline, the packet generator, and a feature called advanced flow control (AFC). A survey by Hauser et al. provides more information on P4 [ 18 ].
P4-programmable switches are called targets and implement a specific architecture. The Intel Tofino™ 2 switching ASIC is a hardware-based P4 target. Typically, a P4 architecture follows a pipelined structure. The pipeline of the Tofino Native Architecture (TNA), the architecture used by the Intel Tofino™, is illustrated in Figure 2.
The TNA consists of an ingress block and an egress block, each with a programmable parser, control blocks, and a deparser. After processing frames in the ingress control block, frames are queued in the traffic manager component of the TNA. This component is configurable but not programmable [ 19 ].
Control blocks in a P4 program define the logic of the algorithm. They leverage metadata for packet processing. A P4 program defines two different types of metadata. First, user-defined metadata stores information during the pipeline processing. Second, intrinsic metadata contains information given by the architecture, e.g., the ingress timestamp of a frame, and the ingress port. Control blocks are composed of match+action tables. The concept of a MAT is illustrated in Figure 3 [ 20 ] and explained in the following.
In a MAT, selected packet header fields and metadata form a composite key. Each packet is matched in the MAT according to the selected key fields. On a match in the table, an associated action is executed which can manipulate packet data or make a forwarding decision. The data plane defines the structure of a MAT, i.e., the key fields, and the actions. However, the content of these MATs is filled by the control plane. Further, registers are a commonly used feature in P4 that allow for stateful processing of packets.
P4 control blocks support logical and simple arithmetic expressions but do not support loops to maintain line rate processing. To enable iterative algorithms, packets can be recirculated. Modified headers from the first pass are available in the second. Recirculation introduces delay and requires dedicated ports. Architectures like the TNA offer internal recirculation ports, or can provision physical ports for recirculation.
The Intel Tofino™ natively supports time synchronization using PTP [ 11, 19 ]. Further, Kannan et al. [ 21 ] propose a data plane implementation of PTP which can be leveraged to achieve high-precision time-synchronization.
The TNA provides an internal packet generator which can be configured to generate packets through a dedicated internal port. Generated packets are processed in the pipeline. Multiple applications with different triggers, such as a periodic trigger, can be configured to trigger packet generation. Further, the packet generator can be configured to generate \(B\) batches with a batch size of \(K\) packets each to enable packet bursts. A generated packet contains a packet generation header added by the traffic generator. This packet generation header identifies the application, the batch number, and the packet number in the batch [ 19 ].
A feature specific to the Intel Tofino™ 2 is advanced flow control (AFC) which enables control over the queues, i.e., dispatching or holding back frames, of an egress port in the traffic manager. The queue state is manipulated by writing an AFC value into a packet’s intrinsic metadata during pipeline processing. As this operation must be triggered by an incoming packet, each queue state change is initiated by packet arrival. A single packet can control exactly one queue. The AFC value is computed based on the egress port, queue ID, and the desired queue state. Importantly, the controlled queue does not need to correspond to the egress port or queue assigned to the processed packet itself.
In this section, we review related work on the combination of TSN and DetNet systems, and on simulations and hardware implementations of those technologies.
The integration of TSN and DetNet has received considerable attention in recent years due to their critical role in facilitating ultra-low latency communication in 5G networks. Nasrallah et al. [ 22 ] provide a comprehensive overview of TSN and DetNet technologies, emphasizing their importance for time-critical applications in 5G environments. Building on this foundation, Abuibaid et al. [ 23 ] conduct a case study that measures the performance of TSN and DetNet in a practical 5G setting. Furthermore, Wüsteney et al. [ 24 ] propose a latency model for time-sensitive communication traversing networks that integrate TSN and DetNet. Menendez et al. [ 25 ] present a software-based implementation of the TAS using XDP and eBPF. Further, they integrate TSN functionality into DetNet environments with a MPLS over UDP/IP data plane. While their open-source implementation represents a significant step towards TSN / DetNet integration, their evaluation does not consider internal timing behavior and only measures traffic rates up to \qty 600.
Despite these advances, challenges remain in realizing efficient hardware implementations that integrate TSN and DetNet functionalities while identifying and quantifying internal timing behaviour which this work aims to address.
Numerous simulation frameworks have been developed to model TSN [ 26, 27 ], DetNet [ 28 ], or both [ 29 ]. In particular, Addanki et al. [ 29 ] offer a simulator that integrates building blocks for DetNet at the network layer and TSN at the link layer. Polverini et al. [ 30 ] describe a P4-based DetNet implementation for the BMv2 software target leveraging an SRv6 data plane for reliability. While such simulations are valuable for exploring the interaction between TSN and DetNet, they do not fully address the challenges of real-world deployment. Implementing time-sensitive mechanisms in hardware introduces additional complexity due to resource constraints and timing precision requirements. Ahmed et al. [ 31, 32 ] provide FPGA-based implementations of the CBS and ATS of TSN, while we presented a P4-based hardware implementation of the PSFP mechanism on an ASIC [ 6 ].
Several commercial hardware platforms support TAS and PSFP. NXP’s automotive-grade SJA1105TEL switch ASIC provides eight egress queues per port with a time granularity of \qty 200. This ASIC supports time-gated transmission of up to 1024 flows [ 33, 34 ]. Similarly, Microchip’s SparX-5i [ 35 ] and PD-IES008 [ 36, 37 ] families expose time interval configuration with nanosecond granularity and support up to 10,000 tGCL entries. These platforms demonstrate that TAS is available in hardware, but published information typically stops at high-level feature descriptions such as queue counts, GCL sizes, or time granularity. A summary of these capabilities is provided in Section V-D and compared against P4-TAS in Table II. Although these platforms claim nanosecond configuration granularity, reliable queue updates at this scale are not feasible in practice due to undocumented internal delays and hardware limitations. A recent work by Eppler et al. [ 9 ] quantified such undocumented timing behavior inside commercial TSN switches. Their measurements reveal internal scheduling and gate transition delays in the order of hundreds of nanoseconds to several microseconds which is significant for schedule synthesis and can lead to missed transmission windows if not accounted for.
In this work, we present a hardware implementation of selected TSN and DetNet mechanisms on a programmable ASIC. Unlike prior academic or commercial platforms, our design enables a transparent evaluation of internal delays, thereby offering deeper insights into their behavior and integration in real systems.
In this section, we describe the implementation of the P4-TAS switch incorporating the PSFP and the TAS mechanism on the Intel Tofino™ 2 switching ASIC. First, we describe the system architecture and integration into DetNet domains. Then, we present the implementation of the TAS mechanism in P4. Finally, we explain improvements to the P4-PSFP implementation. The source code is publicly available on GitHub [ 38 ].
The P4-TAS implementation is designed as an Ethernet switch that provides TSN functionality. It performs TSN stream identification as defined in IEEE Std 802.1CB [ 4 ] and applies traffic shaping with the TAS as well as policing with PSFP. These mechanisms allow P4-TAS to operate natively inside a TSN domain and provide deterministic forwarding for TSN streams. In addition to its role within a pure TSN network, P4-TAS can also act as a border element between a DetNet domain and a TSN domain. In this case, it processes incoming MPLS-encapsulated DetNet flows and translates them into TSN streams based on IEEE Std 802.1CBdb [ 16 ]. This enables DetNet to leverage the TSN sub-layer for scheduling and shaping. The integration scenario is illustrated in Figure 4.
At the ingress to the TSN domain (step 1), the P4-TAS switch translates DetNet flows into TSN streams by pushing a VLAN tag based on the S-Label. Afterwards, TSN stream identification is applied using the destination MAC address and the pushed VLAN tag (step 2). Subsequently, the identified TSN stream is subjected to traffic shaping and policing with TAS and PSFP (step 3), and the frame is forwarded through the TSN domain. At the egress (step 4), the VLAN tag is removed to restore the original DetNet flow.
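The four steps above can be traced with a small software model. The sketch below represents frames as dictionaries and the two lookup tables as plain mappings; all names and values are hypothetical, and shaping and policing (step 3) are left out because they act on the identified stream rather than on the headers.

```python
def ingress(frame, s_label_to_vlan, stream_table):
    """Steps 1 and 2: push a VLAN ID derived from the S-Label, then identify the TSN stream."""
    frame = dict(frame)                                   # do not modify the caller's frame
    if "s_label" in frame:
        frame["vlan_id"] = s_label_to_vlan[frame["s_label"]]          # step 1
    stream = stream_table.get((frame["dst_mac"], frame.get("vlan_id")))  # step 2
    return frame, stream


def egress(frame):
    """Step 4: remove the VLAN tag to restore the original DetNet flow."""
    frame = dict(frame)
    frame.pop("vlan_id", None)
    return frame


# Hypothetical table contents for illustration.
s_label_to_vlan = {1001: 42}
stream_table = {("00:11:22:33:44:55", 42): 7}

frame = {"s_label": 1001, "dst_mac": "00:11:22:33:44:55"}
tagged, stream = ingress(frame, s_label_to_vlan, stream_table)
print(tagged["vlan_id"], stream, egress(tagged))
```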
The TAS defined in IEEE Std 802.1Qbv periodically opens and closes multiple egress queues according to a tGCL. Periodic behavior, such as in a GCL, is not natively supported by P4. Further, queue states on the Intel Tofino™ 2 can be controlled with AFC, but such changes can only be triggered by the arrival of a frame, and each frame can update only a single queue of one egress port. To implement the TAS under these constraints, P4-TAS combines three building blocks: a periodic time model for the tGCL, a dedicated stream of continuous TAS control frames to trigger AFC updates, and a tGCL MAT that maps control frames to queue state changes. They are described in the following. Finally, an overview is given of how they operate together in the pipeline.
Modeling the periodicity of GCLs in programmable P4 hardware is challenging since periodic behavior is not natively supported. Timestamps in the Intel Tofino™ are absolute, i.e., their values continuously increase, whereas the time slices in GCLs are relative and follow a periodic pattern. Thus, each frame’s absolute timestamp must be mapped to its corresponding position within the current GCL period. While this could be achieved with a modulo operation, such operations are too complex to perform at line rate in the data plane.
In our previous work [ 6 ], we described an approach to model the periodicity of sGCLs in a P4-based PSFP implementation. In this approach, we leveraged the internal packet generator of the Intel Tofino™ as a clock source. At the end of each GCL cycle, the internal packet generator generates a period-completion frame. The ingress timestamp of this frame is stored in a register and references the timestamp of the last completed period. For all other frames, i.e., non-period-completion frames, the ingress pipeline subtracts this stored value from the frame’s absolute timestamp to obtain a relative timestamp within the ongoing sGCL cycle. In this way, every frame is mapped into a relative time window of one cycle length. The absolute hardware clock of the switch can be used consistently while the sGCL is treated as a repeating list of entries. We leverage this mechanism to implement the periodicity of tGCLs for the TAS. However, unlike sGCLs in PSFP, where each sGCL entry opens or closes a single stream gate, i.e., admits or drops a frame, a tGCL entry must control multiple transmission gates by opening and closing queues. Thus, the periodicity mechanism serves as the basis for the TAS, but additional mechanisms are required to continuously update the gate states of all egress queues during each tGCL entry.
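The register-based mapping from absolute to relative timestamps can be illustrated with a small software model. The following Python sketch is not the P4 data plane code; the class and method names are hypothetical, and the 48-bit wraparound mask is an assumption derived from the 48-bit timestamp width of the TNA.

```python
TIMESTAMP_MASK = (1 << 48) - 1  # assumption: timestamps are 48-bit nanosecond counters


class PeriodClock:
    """Software model of the register that stores the last period boundary."""

    def __init__(self) -> None:
        # Models the register written by period-completion frames.
        self.period_start_ns = 0

    def on_period_completion(self, ingress_ts_ns: int) -> None:
        # A period-completion frame stores its ingress timestamp.
        self.period_start_ns = ingress_ts_ns & TIMESTAMP_MASK

    def relative_timestamp(self, ingress_ts_ns: int) -> int:
        # All other frames subtract the stored value; the mask keeps the
        # result non-negative if the hardware counter wraps around.
        return (ingress_ts_ns - self.period_start_ns) & TIMESTAMP_MASK


clock = PeriodClock()
clock.on_period_completion(1_000_000)
relative_ns = clock.relative_timestamp(1_000_250)
```

The subtraction replaces the modulo operation: as long as a period-completion frame arrives once per cycle, the result stays within one cycle length.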
Queues on the Intel Tofino™ 2 can be opened or closed by processing intrinsic AFC metadata of a frame in the pipeline. The controlled queue does not need to correspond to the egress port or queue assigned to the frame itself. A single frame can control exactly one queue of one port of the switch. In this section, we explain the concept of timely control of all queues which implements the tGCL.
To implement tGCL queue state changes with AFC on the Intel Tofino™ 2, each queue state update must be triggered by the arrival of a frame. For this purpose, P4-TAS employs the internal packet generator to continuously produce TAS control frames. These frames are generated in back-to-back batches of eight so that each queue is assigned one frame. Each TAS control frame carries intrinsic metadata with the identifiers of the queue and egress port it controls. Upon arrival, its position in the tGCL cycle is calculated based on the frame’s arrival timestamp as described in Section IV-B1. Then, the position in the tGCL cycle and the intrinsic metadata are matched against the tGCL MAT which specifies whether the corresponding queue should be opened or closed at that point in time.
The TAS control frames are minimally sized (64 B) and contain no payload beyond intrinsic metadata. They are continuously generated with a minimal inter-arrival time to ensure that queue states follow the configured tGCL precisely. This mechanism does not consume bandwidth for user traffic since the internal packet generator and a dedicated internal port are exclusively used for the TAS control traffic. In practice, there is a short delay between consecutive TAS control frames. Since a queue can only change state when its associated control frame arrives, delayed opening or closing may occur. We evaluate the impact of this behavior in Section V-A3.
The tGCL is encoded as a MAT in the egress pipeline and is shown in Figure 5. TAS control frames from Section IV-B2 are matched against it.
Each entry in the MAT corresponds to one of the eight queues of a tGCL entry, i.e., eight MAT entries per tGCL entry are required. The entry specifies whether a queue should be currently open or closed. The lookup key is composed of the relative timestamp in the tGCL which is calculated according to Section IV-B1, the queue identifier, and the egress port. The MAT action writes a precomputed AFC value which encodes the queue, the egress port, and the state, into the frame’s intrinsic metadata. This triggers the queue state update. The queue state update has a small delay which is evaluated in Section V-A2.
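As an illustration of the table layout, the following Python sketch expands a tGCL into eight table entries per tGCL entry and performs the lookup that a TAS control frame would trigger. The data structure, the gate bitmap encoding, and all names are assumptions made for this example; the hardware-specific AFC value is not modeled and is replaced by a plain open/closed flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TgclMatEntry:
    start_ns: int    # inclusive start of the time slice (relative timestamp)
    end_ns: int      # exclusive end of the time slice
    port: int        # egress port controlled by the entry
    queue: int       # queue (0..7) controlled by the entry
    gate_open: bool  # stands in for the precomputed AFC value


def build_tgcl_mat(tgcl, port):
    """tgcl: list of (duration_ns, gate_bitmap); bit q set means queue q is open."""
    entries = []
    start = 0
    for duration_ns, bitmap in tgcl:
        for queue in range(8):  # eight MAT entries per tGCL entry
            is_open = bool((bitmap >> queue) & 1)
            entries.append(TgclMatEntry(start, start + duration_ns, port, queue, is_open))
        start += duration_ns
    return entries


def lookup(mat, rel_ts_ns, port, queue):
    """Return the gate state for a control frame, or None on a table miss."""
    for entry in mat:
        if (entry.port == port and entry.queue == queue
                and entry.start_ns <= rel_ts_ns < entry.end_ns):
            return entry.gate_open
    return None


mat = build_tgcl_mat([(100, 0b00000001), (100, 0b00000010)], port=1)
```

In this example, a control frame for queue 0 that arrives at relative time 150 hits the second slice and closes the queue.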
The mechanisms in Sections IV-B1 to IV-B3 operate together within the P4-TAS pipeline as illustrated in Figure 6.
First, generated period-completion frames mark the boundaries of the tGCL and sGCL cycles and maintain the reference for relative timestamp calculation in step 1. Here, a single frame is generated at the end of each period with duration $h$, and the timestamp $t^{h}_{j}$ of the $j$-th period is stored in a register for subsequent processing. Afterward, those frames are dropped.
Second, TAS control frames are continuously generated by the packet generator with a minimal inter-arrival time. For those frames, the timestamp relative to the last elapsed period of the tGCL is calculated in step 2. This timestamp is used to match the TAS control frame to the corresponding entry of the tGCL MAT. After queuing the TAS control frame in a dedicated queue of the traffic manager, the AFC mechanism is applied in the egress in step 3. Here, the corresponding queue is opened or closed based on the current tGCL entry using the MAT described in Section IV-B3. Afterward, the frames are dropped.
Third, TSN data frames are policed in the ingress pipeline by the PSFP mechanism to enforce conformance with admitted rates and transmission times. For those frames, the timestamp relative to the last elapsed period of the sGCL is calculated in step 4, and PSFP is applied in step 5. In this step, the frames are policed and either dropped or queued according to their priority. Each queue is either open or closed, as determined by the tGCL. Frames are forwarded in a FIFO manner as soon as their queue opens.
The Intel Tofino™ 2 ASIC used in this work provides hardware support for IEEE 1588 PTP [ 19 ]. PTP enables sub-microsecond synchronization accuracy by exchanging timestamped messages between network nodes to align their local clocks. This functionality can be implemented entirely with on-board resources of the ASIC [ 19 ]. However, the P4-TAS implementation does not include a PTP synchronization mechanism since integrating such functionality is beyond the scope of this work and not required for the evaluations presented in this paper. Prior work has demonstrated that precise PTP synchronization on Tofino-based switches can be achieved by combining hardware timestamping with control plane clock management [ 39, 19 ] or even entirely within the data plane [ 21 ]. Nevertheless, the TAS functionality and internal delay characteristics we evaluate are largely independent of network-wide time alignment. Future work will explore the integration of P4-TAS into a synchronized multi-hop TSN testbed to enable coordinated, time-aware scheduling across multiple devices.
P4-TAS incorporates the previous P4-PSFP implementation [ 6 ]. The PSFP components stream filter, stream gate, and flow meter are implemented according to IEEE Std 802.1Qci [ 3 ]. The functionality of P4-PSFP has been extensively evaluated in [ 6 ]. In this section, we describe improvements to P4-PSFP that eliminate recirculation, and increase the time resolution of GCLs.
P4-PSFP recirculates TSN traffic for two reasons. First, calculating the relative position in an sGCL does not fit in a single pipeline iteration. Second, the optional maximum frame size filter defined in IEEE Std 802.1Qci [ 3 ] requires frame size information that is only available in the egress block while drops must occur in the ingress block. Thus, recirculation is necessary, adding a known constant delay. For P4-TAS, we ported the implementation of P4-PSFP from Intel Tofino™ to Tofino™ 2 where the larger pipeline allows the GCL position to be computed in one pass. We also removed the optional maximum frame size filter, eliminating the need for recirculation. If required, the filter can be re-added at the cost of a recirculation.
sGCL entries in P4-PSFP are modeled as MAT entries with the range matching type. However, the range matching type is limited in the TNA and only 20 bits can be matched. Timestamps in the TNA are 48 bits wide with nanosecond granularity. Therefore, in P4-PSFP, 20 bits are cut out of the middle of the timestamp to enable the range matching type with an appropriate time resolution. Thus, GCLs have a minimum resolution of \qty 2 and a maximum resolution of approximately \qty 4. GCL entries with a lower resolution, or GCLs that last longer cannot be defined in P4-PSFP. However, due to hardware limitations, P4-TAS requires small intervals between tGCL entries where a minimum resolution of \qty 2 is too large. This is further elaborated in Section V-B3. Therefore, we employ an algorithm called range-to-ternary conversion [ 40 ] to increase the resolution of time slices. This algorithm allows a single range entry to be modeled using multiple ternary entries.
The algorithm takes an integer range $[L, R]$ representing a time slice and breaks it down into the smallest possible set of prefixes that collectively cover the entire range. It does this by repeatedly selecting the largest prefix starting at the current lower bound that remains fully within the range. These selected blocks together ensure complete coverage of the interval [ 40 ]. Some example conversions are given in Figure 7. Each block in Figure 7 denotes a ternary entry that covers parts of the range. The $*$ denotes a “don’t care” bit, meaning the bit can take either value 0 or 1.
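The greedy prefix selection described above can be sketched in a few lines of Python. This is a minimal illustration of the textbook prefix expansion, not the code used in P4-TAS; the function name and the string representation of ternary entries are chosen for this example only.

```python
def range_to_ternary(low: int, high: int, width: int) -> list[str]:
    """Cover the inclusive range [low, high] with prefix entries.

    Each entry is a string of '0', '1', and trailing '*' (don't care) bits.
    """
    if not 0 <= low <= high < (1 << width):
        raise ValueError("range must fit into the given bit width")
    entries = []
    current = low
    while current <= high:
        # Largest power-of-two block that is aligned at the current lower bound.
        size = (current & -current) if current else (1 << width)
        # Shrink the block until it lies fully inside the range.
        while current + size - 1 > high:
            size >>= 1
        wildcard_bits = size.bit_length() - 1
        fixed_bits = width - wildcard_bits
        prefix = format(current >> wildcard_bits, f"0{fixed_bits}b") if fixed_bits else ""
        entries.append(prefix + "*" * wildcard_bits)
        current += size
    return entries


example = range_to_ternary(1, 14, 4)
# example == ['0001', '001*', '01**', '10**', '110*', '1110']
```

The range [1, 14] with 4 bits is a worst case and needs six entries, whereas an aligned power-of-two range such as [0, 15] collapses into the single entry `****`.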
In a GCL, time slices are defined as consecutive, non-overlapping ranges. Under these constraints, Sun [ 41 ] has proven that the solution is both correct and unique.
With this algorithm, GCLs have a resolution of 1 ns to 78 h. The upper bound of 78 h exceeds the requirements of GCL periods by far and is not necessary in TSN networks. However, the full 48-bit timestamp range is available for matching, and reducing the resolution does not have a benefit. The number of ternary table entries required by this conversion algorithm to model GCLs is evaluated in Section V-C.
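The upper end of this resolution follows directly from the timestamp width. As a short worked check, assuming a 48-bit counter with one tick per nanosecond:

```latex
2^{48}\,\text{ns} = 281{,}474{,}976{,}710{,}656\,\text{ns}
\approx 2.815 \times 10^{5}\,\text{s}
\approx 78.19\,\text{h}
```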
In this section, we evaluate the P4-TAS implementation. First, we identify and quantify internal delays including the traffic generator accuracy, the queue opening delay, and the TAS control frame delay. Next, we externally measure the duration of tGCL entries and introduce gate switching intervals (GSIs) to mitigate transitional behavior between tGCL entries resulting from the queue opening delay. Then, we assess the scalability of P4-TAS by analyzing the number of supported tGCL and sGCL entries, and the maximum number of streams for identification of DetNet and TSN flows. Finally, we compare P4-TAS to available TAS implementations.
Most TSN scheduling approaches assume ideal switch behavior and neglect implementation-specific effects such as internal delays or jitter. Stüber et al. [ 8 ] address this by proposing a scheduling algorithm that accounts for such inaccuracies. They emphasize the need to consider hardware-induced variability in TAS configurations. While their work focuses on scheduling-level robustness, we take a complementary approach by identifying and quantifying undocumented internal delay sources in a hardware implementation. These findings can support the design of more accurate and robust schedules.
Franco et al. [ 42 ] profile the latency behavior of the Intel Tofino™ ASIC. They analyze factors such as parsing depth and MAT complexity. However, beyond processing delays, additional delay sources exist within TSN bridges that are not typically disclosed [ 9 ]. We quantify several of them in our P4-TAS implementation on the Intel Tofino™ 2 platform. While the measurement results are specific to the P4-TAS implementation on the Intel Tofino™ 2 ASIC, the sources of those delays are also present in other hardware [ 9, 35 ].
First, we evaluate the accuracy of the internal traffic generator which affects the timing of period-completion frames. We then analyze the queue opening delay of the AFC mechanism. Finally, we measure a delay introduced by the packet generator used for TAS control frames, and give a summary of the measurements.
P4-TAS uses the internal packet generator to signal the completion of each tGCL cycle with a configured period $h$ as described in Section IV-B1. A period-completion frame is generated every $h$ ns, and the timestamp of the $j$-th period, denoted as $t^{h}_{j}$, is stored in a register. Due to limitations of the packet generator, small timing deviations may occur. To quantify this effect, we measure the difference between the timestamps of consecutive period-completion frames, i.e., $t^{h}_{j+1}$ and $t^{h}_{j}$, relative to the configured period $h$. The deviation $\hat{\delta}_{\text{TG}}$ is defined in Equation 1:
$$\hat{\delta}_{\text{TG}} = (t^{h}_{j+1} - t^{h}_{j}) - h. \tag{1}$$
This value is recorded as a time series in a register in the data plane. Based on use cases identified by Stüber et al. [ 43 ], we select representative periods $h$: \qty 500 for factory automation, \qty 2 for industrial isochronous traffic, and \qty 128 for aerospace applications. Additionally, we include \qty 10, \qty 499 and \qty 501 to analyze edge cases and artifacts, and \qty 400 since this period is used for the evaluation in Section V-B. For each period, we record the timestamps of 16,000 period-completion frames. Figure 8 shows the results.
The boxplot in Figure 8 shows the median as a red line, the first and third quartiles as the edges of the box, and whiskers that extend to 1.5 times the interquartile range. Values outside this range are plotted as outliers. A positive $\hat{\delta}_{\text{TG}}$ indicates that the actual period exceeded the configured value by $\hat{\delta}_{\text{TG}}$ while a negative value means it was shorter by that amount.
Most periods show deviations below $\hat{\delta}_{\text{TG}} = 2$ ns, with all outliers staying within $\pm 11$ ns. Exceptions occur at periods of \qty 400 and \qty 500, which show a wider spread with fewer outliers. We attribute this to internal scheduling behavior of the packet generator in the Intel Tofino™ switching ASIC. Shifting the period slightly, e.g., to \qty 499 or \qty 501, results in deviations similar to the other configurations.
Although these deviations are small, they can impact the periodicity computation. If a period-completion frame arrives late, the computed relative position within the current GCL cycle may exceed the period $h$, which would index an out-of-period entry. To ensure that all frames are assigned to a valid tGCL entry, P4-TAS clamps any calculated position $\geq h$ to the final entry of the cycle. Conversely, if a period-completion frame arrives early, the periodicity mechanism in Section IV-B1 semantically evaluates the position modulo $h$, so the result always lies in $[0, h)$. Therefore, the deviation from the configured period is compensated and all frames are mapped to existing GCL entries.
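The clamping rule can be made concrete with a short Python sketch. The function below is a software illustration under the assumption that a tGCL is given as a list of entry durations; the name `entry_index` and this representation are hypothetical and do not mirror the MAT-based implementation.

```python
from bisect import bisect_right
from itertools import accumulate


def entry_index(rel_ts_ns: int, durations_ns: list[int]) -> int:
    """Map a relative timestamp to the index of the active GCL entry."""
    # Exclusive end time of each entry, e.g. [50, 100, 150].
    ends = list(accumulate(durations_ns))
    period_ns = ends[-1]
    if rel_ts_ns >= period_ns:
        # A late period-completion frame pushes the position past the
        # period; clamp to the final entry of the cycle.
        return len(ends) - 1
    return bisect_right(ends, rel_ts_ns)


late_frame_entry = entry_index(155, [50, 50, 50])
```

With three entries of equal duration 50, a position of 155 lies beyond the period of 150 and is therefore assigned to the last entry (index 2).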
In the TNA, there is a small but non-zero delay between writing the AFC value, i.e., between initiating a queue state change, and the actual update of the queue state in the hardware [ 44 ]. To quantify internal delays in the AFC mechanism, we measure the time between issuing a queue state change and the actual release of TSN frames. We denote this queue opening delay as $\hat{\delta}_{\text{queue}}$; measurements showed that queue opening and closing delays are distributed in the same way in the TNA. This delay impacts TSN precision and is rarely documented in available hardware. The measurement procedure is implemented in the data plane of P4-TAS and is shown in Figure 9.
In Figure 9, a closed queue is first filled with TSN frames (step 1). When a TAS control frame matches a tGCL entry that opens the queue, it triggers a queue opening via AFC and records the timestamp $t_{\text{change}}$ (step 2). The dequeuing timestamp $t_{\text{deq}}$ of the first TSN frame leaving the queue is then used to compute $\hat{\delta}_{\text{queue}}$ as shown in Equation 2 (step 3):
$$\hat{\delta}_{\text{queue}} = t_{\text{deq}} - t_{\text{change}}. \tag{2}$$
This value is stored as a time series in a register of the data plane for all observed transitions (step 4).
The tGCL for this measurement is configured with eight consecutive entries, one per priority. Each entry opens the corresponding priority queue for \qty 100, so that the schedule cycles through all eight priorities in turn. TSN traffic is generated using P4TG [ 45, 46, 47 ] at \qty 400 with randomized priorities and 64 B frames. This ensures that the queues are saturated. The experiment is run for \qty 60. Figure 10 shows the complementary cumulative distribution function (CCDF) of the measured queue opening delay $\hat{\delta}_{\text{queue}}$.
Most delays are below $\hat{\delta}_{\text{queue}} = 11$ ns with a tail extending up to 63 ns and a mean of $\mu(\hat{\delta}_{\text{queue}}) = 14.63$ ns. These results reveal small but measurable internal delays. In particular, the queue opening delay can cause transitional behavior at tGCL boundaries where frames from the previous entry may still be transmitted briefly after the next entry has started. The impact of this effect and the role of gate switching intervals (GSIs) are evaluated in Section V-B3.
For TAS control frames, the internal packet generator is configured to generate a frame every nanosecond. The frames are sequentially generated in batches of eight, with each frame controlling one of the eight priority queues. In practice, however, a frame cannot be generated every nanosecond. Instead, a small delay occurs between frame generation which limits the granularity at which queue state updates can be triggered. To quantify this phenomenon, we collect the timestamp of each TAS control frame in the data plane of P4-TAS and compute the delay $\hat{\delta}_{\text{control}}$ between two consecutive frames $i$ and $i+1$:
$$\hat{\delta}_{\text{control}} = t_{i+1} - t_{i}. \tag{3}$$
We collect 100,000 values for $\hat{\delta}_{\text{control}}$, all calculated in the data plane and stored in a time series register. The resulting histogram is shown in Figure 11.
The measured median is $\hat{\delta}_{\text{control,M}} = 9$ ns, with only a few frames showing a slightly higher delay of up to 12 ns. Thus, transmission gate states can be updated only every 9 ns. Because frames are generated sequentially in batches of eight, updates for different priority queues are offset sequentially by 9 ns and cannot occur simultaneously. Further, this means that the transmission gate state update of the same priority can be triggered every $8 \cdot \hat{\delta}_{\text{control}} \approx 72$ ns. This value should be seen as a worst-case upper bound. In practice, the effective delay can be close to zero if a control frame arrives just before a scheduled gate change. Such a short delay only matters if the tGCL entry resolution is on the order of 72 ns which is much smaller than typical tGCL entry durations [ 8 ].
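The relation between the per-frame spacing and the per-queue update interval can be checked with a small model. The Python sketch below assumes an idealized generator in which control frames are spaced exactly 9 ns apart and cycle through the eight queues in order, starting with queue 0 at time 0; the function name and this idealization are assumptions for illustration.

```python
def update_lag_ns(change_time_ns: int, queue: int,
                  spacing_ns: int = 9, queues: int = 8) -> int:
    """Time from a scheduled gate change until the queue's next control frame."""
    cycle_ns = spacing_ns * queues   # 8 * 9 ns = 72 ns between frames of one queue
    offset_ns = queue * spacing_ns   # queue q is served q * 9 ns into each batch
    # Control frames for this queue arrive at offset_ns + k * cycle_ns.
    return (offset_ns - change_time_ns) % cycle_ns


best_case = update_lag_ns(0, queue=0)   # frame arrives exactly at the change
worst_case = update_lag_ns(1, queue=0)  # frame was just missed
```

In this model the lag ranges from 0 ns to 71 ns, which matches the interpretation of 72 ns as an upper bound rather than a typical value.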
Table I gives an overview of the identified and measured internal delays in the best and in the worst case.
Those internal delays accumulate to $\Delta_{\text{internal}}$ shown in Equation 4:
$$\Delta_{\text{internal}} = \delta_{\text{TG}} + \delta_{\text{queue}} + \delta_{\text{control}}. \tag{4}$$
The internal delay $\Delta_{\text{internal}}$ may reduce or extend the duration of a tGCL entry. Figure 12 illustrates this effect for three consecutive tGCL entries of configured duration $d$.
If the preceding tGCL entry $i-1$ experiences a negative internal delay, it is shortened while tGCL entry $i$ is extended. In addition, tGCL entry $i$ itself may experience a positive delay. In this case, the actual duration of tGCL entry $i$ becomes
$$\hat{d}_{i} = d + |\Delta^{i-1}_{\text{internal}}| + \Delta^{i}_{\text{internal}}. \tag{5}$$
In the worst case, $\Delta^{i}_{\text{internal}}$ is composed of the maximum traffic generator deviation, queue opening delay, and control traffic delay: $\Delta^{i}_{\text{internal},\max} = 11\,\text{ns} + 63\,\text{ns} + 12\,\text{ns} = 86\,\text{ns}$. Further, $\Delta^{i-1}_{\text{internal}}$ can be negative if the traffic generator deviation is negative and all other delays are close to zero, yielding up to 11 ns of shortening. This implies that a tGCL entry may be extended by up to 86 ns, or be shortened by 11 ns. Further, through correlation of consecutive tGCL entries, a tGCL entry may be extended by up to 97 ns as shown in Figure 12.
In the best case, a TAS control traffic frame arrives exactly at the switchover point to a new tGCL entry, resulting in a control traffic delay of $\delta_{\text{control},\min} = 0$ ns. Combined with the measured best case queue delay $\delta_{\text{queue},\min} = 1$ ns and traffic generator accuracy $\delta_{\text{TG},\min} = 0$, the best case internal delay is $\Delta_{\text{internal},\min} = 1$ ns.
These values therefore define a theoretical bound for deviations in tGCL entry duration. The following evaluation section examines how often and to what extent such deviations occur in practice.
P4-TAS enables the configuration of tGCLs and their periods with nanosecond granularity. However, the internal delays characterized in Section V-A may introduce deviations between the configured and the actual durations of tGCL entries. This section evaluates the accuracy of configured tGCL entries by comparing the expected duration with the measured duration observed in the data plane. First, we present the testbed and describe the measurement procedure. Then, we analyze the results and introduce gate switching intervals (GSIs) to improve timing accuracy.
The testbed for the external tGCL entry measurement is shown in Figure 13.
Traffic is generated with P4TG [ 45, 46, 47 ] at a rate of 514 Mpps using minimum-size 64 B frames and a constant inter-arrival time, i.e., no bursts. Each frame is assigned a random priority sampled from a uniform distribution and is encapsulated with MPLS to validate the DetNet translation. A tGCL with a period of \qty 400 divided into eight \qty 50 entries is configured in P4-TAS. During each entry, only one of the eight queues is open, corresponding to one priority. Incoming MPLS traffic is translated into a TSN stream, after which the configured tGCL is applied based on the resulting TSN stream identifier. After shaping by the TAS, the traffic is forwarded to a third Tofino™ switch which records frame arrival times per priority in a dedicated P4 program.
The measurement procedure in the dedicated P4 program on the third switch is based on detecting changes in priority within the received stream. It assumes that frames of only one priority $\pi \in \{0, \ldots, 7\}$ arrive at the measurement switch during each tGCL entry as configured in P4-TAS. A series of timestamps of the first and last frame in a tGCL entry, i.e., of the same priority, is collected and stored in the data plane. This is illustrated in Figure 14.
For priority $\pi=0$, the arrival time of the first frame in the $i$-th tGCL entry is stored as $t^{i,\pi=0}_{\text{first}}$. When the next priority $\pi=1$ appears, the arrival time of the last frame of the previous priority $\pi=0$ is stored as $t^{i,\pi=0}_{\text{last}}$, and the new frame marks $t^{i+1,\pi=1}_{\text{first}}$. This is shown in step 1 in Figure 14. The control plane calculates the duration of entry $i$ for priority $\pi$ as follows:
$$\hat{d}_{i}^{\pi} = t^{i,\pi}_{\text{last}} - t^{i,\pi}_{\text{first}}. \tag{6}$$
The measured tGCL entry duration is then compared with the configured tGCL entry duration of $d = \qty{50}{}$, and the deviation $\hat{\delta}^{i,\pi}_{\text{slice}}$ is obtained as
$$\hat{\delta}^{i,\pi}_{\text{slice}} = \hat{d}_{i}^{\pi} - d. \tag{7}$$
A negative value for $\hat{\delta}^{i,\pi}_{\text{slice}}$ thus means that the measured tGCL entry duration was shorter than the configured duration while a positive value means that it was longer. A total of 32,764 values for $\hat{\delta}^{i,\pi}_{\text{slice}}$ is collected.
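The priority-change detection and the two equations above can be sketched offline in Python. In the paper, the timestamps are collected in the data plane of the measurement switch; the functions below only model the post-processing, and their names and the `(arrival_ns, priority)` input format are assumptions made for this example.

```python
def detect_entries(frames):
    """Group frames into tGCL entries by detecting priority changes.

    frames: iterable of (arrival_ns, priority) in arrival order.
    Returns a list of (priority, t_first, t_last) tuples.
    """
    entries = []
    current = None
    for arrival_ns, priority in frames:
        if current is None or priority != current[0]:
            # A new priority closes the previous entry and opens the next one.
            if current is not None:
                entries.append(tuple(current))
            current = [priority, arrival_ns, arrival_ns]
        else:
            current[2] = arrival_ns  # update the last-frame timestamp
    if current is not None:
        entries.append(tuple(current))
    return entries


def slice_deviations(entries, configured_ns):
    """Measured duration minus configured duration for each detected entry."""
    return [(prio, (t_last - t_first) - configured_ns)
            for prio, t_first, t_last in entries]


frames = [(0, 0), (10, 0), (40, 0), (55, 1), (60, 1), (100, 1)]
deviations = slice_deviations(detect_entries(frames), configured_ns=50)
```

For the toy trace above, both entries come out shorter than the configured 50 time units, so both deviations are negative.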
The internal queue opening/closing delay identified in Section V-A2 causes queue state transitions to occur during a short interval instead of instantaneously. This may cause transitional behavior where queues of a tGCL entry are not yet closed while queues of the next tGCL entry have already begun forwarding. As a result, frames from two tGCL entries are transmitted simultaneously, violating the configured tGCL. This overlap is a phenomenon of P4-TAS, not an artifact of the measurement, and must be addressed. To mitigate this effect, we introduce gate switching intervals (GSIs) which are illustrated in Figure 15.
Gate switching intervals are short, explicit tGCL entries in which all queues are closed. They are inserted between tGCL entries. These GSIs suppress transitional forwarding behavior and isolate each tGCL entry. We configured GSIs of \qty 30 which was sufficient to eliminate overlap without significantly impacting available transmission time. While the worst-case queue opening delay measured in Section V-A reaches 63 ns, a \qty 30 GSI provides sufficient isolation because the GSI itself is subject to the same internal delays. This effectively extends the GSI’s duration and ensures that queue state transitions complete before the next scheduled entry begins. Larger GSIs did not improve the results.
First, we measured the deviation of the observed values from the configured duration of tGCL entries without introducing GSIs. The resulting statistics were inconsistent, with a mean deviation across all measurements of $\mu(\hat{\delta}_{\text{slice}}) = \qty{-22.8}{}$ and a median of \qty 450. These apparent deviations are not meaningful because consecutive entries frequently overlapped at their boundaries as explained in Section V-B3.
The deviation of the measured values from the configured duration of tGCL entries using GSIs is presented in Figure 16. The metric is computed per priority, i.e., for each $\pi$ and entry $i$ as $\hat{\delta}^{i,\pi}_{\text{slice}}$, and the histogram is shown aggregated across all priorities because the behavior is identical for all.
The measured distribution in Figure 16 shows two dominant modes: one around -60 ns and one around 30 ns, separated by a valley around the median of -19 ns. The bimodal distribution results from how delays of consecutive entries interact. A large delay at a boundary makes the current tGCL entry longer than configured, creating the positive cluster. The following tGCL entry then starts late and becomes shorter, creating the negative cluster. A deviation of exactly zero is unlikely since it would require two consecutive delays to be almost identical which is rare in practice. The overall median is slightly negative, reflecting that shortened entries occur somewhat more frequently. The minimum of -239 ns represents a rare worst case where one tGCL entry is extended by nearly the maximum possible delay and the neighboring tGCL entry experiences no delay and is consequently shortened by the same amount.
Scalability is a critical aspect for TSN and DetNet deployments which often involve large numbers of scheduled traffic streams. However, many scheduling algorithms overlook hardware resource constraints such as limited MAT capacity [ 8 ]. In this section, we evaluate the scalability of our P4-TAS implementation by analyzing the number of supported tGCL and sGCL entries, and the number of streams for DetNet and TSN stream identification.
Many TSN scheduling algorithms assume an unlimited number of GCL entries [ 8 ]. However, real hardware imposes strict limits due to finite memory resources which may make a schedule undeployable if exceeded. Therefore, we evaluate the number of tGCL and sGCL entries that can be stored in the proposed P4-TAS implementation. First, we analyze how many MAT entries are required per GCL entry. Then, we describe the available GCL sizes in P4-TAS.
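Table II (fragment; the table layout is not recoverable here): column headings include “Internal delays” and “No. streams”; surviving cells are $\Delta_{\text{internal},\max} = 86$ ns, Predict6G Open Source TSN Platform [ 25 ], and 10k entries.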
In P4-TAS, the tGCL is modeled as a MAT in the egress P4 control block which matches on the relative timestamp and on one of the eight queues. Therefore, for each tGCL entry, eight range MAT entries are required, i.e., one for each gate. The sGCL is modeled as a MAT in the ingress P4 control block which matches on the relative timestamp and on the stream gate identifier. Thus, for an sGCL entry, only a single range MAT entry is required.
Because range matching in the TNA is limited, a range-to-ternary conversion is employed to enable matching on the relative timestamp. The conversion algorithm described in Section IV-C2 replaces a single range MAT entry with multiple ternary MAT entries to increase the resolution of matched timestamps. However, this approach also increases the number of required MAT entries.
Let $w$ denote the number of bits used to represent the range. Gupta et al. [ 40 ] showed that a range of width $w$ bits can be transformed into at most $2 \cdot w - 2$ ternary entries. Consequently, in the worst case, modeling a tGCL containing $n$ tGCL entries results in up to $8 \cdot n \cdot (2 \cdot w - 2)$ ternary MAT entries, and an sGCL containing $n$ entries in up to $n \cdot (2 \cdot w - 2)$ ternary MAT entries. In practice, the actual number is often significantly lower due to favorable alignment. For example, if the range’s width is a power of two and properly aligned, a single ternary entry suffices. The tGCL configuration from Section V-B which used a period of \qty 400 divided into eight \qty 50 tGCL entries with additional \qty 30 GSIs required 1512 ternary MAT entries. Numerous studies have proposed optimized range-to-ternary conversion algorithms aimed at reducing ternary entry counts [ 49, 50, 51, 52 ], and these may be explored in future work.
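To give a feeling for the magnitude of this worst-case bound, the following Python sketch evaluates it for a 48-bit timestamp. The function name is hypothetical; the formula is the bound stated above, with the number of gates as a parameter so that the same function covers tGCLs (eight gates) and sGCLs (one stream gate).

```python
def worst_case_ternary_entries(n_entries: int, width_bits: int, gates: int = 8) -> int:
    """Upper bound on ternary MAT entries for a GCL with n_entries entries.

    Each range over width_bits bits expands into at most 2 * width_bits - 2
    ternary entries; a tGCL needs one range per gate and GCL entry.
    """
    return gates * n_entries * (2 * width_bits - 2)


per_range = worst_case_ternary_entries(1, 48, gates=1)  # one sGCL entry
per_tgcl_entry = worst_case_ternary_entries(1, 48)      # one tGCL entry, eight gates
```

For 48-bit timestamps, this gives at most 94 ternary entries per range and at most 752 per tGCL entry, which is a pessimistic figure compared with well-aligned schedules.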
The Intel Tofino™ 2 can generate up to 16 different periodic streams. Since one of those is required for the continuous TAS control traffic, P4-TAS can configure 15 streams for period-completion frames. Therefore, 15 different GCL periods can be configured which can be shared between PSFP and TAS.
The tGCL MAT in P4-TAS can hold 39,000 MAT entries which is a result of the available hardware resources. The MAT is therefore large enough to accommodate multiple tGCLs. We increased the size of the stream gate MAT of PSFP from 2048 in P4-PSFP [ 6 ] to 6000 MAT entries. This is possible because the implementation is ported to the Tofino™ 2 ASIC which has more resources available.
The ternary match operates on a 48-bit timestamp enabling resolutions of up to 78 h. While such a range exceeds practical requirements, reducing the number of matched bits has no impact due to internal hardware alignment.
These resource limits show that P4-TAS can support multiple tGCLs and sGCLs, ensuring deployability of realistic TSN schedules.
To evaluate the scalability of stream identification in our implementation, we analyze the structure and capacity of the MAT used for DetNet and TSN streams.
A single MAT handles both DetNet and TSN stream identification. It uses ternary keys consisting of the S-Label for DetNet streams and Ethernet destination address, VLAN ID, and IPv4 source and destination address for TSN streams [ 4 ]. The use of ternary matches enables wildcarding and aggregation. For example, an entry matching only on the S-Label enables DetNet-to-TSN translation while another matching on MAC destination and VLAN ID supports TSN-to-DetNet translation or TSN stream identification.
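The behavior of such a shared ternary table can be illustrated with a small Python model. This is only a sketch under stated assumptions: the field names (`s_label`, `eth_dst`, `vlan_id`, `ipv4_src`, `ipv4_dst`), the priority scheme, the stream handles, and the example addresses are all hypothetical and do not come from the P4-TAS source code. A field that is absent from an entry's key is treated as fully wildcarded.

```python
import ipaddress


def ip(addr: str) -> int:
    """Convert a dotted IPv4 address to an integer."""
    return int(ipaddress.IPv4Address(addr))


def matches(key: dict, pkt: dict) -> bool:
    """Ternary match: every (value, mask) pair in the key must agree with the packet."""
    for field, (value, mask) in key.items():
        if field not in pkt or (pkt[field] & mask) != (value & mask):
            return False
    return True


def identify(table: list, pkt: dict):
    """Return the stream handle of the highest-priority matching entry, or None."""
    best = None
    for priority, key, handle in table:
        if matches(key, pkt) and (best is None or priority > best[0]):
            best = (priority, handle)
    return best[1] if best else None


# Hypothetical table contents: (priority, key, stream handle).
TABLE = [
    # DetNet stream identified by its S-Label only (20-bit MPLS label).
    (30, {"s_label": (1000, 0xFFFFF)}, "detnet-stream-1"),
    # TSN stream identified by destination MAC and VLAN ID.
    (20, {"eth_dst": (0x01005E000001, 0xFFFFFFFFFFFF),
          "vlan_id": (100, 0xFFF)}, "tsn-stream-7"),
    # Aggregated identification: all destinations in 10.0.1.0/24 share one entry.
    (10, {"ipv4_dst": (ip("10.0.1.0"), 0xFFFFFF00)}, "tsn-aggregate"),
]

if __name__ == "__main__":
    print(identify(TABLE, {"s_label": 1000}))
    print(identify(TABLE, {"eth_dst": 0x020000000002, "vlan_id": 5,
                           "ipv4_src": ip("10.0.0.1"),
                           "ipv4_dst": ip("10.0.1.77")}))
```

The third entry shows why the number of identifiable streams can exceed the number of MAT entries: a single masked IPv4 prefix stands in for every stream whose destination falls inside it.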
The MAT supports 8196 entries which allows at least 8196 DetNet or TSN streams to be identified. In cases where IP-based identification is used, ternary aggregation can further increase the number of identifiable streams. A survey by Stüber et al. [ 10 ] reports deployments with up to 10,812 streams, indicating that our implementation can support realistic industrial-scale scenarios with appropriate use of wildcarding. These results show that the design is scalable and capable of supporting a number of streams typical in TSN/DetNet deployments.
In this section, we summarize and compare capabilities of several TAS-capable platforms, including our P4-TAS prototype on a P4-programmable ASIC. The overview is shown in Table II.
Similar to P4-TAS, the Predict6G open-source platform provides TAS and DetNet integration, but its documentation does not specify configurable time resolution, internal delay behavior, or scalability [ 25 ]. Commercial platforms such as NXP’s SJA1105TEL [ 33, 34 ], Microchip’s SparX-5i family [ 35 ] and PD-IES008 [ 36, 37 ], and Relyum’s RELY-TSN12 [ 48 ] provide hardware support for TAS and PSFP. However, publicly available specifications typically stop at time granularity, queue counts, or GCL sizes while omitting internal delay sources that ultimately determine schedule precision. Although these devices advertise nanosecond-level configuration granularity, our evaluation with P4-TAS shows that practical gate updates are constrained by internal delays in the range of tens of nanoseconds. This observation is further supported by Eppler et al. [ 9 ], who report internal TAS delays of approximately 2.6 µs in Relyum switches. It indicates that internal TAS timing effects are inherent to the mechanism itself rather than specific to P4-based implementations, and that they can be significantly larger in proprietary devices. Since such delays are rarely documented, the effective precision of commercial solutions is difficult to assess from datasheets alone. Further, because those delays are internal, they cannot be measured in commercial black-box switches.
Our P4-TAS prototype achieves a comparable time configuration granularity of 1 ns while also documenting measured internal delays. Specifically, we observed a worst-case internal delay for a tGCL entry of $\Delta_{\text{internal,max}} = 86\,\text{ns}$ in the evaluation. In contrast, vendor platforms either do not document such values (e.g., NXP SJA1105TEL, Microchip PD-IES008) or only disclose partial information (e.g., SparX-5i, which specifies a queue opening delay of $\delta_{\text{queue}} = \qty{512}{}$). The transparency in P4-TAS allows a more realistic assessment of achievable schedule precision. In terms of scalability, P4-TAS supports a larger number of flows ($\geq 8196$) and larger GCLs (39k entries for TAS and 6k for PSFP) than the commercial platforms. For GCL entries, the range-to-ternary conversion overhead, which we evaluated in Section V-C1, must be taken into account.
A further distinction is line-rate throughput. While most commercial TSN-capable switch ASICs target 1–25 Gb/s per port in automotive and industrial domains, P4-TAS operates at up to 400 Gb/s per port. This enables its use not only in TSN deployments but also in high-speed data center environments where integration with DetNet becomes relevant. Hence, P4-TAS extends the design space beyond today’s embedded and industrial use cases.
Overall, the comparison shows that commodity hardware already supports TAS and PSFP functionality, but vendors disclose little about their internal timing behavior. This lack of transparency makes it difficult to design schedules with nanosecond accuracy. P4-TAS fills this gap by explicitly characterizing internal delays, enabling more predictable and transparent use of TSN.
We presented P4-TAS, a P4-based implementation of the Time-Aware Shaper (TAS) for TSN on the Intel Tofino™ 2. To achieve periodicity of tGCLs, we leveraged a mechanism for periodic behavior in P4 switches using the internal packet generator as a clock source. Building on this foundation, we introduced a mechanism for precise queue state control for the TAS using an internally generated, continuous stream of TAS control traffic. P4-TAS also incorporates PSFP, where we improved our earlier P4-PSFP design by eliminating recirculation and increasing the GCL time resolution to the nanosecond scale using a range-to-ternary algorithm. Additionally, P4-TAS includes an MPLS/TSN translation layer, enabling TSN traffic shaping and policing to be applied to DetNet flows at line rates of up to 400 Gb/s. Beyond functional capabilities, our implementation provides transparent insights into internal timing behavior, which is rarely documented in commercial platforms.
Our evaluation covered three aspects. First, we identified and quantified undocumented internal delay sources, including traffic generator inaccuracy, queue opening delay, and TAS control frame delays. We identified a theoretical worst-case accumulated delay of about 86 ns for a tGCL entry, which is orders of magnitude smaller than the microsecond-scale gate transition delays reported for some commercial TSN switches [ 9 ]. Second, we externally measured the duration of tGCL entries and compared it to the configured duration. In this process, we found that the internal queue opening delay leads to transitional behavior in which the queues of one tGCL entry are not yet closed while the queues of the next tGCL entry have already begun forwarding. To mitigate this effect, we introduced gate switching intervals (GSIs), short explicit tGCL entries in which all queues are closed. Third, we analyzed scalability, demonstrating support for 39,000 tGCL entries and at least 8196 flows, covering the requirements of current industrial deployments.
Compared with existing ASIC- and FPGA-based TSN platforms, P4-TAS offers similar configurability in terms of time granularity, but additionally exposes internal delays that directly affect scheduling precision. This transparency allows schedules to be designed with awareness of hardware-induced deviations, something not possible with today’s black-box hardware. Moreover, P4-TAS supports line rates of up to 400 Gb/s per port and seamless DetNet/TSN translation, extending applicability from industrial and automotive networks to high-throughput environments such as data centers and carrier backbones.
Future work will focus on improving scalability, for example by optimizing range-to-ternary usage, and on investigating how delay characterization can be incorporated into scheduling algorithms to increase robustness against hardware-level variability. Further, we will explore the integration of P4-TAS into a PTP-synchronized multi-hop TSN testbed to validate gate scheduling and latency guarantees under realistic TSN-specific traffic patterns.