TSNBench:评测大模型在 TSN 领域能力的基准

TSNBench: Benchmarking LLM Proficiency in Time-Sensitive Networking · 2026-05-10

Simulation and testing · Scheduling algorithms · CC BY (public side-by-side display permitted)

本页提供英文原文段落与中文逐段译稿。译稿包含自动复核状态;标记为需人工复核的段落应回到 PDF/HTML 校对公式、表格和符号。

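Scope: full paragraph-by-paragraph comparison
Source: local English paragraphs + Chinese translation drafts
Coverage: 201/201 paragraphs translated
Chinese paragraph-by-paragraph translation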
P001已复核

近年来,大语言模型(LLMs)在工程(Jackson et al., 2025; Guo et al., 2025)、医学(Xie et al., 2025; Liu et al., 2023; Li et al., 2024)、临床实践(Kweon et al., 2024)、计算机网络(Sharma and Yegneswaran, 2023)、电信(Maatouk et al., 2026; Ferrag et al., 2026; Oluwaseyi et al., 2025; Gajjar et al., 2025)以及自动化(Shen et al., 2024)等不同领域取得的最新进展,已经显示出在协助工程师、从业者、研究人员(Huang et al., 2023; Sun et al., 2024)和医生解决现实世界问题方面的突破性性能。系统工程师正越来越多地使用 LLM 来设计和配置网络(Wang et al., 2024a)、生成代码以及分析网络日志。由此,它们正在进入新的领域:自动驾驶车辆、航空航天(Fiori et al., 2024; Sanchez-Garrido et al., 2021)、国防(Elliott, 2023)和工业通信(Zhang et al., 2024)等安全关键应用领域。在这些语境中,LLM 的准确性、可靠性和一致性远不只是排行榜指标,因为它们已经成为工程需求。

术语 LLM、safety-critical application domains、leaderboard metrics、engineering requirements 已保留或准确转译;引用年份和作者未改动;逻辑上从通用进展过渡到安全关键场景,再强调指标性质变化,未发现明显问题。

P002已复核

时间敏感网络(Time-Sensitive Networking, TSN)(31)由 IEEE 802.1 工作组(Working Group, WG)标准化,是一种第 2 层以太网技术,为安全关键应用提供确定性通信保证。TSN 部署通常根据时序关键性来区分流量。具有保证时延和有界抖动的安全关键周期性通信被归类为时间触发(time-triggered, TT)(Ademaj et al., 2019)流量,并使用 IEEE 802.1Qbv 定时门控机制进行服务。TT 传输由门控控制列表(Gate-Control List, GCL)控制,该列表使用精确方法离线计算,例如基于 SMT 的综合(Craciunas et al., 2016),或使用启发式方法计算(Pop et al., 2016; Gavriluţ et al., 2018; Bujosa et al., 2022)。相比之下,需要有界端到端时延但抖动控制要求较不严格的周期性或偶发性通信,被归类为音视频桥接(Audio Video Bridging, AVB)流量(Böhm and Wermser, 2021; Bruckner et al., 2019)。因此,数十或数百微秒的最坏情况时延(Worst-Case Delay, WCD)估计误差是显著的,因为它们可能消耗时序裕量、违反截止期限,或导致不可行的 TSN 配置。在任务关键型部署中,这类误差可能产生严重后果。例如,配置错误的 TSN 网络可能导致机械臂错过关键装配步骤、制动系统在高速公路上失效、飞行器控制系统作出错误响应、防御机制崩溃,或航天器错过关键信号。这些失效可能源于单个错误配置导致的亚毫秒级时序违例。这些风险凸显了 TSN 系统中准确分析和配置的重要性,尤其是在 LLM 正日益集成到网络管理工作流中的情况下。因此,必须严格评估它们的领域熟练程度。然而,据我们所知,现有基准尚未评估 LLM 在 TSN 方面的熟练程度。

TSN、IEEE 802.1 WG、layer-2 Ethernet、TT、GCL、SMT、AVB、WCD 等术语均已保留缩写并中文解释;“tens or hundreds of microseconds”“sub-millisecond”数字量级未遗漏;“sporadic”译为“偶发性”符合网络流量语境;(31)可能是参考编号而非标准号,已原样保留;未发现明显问题。

P003已复核

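To fill this gap, we present TSNBench, the first benchmark for evaluating LLM proficiency in TSN, comprising two complementary evaluation components. The first is an expert-validated multiple-choice question and answer (MCQA) dataset of 939 questions, generated from 83 peer-reviewed research papers using three LLMs from different model families and rigorously reviewed by five domain experts, each with more than eight years of TSN research experience. The second is a set of open-ended questions that require multi-step WCD computation for two widely deployed TSN mechanisms, the Credit-Based Shaper (CBS) (32) and Cyclic Queuing and Forwarding (CQF) (33; Yan et al., 2020), across different network topologies and traffic flows. Ground truth for CBS is computed with a validated Network Calculus (NC) solver (Zhao et al., 2018), and ground truth for CQF with closed-form mathematical bounds (Wang et al., 2023). These open-ended WCD questions are intended as a closed-book stress test of standalone model capability, not as a deployment workflow for free-text LLM timing outputs. Detailed background on TSN, NC, CBS, and CQF is given in Appendices 7, 8, 9, and 10, respectively.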

939、83、3 个 LLM、5 名专家、8 年以上经验等数字完整保留;CBS、CQF、NC、WCD 术语一致;“closed-book stress test of standalone model capability”译为“独立模型能力的闭卷压力测试”准确;“free-text LLM timing outputs”含义较专业,译为“自由文本 LLM 时序输出”可接受;未发现明显问题。

P004已复核

尽管 MMLU(Hendrycks et al., 2021)和 MMLU-Pro(Wang et al., 2024b)等通用基准会评估跨越初等数学、历史和法律的广泛学科知识,但它们从根本上并不适合安全关键领域的特定领域评估。回答一道关于小学历史的多项选择题,与回答 TSN 术语问题并在给定网络拓扑下依据 NC 约束正确计算 WCD,在类别上是不同的。如果没有一个能够捕捉这种差异的基准,就没有一种有原则的方法来衡量 LLM 在确定性网络领域中的进展。TSNBench 正是为了揭示这一空白而设计的。

MMLU、MMLU-Pro、NC、WCD 等名称和缩写保留;“categorically different”译为“在类别上是不同的”忠实但略直译,符合逐字逐句要求;逻辑为通用基准不足、任务性质不同、需要专门基准,未发现明显问题。

P005需人工复核

我们评估了 16 个 LLM,涵盖开源和闭源模型,以及通用架构和专门面向推理的架构。我们的结果揭示了一种显著的脱钩现象:模型在 MCQA 上达到 67% 到 95% 的准确率,却在开放式 WCD 计算上出现实质性失败。表现最佳的模型 GPT-5 在 CBS 上取得了 36.2% 的平均绝对百分比误差(Mean Absolute Percentage Error, MAPE),而大多数模型超过 80%。在一个数十微秒的时序违例,甚至只是 1000 μs 截止期限的 1%,都可能导致系统失效的领域中,这一点令人担忧。

16、67 到 95%、36.2%、80%、1000 μs、1% 等数字和单位保留;原文存在 “1000 μ \mu s deadline” 的疑似 LaTeX/抽取重复或残缺,译文按 1000 μs 处理;因公式/符号抽取存在风险,需人工核对原 PDF。

P006需人工复核

我们的关键贡献如下:1. 第一个由专家验证的 TSN 基准:TSNBench 通过 939 道由专家验证、源自同行评议 TSN 文献的 MCQ,评估 LLM 对 TSN 机制的知识。2. 开放式时序分析任务:TSNBench 包含针对 CBS 和 CQF 的开放式 WCD 计算任务,其中 CBS 的真值使用经过验证的 NC 求解器计算,CQF 的真值使用闭式数学上界计算。3. 跨 16 个 LLM 的评估:我们评估了开源模型和闭源模型,包括通用模型和专门面向推理的模型,并表明高 MCQA 准确率并不能可靠预测准确的 WCD 计算。

本段为贡献列表压成一个段落,1、2、3 三项均完整保留;939、16 等数字正确;MCQ 与前文 MCQA 略有不同,按原文保留 MCQ;“closed-form mathematical bounds”译为“闭式数学上界”,可理解但“界”可能需结合上下文确认为“上界”;需人工核对是否应译为“闭式数学上界”。

P010已复核

总之,TSNBench 为研究界提供了第一个用于评估 LLM 在 TSN 中熟练程度的严格评估资源,为正在探索 LLM 辅助 TSN 管理的实时网络社区,以及试图理解 LLM 在安全关键、计算要求高的领域中的局限性的机器学习社区,都提供了有价值的洞见。

“research community”“real-time networking community”“machine learning community”三类对象关系保留;“safety-critical, computationally demanding domains”译为“安全关键、计算要求高的领域”准确;未发现明显问题。

P011已复核

基准测试和数据集对于衡量 LLM 的进展以及识别关键差距和局限性至关重要(Hendrycks et al., 2021; Wang et al., 2024b)。诸如 MMLU(Hendrycks et al., 2021)和 MMLU-Pro(Wang et al., 2024b)这样的通用知识基准,通过多项选择题来评估广泛的学科知识,包括初等数学、历史、计算机科学和法律。特定领域基准已经将这一范式扩展到医学(Xie et al., 2025; Liu et al., 2023; Li et al., 2024)、临床实践(Kweon et al., 2024)、法律(Guha et al., 2023)、代码生成(Hua et al., 2025; Huang et al., 2024)以及科学研究(Sun et al., 2024)。尽管这些基准推动了显著进展,但它们并非为评估安全关键网络任务而设计。大多数基准依赖多项选择评估,并且没有任何一个基准评估模型是否能够执行安全关键网络领域所需的多步骤计算推理。TSNBench 通过引入 MCQA 和开放式 WCD 计算问题来填补这一空白,其标准答案由最先进的 NC 求解器验证,从而提供了现有通用基准均无法涵盖的 TSN 评估。

术语 MCQA、WCD、NC 保留缩写;“safety-critical networking tasks”译为“安全关键网络任务”符合技术语境;引用、数字、因果逻辑均保留。未发现明显问题。

P012已复核

在过去几年中,已有若干基准评估了 LLM 在网络和电信领域的能力。TeleQnA(Maatouk et al., 2026)提出了一个用于电信领域的 MCQ 数据集,该数据集由研究文档和 3GPP 标准生成,并由领域专家验证。6G-Bench(Ferrag et al., 2026)提出了一个基于 MCQ 的 6G 网络数据集,包含 3,722 个难题,这些问题通过自动化过滤和人工专家审查得到验证。除问答基准之外,NetConfEval(Wang et al., 2024a)在网络配置任务上评估 LLM,并表明 LLM 能够简化和自动化复杂的网络管理任务。

MCQ、3GPP、TeleQnA、6G-Bench、NetConfEval 等名称保留;数字 3,722 准确;“validated through automated filtering and expert human review”译义完整。未发现明显问题。

P013已复核

LLM 在 TSN 管理和编排中的应用仍处于非常早期的阶段,目前仅有有限的初步研究。Windmann et al.(2025)探索了使用 LLM 配置混合 5G/TSN 网络的方法,即通过协助用户完成手动配置任务,并在 5G-TSN 网络中建议配置。然而,这项工作仍属初步研究,并未提供实验结果。总体而言,既有工作没有提供一个跨 TSN 机制系统评估 LLM 能力的基准或严格评估,也没有评估用于 WCD 分析的计算推理能力。TSNBench 通过提供首个结构化基准来填补这一空白,该基准同时覆盖通过 MCQA 考察的陈述性 TSN 知识,以及通过开放式 WCD 评估考察的计算推理。

“management and orchestration”译为“管理和编排”;“declarative TSN knowledge”译为“陈述性 TSN 知识”较贴近机器学习评测语境;WCD 分析与开放式 WCD 评估逻辑准确。未发现明显问题。

P014已复核

不同于医学(Xie et al., 2025)、5G(Oluwaseyi et al., 2025; Maatouk et al., 2026)、通用人类知识(Phan et al., 2026; Hendrycks et al., 2021; Wang et al., 2024b)、编码(Hua et al., 2025; Huang et al., 2024)以及法律(Guha et al., 2023)等成熟领域,目前不存在用于 LLM 评估的开源 TSN 数据集(Zhang et al., 2024; Peng et al., 2023; Zanbouri et al., 2025; Adil et al., 2026)。正如(Liu et al., 2023)所强调的,数据来源决定了数据集的可靠性,而生成高质量数据集是进行有意义基准测试的关键前提。我们在下文描述 TSNBench 的构建流水线,完整细节见附录 11。

“established domains”译为“成熟领域”;“open-source TSN dataset”译为“开源 TSN 数据集”;附录编号 11 保留。未发现明显问题。

P015已复核

已发表的研究论文和标准是构建特定领域数据集最可靠的来源之一(Liu et al., 2023)。由于 TSN 知识主要来源于同行评审研究和 IEEE 802.1 TSN 标准,我们整理了一组开放获取研究文档作为源语料库。为避免版权问题,并排除结果不正确或方法存在缺陷的论文,我们仅纳入已发表的开放获取论文。对于没有开放获取版本的论文,我们使用已经发表或已被接收的 arXiv 版本,排除结果未经验证的未发表预印本。在可能的情况下,我们还收集具有适当署名的作者手稿版本。为确保质量,我们优先选择来自声誉良好会议或期刊的高被引论文,同时考虑发表时间线,因为近期论文自然具有较少引用。总计,我们收集了 83 篇研究论文,覆盖范围广泛的 TSN 机制,包括 Time-Aware Shaper(TAS)、CBS、CQF、基于 NC 的可调度性分析、性能评估、硬件实验、组合整形器,例如 TAS+CBS(Zhao et al., 2022),以及 Multi-CQF(Alexandris et al., 2022)。关于 TSN、相关工作及其机制的详细背景见附录 7。

IEEE 802.1 TSN、arXiv、TAS、CBS、CQF、NC、TAS+CBS、Multi-CQF 均保留;“venues”译为“会议或期刊”是合理意译;数字 83 和附录 7 准确。未发现明显问题。

P016已复核

与其他通信领域类似(Andrews et al., 2014; Saad et al., 2020; Ma et al., 2019),TSN 使用专门词汇。一个成功理解 TSN 的 LLM 应当能够对 TSN 术语进行正确推理。不能区分 TAS 和 CBS,或不能正确展开 TSN 特定缩写的模型,不能被认为精通 TSN。为捕捉这一维度,我们提取 TSN 文献中广泛使用的关键词和缩写,并用它们指导 MCQA 生成。如表 1 所示,所有术语均使用 Claude Sonnet 4 从 83 篇研究文档中提取,并以 JSON 格式存储。每篇文档都经过预处理,以移除非相关内容,包括作者姓名、单位、图、表、URL 和伪代码。我们指示模型仅提取文档内部定义的术语,不依赖预训练知识,并从来源中提供每个术语的缩写、全称以及一到两句话的定义。随后,由领域专家审查提取出的集合以解决重复项;在存在冲突的情况下,保留较长的定义。图 1 展示了这一流水线。

“pretrained knowledge”译为“预训练知识”;“one-to-two-sentence definition”译为“一到两句话的定义”;表 1、图 1、83 篇均准确。未发现明显问题。

P017已复核

为优化时间并减少人工工作量,我们采用基于 LLM 的方法从研究文档生成 MCQA。关键词文件与研究文档一起作为额外输入提供,在生成过程中作为独立来源来补充研究论文内容。我们使用来自不同模型家族的三个模型,即 Claude Sonnet 4、GPT-4o mini 和 Llama 3.1 70B,如表 1 所示。这些模型是有意选择的,以确保多样化的风格和推理能力,从而减少生成偏差。所有模型使用相同的系统提示词,并且每篇研究论文以轮转方式恰好分配给一个模型。在生成之前,每篇文档中的非相关部分都会被移除,例如作者信息、单位、参考文献、URL、图、表和伪代码。

“round-robin manner”译为“轮转方式”;三个模型名称准确;“exactly one model”译为“恰好分配给一个模型”保留限定。未发现明显问题。
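The round-robin assignment described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the paper identifiers and the order of the three models are assumptions, and only the "exactly one model per paper, in rotation" rule comes from the text.

```python
from itertools import cycle

# Generator models named in the paragraph; the rotation order is an assumption.
MODELS = ["Claude Sonnet 4", "GPT-4o mini", "Llama 3.1 70B"]


def assign_round_robin(papers, models):
    """Map every paper to exactly one model, cycling through the models in order."""
    return {paper: model for paper, model in zip(papers, cycle(models))}


# Hypothetical identifiers standing in for the 83 source papers.
papers = [f"paper_{i:02d}" for i in range(83)]
assignment = assign_round_robin(papers, MODELS)
```

With 83 papers and three models, two models receive 28 papers and one receives 27, so the generation load is nearly balanced across families.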

P018已复核

LLM 生成的 MCQA 不能直接用于基准测试,因为它们可能包含表述不正确的问题、不完整的选项,或含糊且不正确的答案选项。为处理生成模型引入的位置偏差,在人工专家审查之前,答案选项会被随机打乱,并更新正确答案标签以反映新的排序。

“positional bias”译为“位置偏差”;“correct answer label”译为“正确答案标签”;逻辑为先随机打乱再专家审查,准确。未发现明显问题。
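The shuffling step can be illustrated with a short sketch. The function name, the option texts, and the fixed seed are hypothetical; the only behavior taken from the paragraph is that options are randomly reordered and the correct-answer label is updated to match the new order.

```python
import random


def shuffle_options(options, correct_index, rng):
    """Shuffle answer options and return (shuffled_options, new_correct_index)."""
    order = list(range(len(options)))
    rng.shuffle(order)  # order[k] is the original index now shown at position k
    shuffled = [options[i] for i in order]
    new_correct_index = order.index(correct_index)
    return shuffled, new_correct_index


# Hypothetical question with four options; "CBS" (index 1) is the correct answer.
rng = random.Random(0)
options = ["TAS", "CBS", "CQF", "ATS"]
shuffled, new_index = shuffle_options(options, 1, rng)
```

Tracking the permutation explicitly, instead of searching for the answer text afterwards, keeps the label correct even if two options happen to share the same wording.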

P019已复核

鉴于 TSN 的安全关键性质,严格的人工验证是必要的。我们邀请了五位 TSN 领域专家:三位拥有超过 15 年研究经验的资深教授,以及两位拥有超过 8 年专业经验的博士后研究人员。每个问题都以四种结果进行独立评估:(i)接受 - 正确且清晰;(ii)修改 - 需要为清晰性或正确性进行修改;(iii)拒绝 - 问题不正确、具有误导性或不相关;或(iv)存疑 - 专家不确定,并将其交给其余评审者以达成共识。未达成共识的问题会被丢弃。完整审查标准见附录 11.1 和表 5。表 2 总结了数据集统计信息,图 2 展示了完整流水线。

五位专家构成、15 年、8 年均准确;四类评估结果完整;“doubtful”译为“存疑”;附录 11.1、表 5、表 2、图 2 均保留。未发现明显问题。

P020需人工复核

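MCQA evaluates declarative TSN knowledge, whereas the open-ended questions evaluate whether an LLM can perform the multi-step mathematical reasoning required in real TSN deployments. We evaluate WCD computation because WCD is a core key performance indicator (KPI) in TSN network design and directly determines whether a network meets its strict timing requirements. We select two TSN mechanisms for this evaluation: CBS and CQF. CBS is widely deployed for audio-video traffic and requires NC-based analysis, which makes it mathematically demanding. CQF is a more recently standardized TSN mechanism whose WCD can be computed with a closed-form equation given the routing and the cycle duration \(T\), providing a complementary evaluation that isolates formula application from NC complexity. Together, the two mechanisms cover a practically meaningful range of WCD computation difficulty. Ground-truth WCD values for CBS are computed with a validated state-of-the-art NC tool (Zhao et al., 2018), and those for CQF with closed-form mathematical bounds. We release all ground-truth WCD values together with the questions to support future open-source community evaluation. Each open-ended question is formulated by domain experts, as shown in Figure 3, and consists of three components: network topology, flow information, and flow routing. TSNBench uses three topologies to cover a broad range of scenarios: (i) a one-switch topology (Figure 15), (ii) a medium-sized mesh topology (Figure 16), and (iii) a ring topology representing industrial networks (Figure 17). Each topology consists of end nodes and switches connected by Ethernet links, and unicast traffic flows travel from a sender to a single receiver. Flows consist of Ethernet frames whose maximum payload is limited by the maximum transmission unit (MTU). Further topology, flow, and routing details are given in Appendix 11.4.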

“routing and cycle duration (T T)”中的 “T T” 可能是公式识别或排版残缺,需人工核对原文公式;CBS、CQF、WCD、KPI、NC、MTU 均保留;图 3、图 15、图 16、图 17、附录 11.4 和 Zhao et al., 2018 准确。因存在疑似公式/符号残缺,需人工复核。

P021已复核

对于 MCQA 和开放式评估,每个提示都将模型的角色定义为 TSN 专家。对于 MCQA,我们使用零样本提示,不提供上下文内示例,这代表了一种保守方法,用于衡量模型内在的 TSN 熟练程度,确保输出性能反映的是模型的领域知识,而不是上下文内模式匹配。对于开放式问题,我们同样使用零样本设置,不提供示例 WCD 计算,也不提供 NC 或 CQF 方程,确保模型独立回忆并应用正确的计算方法。对于这两类问题,模型都被要求在给出答案的同时提供置信度分数。

Terms MCQA, TSN, WCD, NC, and CQF are retained; the meanings of "zero-shot prompting", "in-context examples", and "confidence score" are complete. No risk found in numbers, formulas, or abbreviations.

P022已复核

开放式提示由三个可变组成部分构成:网络拓扑、流参数以及预先计算的最短路径路由。对于每种机制下的全部 100 个开放式评估实例,均使用相同的提示模板,仅这三个组成部分发生变化。固定的网络常量在整个过程中保持不变,以确保不同模型和不同实例之间具有可比性。关于开放式提示设计的详细讨论见附录 11.3。

The number 100 and Appendix 11.3 are retained; "per mechanism" is rendered as “每种机制下”, which is semantically accurate. No obvious issues found.

P023需人工复核

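For the MCQA dataset, performance is measured as the percentage of correctly answered questions and reported as accuracy. For the open-ended questions, we assess computational reasoning by comparing each model's predicted WCD values with the ground truth. For CBS, the ground-truth WCD values are derived with NC-based Total Flow Analysis (TFA). Specifically, for a flow \(f\) in the set \(\mathcal{F}_{M_i}^{h}\) of flows of the same priority \(M_i\) aggregated at \(h\), the worst-case delay bound \(D_f^h\) equals the worst-case delay bound \(D_{M_i}^h\) of all priority-\(M_i\) flows aggregated at \(h\), that is, \[ D_f^h = D_{M_i}^h = hDev(\alpha_{M_i}^{h}, \beta_{M_i}^{h}) = \sup_{t \geq 0}\left\{\inf\left\{\tau \geq 0 \mid \alpha_{M_i}^{h}(t) \leq \beta_{M_i}^{h}(t+\tau)\right\}\right\}, \] where \(\alpha_{M_i}^{h}(t)\) is the arrival curve of the aggregate of priority-\(M_i\) flows through \(h\), and \(\beta_{M_i}^{h}(t)\) is the service curve for those flows. The end-to-end WCD of a flow is obtained by summing the per-port delay bounds along its route. The full NC methodology and proofs are given in Appendix 8.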

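The source formula shows obvious extraction noise and duplicated typesetting, such as \(D f h D_{f}^{h}\), \(h h\), and repeated delimiters and formulas; the translation reconstructs it from the recognizable LaTeX. Terms CBS, NC, TFA, WCD, arrival curve, and service curve are retained or accurately translated. Because the formula comes from noisy text, it must be checked manually against the paper PDF.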

P024已复核

对于 CQF,最坏情况端到端时延由闭式表达式给出: \[ \mathrm{WCD}=f_i.\phi+(\mathrm{SW_{num}}+1)\cdot \mathrm{T}+\xi, \] 其中,\(f_i.\phi\) 是源节点处的流偏移量,单位为 \(\mu s\);\(\mathrm{SW_{num}}\) 是该流路由上的交换机数量;\(\mathrm{T}\) 是周期持续时间,单位为 \(\mu s\);\(\xi\) 表示网络特定时延,包括处理时延、传播时延、交换时延以及时间同步误差。该上界的推导和证明见附录 10。

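The formula, the variables \(f_i.\phi\), \(\mathrm{SW_{num}}\), \(\mathrm{T}\), \(\xi\), and the unit \(\mu s\) are retained. "f i. ϕ" is typeset irregularly in the source, but its meaning can be confirmed from the LaTeX. No obvious issues found.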
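The closed-form CQF expression above translates directly into code. The sketch below is illustrative only: the function and parameter names are hypothetical, and the numeric values in the example are made up rather than taken from a TSNBench test case.

```python
def cqf_wcd_us(offset_us, num_switches, cycle_us, network_delay_us):
    """WCD = f_i.phi + (SW_num + 1) * T + xi, with every quantity in microseconds."""
    if num_switches < 0:
        raise ValueError("num_switches must be non-negative")
    if cycle_us <= 0:
        raise ValueError("cycle_us must be positive")
    return offset_us + (num_switches + 1) * cycle_us + network_delay_us


# Hypothetical flow: zero offset, 3 switches on the route, T = 100 us, xi = 5 us.
example_wcd = cqf_wcd_us(offset_us=0.0, num_switches=3, cycle_us=100.0, network_delay_us=5.0)
# 0 + (3 + 1) * 100 + 5 = 405 us
```

Because the bound depends only on the hop count, the cycle duration, and two additive terms, an error in a model's CQF answer usually points to a miscounted route or a misapplied formula, not to numerical difficulty.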

P025已复核

我们评估了 16 个最先进的 LLM,覆盖开源和闭源模型,并涵盖通用架构与推理专用架构。附录 12 的表 6 给出了模型 ID 和所属组织在内的完整模型列表。所有模型均通过其各自官方供应商 API 访问,且未进行任何微调:GPT(OpenAI API)、DeepSeek(DeepSeek API)、Mistral(Mistral AI API)、Claude(Anthropic API)、Gemini(Google AI API)、Grok(xAI API),以及 Llama 和 Qwen(Hugging Face inference router)。所有客户端操作,包括提示构造、API 处理、响应解析和指标计算,均在一台标准工作站上执行。为评估可重复性和随机性,每个 MCQA 问题和开放式问题都在两种温度设置下评估三次:确定性设置(T = 0.0)和随机性设置(T = 0.7)。由于 TSN 广泛用于安全关键领域,确定性响应是必要的,因为非确定性会削弱基于 LLM 的 TSN 推理的可靠性。对于不公开 temperature 参数的模型,评估使用供应商默认配置,如表 5 所注明。完整的成本和延迟细节见附录 12 的表 8。

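Numbers 16, three runs, T = 0.0, T = 0.7, Tables 5/6/8, and Appendix 12 are retained; API names and model family names are not mistranslated. Keeping temperature as a parameter name is clearer in the Chinese text. No obvious issues found.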

P026已复核

由于 MCQ 是使用评估中包含的模型家族生成的,如表 1 所示,污染是一个潜在问题。因此,我们将被评估模型划分为生成器家族(Claude、GPT、Llama)和非生成器家族(所有其余模型),并比较它们的平均 MCQA 准确率。生成器家族模型达到 88.8% 的平均准确率,而非生成器家族模型达到 91.0%。生成器家族模型的表现并不优于非生成器家族模型,因此我们没有观察到系统性优势的证据。该分析并不能排除所有可能的污染路径,但它处理了这一特定担忧。开放式计时任务受到影响的可能性较低,因为其拓扑、流和路由输入是专门为 TSNBench 构建的。

The MCQ/MCQA distinction is retained; 88.8% and 91.0% are correct; "contamination pathways" is rendered as “污染路径”, which fits the benchmarking context. No obvious issues found.

P027已复核

评估指标:模型在 MCQA 数据集上的性能使用准确率衡量,定义为 939 个问题中被正确回答的问题百分比,并在三次运行上取平均。我们还报告 Expected Calibration Error(ECE)(Pavlovic, 2025)和 Brier score(Hoessly, 2026),用于评估模型表达的置信度与其实际正确性之间的一致性。在 TSN 这类安全关键领域中,校准尤其关键,因为高置信度的错误答案可能导致误导性的配置决策、截止期限违背,或工业与汽车系统中的网络不稳定。因此,我们还评估 Confidently Wrong(CW)率,用于确定错误答案中模型表达高置信度(\(\geq 0.8\))的比例。所有校准指标均在完整的 939 道 MCQA 数据集上,按每个模型三次运行计算。

939, three runs, \(\geq 0.8\), and citation years are retained; ECE, Brier score, and CW are kept in English as metric names and explained, to avoid ambiguity. No obvious issues found.
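The three calibration metrics named above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the ECE uses ten equal-width confidence bins, which the text does not specify, and the sample confidences and correctness flags are invented. Only the 0.8 threshold for the Confidently Wrong rate comes from the paragraph.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 correctness outcome."""
    return sum((c - float(y)) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins (assumed binning)."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, float(y)))
    total = len(correct)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def confidently_wrong_rate(confidences, correct, threshold=0.8):
    """Share of wrong answers whose stated confidence is at least the threshold."""
    wrong = [c for c, y in zip(confidences, correct) if not y]
    if not wrong:
        return 0.0
    return sum(1 for c in wrong if c >= threshold) / len(wrong)


# Invented responses: two correct answers and two wrong answers given with high confidence.
confidences = [0.92, 0.92, 0.65, 0.97]
correct = [True, False, True, False]
```

On this toy input both wrong answers carry confidence above 0.8, so the CW rate is 1.0 even though half the answers are right, which mirrors the distinction the section later draws between aggregate calibration and safety-relevant calibration.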

P028已复核

结果与讨论:表 5 报告了全部 16 个模型的准确率、平均(avg.)一致性、校准情况和平均延迟。表现最好的模型是 Claude Sonnet 4.5(95.3%)和 GPT-5(95.0%),其中 Claude Sonnet 4.5 还取得了最低的 Brier score(0.0429),表明其准确率和校准都很强。Llama 3.2 1B 的准确率最低(67.4%),这与其参数量相比其他模型显著更小相一致。

Table 5, 16, 95.3%, 95.0%, 0.0429, and 67.4% are retained; "avg. consistency" is rendered as “平均一致性”. No obvious issues found.

P029已复核

一个值得注意的发现来自推理模型。尽管 o3、GPT-5 和 DeepSeek-V3.2(Thinking)具有更强的一般推理能力,但它们在 MCQA 上并未优于最好的非推理模型,分数均低于 Claude Sonnet 4.5。这表明,TSN MCQA 性能主要由领域知识驱动,而不是由一般推理能力驱动,并且推理专用架构在陈述性知识检索任务上的优势有限。

Model names o3, GPT-5, DeepSeek-V3.2 (Thinking), and Claude Sonnet 4.5 are retained; "declarative knowledge retrieval tasks" is rendered as “陈述性知识检索任务”, which is semantically accurate. No obvious issues found.

P030已复核

校准结果揭示了不同模型之间的关键差异。虽然大多数模型校准良好(ECE < 0.06),但 o3 尽管准确率达到 94.7%,却具有最高的 ECE(0.1874),同时又取得最低的 CW 率(3.4%),很少对错误答案赋予高置信度(参见图 5)。相比之下,许多非推理模型的 CW 率为 100%,会对错误答案赋予高置信度。Mistral Medium 3.1 具有最高的平均置信度(0.9779),同时保持 92.1% 的准确率。所有模型的拒答率均为零,表明 MCQA 数据集不会触发响应拒绝。

ECE < 0.06, 94.7%, 0.1874, 3.4%, 100%, 0.9779, and 92.1% are retained; the reference to Figure 5 is retained. No obvious issues found.

P031已复核

图 6 展示了在 MCQA 数据集上对全部 16 个被评估模型得到的可靠性图。

术语“reliability plot”译为“可靠性图”合理;数字 16、数据集 MCQA 保留无误;未发现明显问题。

P032已复核

每个图都显示了观测到的准确率与模型表达出的置信度之间的关系,并按置信度范围进行分箱。一个校准完全理想的模型会落在灰色虚线对角线上。这意味着模型的置信度将与其实际准确率完全一致。红色阴影区域表示过度自信,即模型的置信度超过其实际准确率。绿色阴影区域表示自信不足,即模型的实际准确率高于其表达出的置信度所暗示的水平。

“binned”译为“分箱”符合校准图语境;overconfidence/underconfidence 分别译为“过度自信/自信不足”;逻辑关系完整;未发现明显问题。

P033已复核

在安全关键型 TSN 部署中,过度自信比自信不足危险得多。一个答案错误但表达出高置信度的模型,可能会用错误的 WCD 估计值或配置错误的调度参数误导网络工程师。相比之下,一个自信不足的模型如果在正确答案上表达不确定性,则会促使进行额外验证。

safety-critical 译为“安全关键型”;WCD 保留为术语;因果与对比逻辑完整;未发现明显问题。

P034已复核

无论其实际准确率如何,大多数被评估模型都位于高置信度区域(0.8 到 1.0)。这表明这些模型往往表现出过度自信。

数值范围 0.8 到 1.0 保留无误;“regardless of”逻辑已体现;未发现明显问题。

P035已复核

Grok 4.1 Fast (NR)、Mistral Medium 3.1、Mistral Large 3 和 Ministral 3 8B 的 CW 率达到 100%,这意味着所有错误答案都落在高置信度范围内。这代表了 TSN 部署中最关键的校准行为。GPT-4o、Gemini 2.5 Flash、Llama 3.2 1B 和 Qwen3 8B 同样表现出超过 95% 的 CW 率。一个值得注意的例外是 o3,它是唯一一个主要落在绿色自信不足区域的模型,CW 率仅为 3.4%。尽管 o3 在所有被评估模型中具有最高的 ECE(0.1874),但从校准角度看,它是被评估模型中最安全的,因为它很少在错误的 MCQA 答案上表达高置信度。这凸显了聚合校准指标与安全相关校准行为之间的一个重要区别。DeepSeek-V3.2 (NT) 达到最低的 ECE(0.0105),表明其整体校准较强,但仍保持 96.4% 的 CW 率,这说明较低的 ECE 并不能保证安全且现实的置信度行为。

模型名、CW、ECE、MCQA 均保留;数字 100%、95%、3.4%、0.1874、0.0105、96.4% 无误;“safest”译为“最安全”仅限校准角度,已按原文限定;未发现明显问题。

P036需人工复核

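Evaluation metrics: For the open-ended questions, we report two widely used metrics, the mean absolute error (MAE) and the mean absolute percentage error (MAPE), computed per test case (TC). Each TC consists of \(n\) flows, denoted \(f_i\) with \(i = 1, \ldots, n\). For each flow \(f_i\), \(\hat{y}_{{TC}_{x},f_i}\) is the model's predicted WCD of \(f_i\) in \({TC}_x\), and \(y_{{TC}_{x},f_i}\) is the ground-truth WCD of flow \(f_i\) in the TC with index \(x\); the ground truth is computed with a validated NC solver for CBS and with Eq. 22 for CQF. The MAE of each TC is defined as: \[ \text{MAE}_{{TC}_{x}}=\frac{1}{n}\sum_{i=1}^{n}|\hat{y}_{{TC}_{x},f_i}-y_{{TC}_{x},f_i}|, \] where \(x\) is the TC index and \(x\in\{1,\ldots,100\}\). The MAPE of each TC is defined as: \[ \text{MAPE}_{{TC}_{x}}=\frac{1}{n}\sum_{i=1}^{n}\frac{|\hat{y}_{{TC}_{x},f_i}-y_{{TC}_{x},f_i}|}{y_{{TC}_{x},f_i}}\times 100. \] The overall MAE and MAPE of a model are obtained by averaging over all 100 TCs: \[ \text{MAE}=\frac{1}{100}\sum_{x=1}^{100}\text{MAE}_{{TC}_{x}},\qquad \text{MAPE}=\frac{1}{100}\sum_{x=1}^{100}\text{MAPE}_{{TC}_{x}}. \] We also report the median MAE across TCs as a measure that is more robust to outlier TCs. Further examples and details on the evaluation metrics are given in Appendix 13 and Table 9.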

输入文本中公式存在重复识别片段,如“n n”“f i f_{i}”“x x”,已按公式语义整理为标准数学记法;公式编号 (3)(4)(5) 在原段中出现但未在译文公式后保留编号,可能需要与排版上下文核对;“Eq. 22”译为“式 22”;因公式 OCR/抽取明显混杂,需人工复核。
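The per-TC and overall error metrics defined above can be computed as in the following sketch. The function names and the two toy test cases are hypothetical; the code averages over however many test cases it is given, whereas the benchmark uses 100 per mechanism.

```python
from statistics import mean, median


def tc_mae(predicted, truth):
    """Mean absolute error over the flows of one test case (same unit as the WCD values)."""
    return mean(abs(p - t) for p, t in zip(predicted, truth))


def tc_mape(predicted, truth):
    """Mean absolute percentage error over the flows of one test case, in percent."""
    return mean(abs(p - t) / t for p, t in zip(predicted, truth)) * 100.0


def aggregate(test_cases):
    """Average the per-TC metrics and also report the median per-TC MAE."""
    maes = [tc_mae(p, t) for p, t in test_cases]
    mapes = [tc_mape(p, t) for p, t in test_cases]
    return {"MAE": mean(maes), "MAPE": mean(mapes), "median_MAE": median(maes)}


# Two invented test cases: (predicted WCDs, ground-truth WCDs) in microseconds.
test_cases = [
    ([110.0, 190.0], [100.0, 200.0]),
    ([300.0], [250.0]),
]
summary = aggregate(test_cases)
```

Averaging per TC first, and only then across TCs, gives every test case equal weight regardless of how many flows it contains; the median MAE is reported alongside because a single badly wrong TC can dominate the mean.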

P037已复核

结果与讨论:表 3 展示了在全部 100 个 TC 上,CBS 和 CQF 两种机制的 WCD 计算结果。核心发现是,MCQA 准确率与计算推理性能之间存在显著脱节。在 MCQA 上达到 90% 以上准确率的模型,在开放式 WCD 计算上仍然显著失败;其中表现最好的模型 GPT-5 在 CBS 上取得的 MAE 中位数为 92.4 μs,而这令人担忧,因为工业 TSN 流量可能具有严格的时序要求(Ekrad 等,2025)。详细的逐 TC 结果见附录 13。

“dissociation”译为“脱节”符合语境;数字 100、90%、92.4 μs 保留无误;引用 Ekrad et al., 2025 已中文化;未发现明显问题。

P038需人工复核

对于 CBS,大多数模型产生较大误差,其中许多模型超过 200 μs 的 MAE 和 70% 的 MAPE。若干模型表现出不同的失效模式。在 CBS 上,Llama 3.2 1B 对少于 50 个被评估 TC 作出响应,对少数 TC 返回全零 WCD 值,并对一些 TC 返回部分错误的值,且所有响应中的流覆盖都不完整。Grok 4.1 Fast (Reasoning) 返回被截断的 JSON,提供了流配置文件元数据,但没有提供 WCD 值,这表明该模型达到了输出长度限制。DeepSeek-V3.2 (Thinking) 在两种机制下合计超过 70 个 TC 返回空响应。CBS 所需的基于 NC 的计算在数学上要求高且复杂,而零样本设置表明,大多数模型无法独立回忆或正确应用完整的 NC 方法。在产生有效 CBS 响应的模型中,GPT-5 取得最佳表现(MAE 150.2 μs,MAPE 36.2%)。值得注意的是,OpenAI 推理模型和 Grok 4.1 Fast 在 CBS 上的表现优于非推理模型,其中 GPT-5 的 MAE 显著低于所有非推理模型,这表明多步骤数学推理能力即使不能提升 MCQA 准确率,也会为基于 NC 的 WCD 计算提供优势。

数字 200 μs、70%、少于 50、超过 70、150.2 μs、36.2% 保留无误;“for few TCs”译为“对少数 TC”但原文可能应为“a few TCs”,语义风险较低;“across both mechanisms”译为“两种机制下合计”需结合表格确认是否指 CBS 和 CQF 总计;状态建议人工核对该处。

P039已复核

对于 CQF,性能差异更大,MAE 中位数范围从 1.2 μs(GPT-4o)到 1,046 μs(Ministral 3 8B),MAPE 范围从 41.8%(Mistral Large 3)到 1705.5%(Ministral 3 8B)。尽管 GPT-4o 在 CBS 上完全失败,但它在 CQF 上达到最低的 MAE 中位数(1.2 μs,MAPE 61.9%),这表明它能够正确应用 CQF 的闭式方程。Mistral Large 3 在 CQF 上达到最低 MAPE(41.8%),表明它在所有被评估模型中具有最准确的相对 WCD 估计。Llama 3.2 1B 表现出最严重的幻觉失效,它编造了多达 1,013 条流(flow 0–1012),而不是为实际流(每个 TC 少于 30 条流)预测 WCD,并且对全部流都返回 WCD = 0。Qwen3 8B 由于反复 API 超时,未能为 CBS 或 CQF 产生任何响应。Ministral 3 8B 尽管是一个小模型,但对 CBS 和 CQF 都产生了有效响应,只是误差很大(CBS 的 MAPE 为 25498.1%,CQF 的 MAPE 为 1705.5%),这表明上下文处理是正确 WCD 计算的必要条件,但并非充分条件。

数字 1.2 μs、1,046 μs、41.8%、1705.5%、61.9%、1,013、0–1012、少于 30、25498.1% 均保留;“closed-form equation”译为“闭式方程”;逻辑完整;未发现明显问题。

P040已复核

MCQA 与开放式问题的比较:图 7 展示了模型在 MCQA 和开放式问题这两种评估类型之间的性能差异。右侧图显示了不同模型在单交换机拓扑上的 MAE,而左侧图给出了 MCQA 准确率。除 Llama 3.2 1B 外,所有模型的 MCQA 准确率都保持在较高水平,超过 80%。然而,对于截止期限处于 1000 到 5000 μs 范围内的 TSN 流而言,MAE 仍然很显著。图 18 进一步展示了模型在环形拓扑中针对 MCQA 和开放式问题的性能差异。

“one-switch topology”译为“单交换机拓扑”;数字 80%、1000 到 5000 μs 保留;figure 左右描述完整;未发现明显问题。

P041已复核

我们提出 TSNBench,这是首个用于评估大语言模型(LLM)在时间敏感网络(Time-Sensitive Networking,TSN)方面能力的基准。它包含 939 道经专家验证的多项选择题(MCQs),并针对基于信用的整形器(Credit-Based Shaper,CBS)和循环排队与转发(Cyclic Queuing and Forwarding,CQF)这两种机制分别包含 100 道开放式问题。真实标注的 WCD 值是这样计算得到的:对于 CBS,使用经过验证的网络演算(Network Calculus,NC)求解器;对于 CQF,使用闭式数学上界。我们评估了 16 个 LLM,并发现模型在 MCQA 上达到 67-95% 的准确率,但在开放式 WCD 计算上显著失败;即使是最佳模型(GPT-5),在 CBS 上仍然得到 36.2% 的平均绝对百分比误差(Mean Absolute Percentage Error,MAPE)。尽管 CBS 已被广泛研究并且是一种较早的机制,模型仍无法正确应用 NC;相比之下,CQF 由于具有更简单的闭式方程,模型处理得更成功。这证实 WCD 计算性能受数学复杂性支配,而不是受机制成熟度支配。TSNBench 表明,MCQ 基准会大幅高估 LLM 在安全关键领域中的能力。

术语 CBS、CQF、WCD、NC、MAPE、MCQ/MCQA 均已保留并译出;数字 939、100、16、67-95%、36.2% 保持一致;“ground truth WCD values”译为“真实标注的 WCD 值”较符合基准语境;逻辑中“CBS 更成熟但表现差、CQF 方程更简单但表现好”已保留。未发现明显问题。

P042已复核

局限性与未来方向:TSNBench 有三个主要局限。第一,MCQA 数据集由开放获取的研究论文生成,这限制了对某些机制的覆盖。第二,开放式评估仅覆盖 CBS 和 CQF。扩展到 TAS 是一个自然的下一步,尽管其 NP-hard 的门控控制列表(Gate Control List,GCL)综合问题带来了 CBS 和 CQF 之外的额外挑战。第三,开放式任务评估的是模型在闭卷提示下的独立零样本行为,不应被解释为安全关键 TSN 系统的推荐部署工作流。未来版本的 TSNBench 的重要方向包括:在让 LLM 生成可检查产物、并由确定性分析工具验证这些产物的设置中评估 LLM;以及评估在提示中提供 NC 方程是否能提高 WCD 计算准确率。

“closed-book prompt”译为“闭卷提示”,“standalone zero-shot model behavior”译为“独立零样本行为”,含义完整;NP-hard、GCL、NC、WCD 保留;因果和限制关系清楚。未发现明显问题。

P043已复核

尽管 TSNBench 填补了一个重要的研究空白,并提出了朝着评估 LLM 中 TSN 能力迈出的一步,但它仍有若干局限:

原文为引出局限列表的过渡句;“TSN capabilities in LLMs”译为“LLM 中 TSN 能力”略显直译,但含义可理解为模型掌握 TSN 的能力。未发现明显问题。

P044已复核

数据集范围:TSNBench 目前在开放式问题中仅覆盖 CBS 和 CQF。为了完整覆盖整个 TSN 机制体系,有必要评估其他 TSN 机制。

“entire TSN mechanism”原文表达可能不够自然,按语境译为“整个 TSN 机制体系”;CBS、CQF 保留;逻辑无缺失。未发现明显问题。

P045已复核

提示设计:对于 CBS 的 NC WCD 计算或 CQF 的上界时延计算,TSNBench 并未向模型提供任何数学方程作为输入。

NC WCD、CBS、CQF 均保留;“upper bound delay calculation”译为“上界时延计算”;句意完整。未发现明显问题。

P046已复核

MCQA 范围:MCQ 完全使用已发表的研究论文开发,并未使用 IEEE 标准来生成 MCQ。解决许可问题,并利用标准纳入基于 IEEE 802.1 标准的 MCQ,将增强整个 MCQA 数据集。

IEEE 802.1 标准编号保留;“license issue”译为“许可问题”;“MCQs are solely developed...”和“IEEE standards are not used...”两个限定关系均保留。未发现明显问题。

P047已复核

拓扑覆盖:TSNBench 的开放式问题目前覆盖三种不同拓扑:单交换机拓扑、中等规模网状拓扑和环形拓扑。覆盖多样化的拓扑和流参数将带来更全面的评估。

“one-switch, medium-mesh, and ring topology”分别译为“单交换机拓扑、中等规模网状拓扑和环形拓扑”;“flow parameters”译为“流参数”;无数字风险。未发现明显问题。

P048需人工复核

为了解决 TSNBench 的局限,我们提出在未来版本的 TSNBench 中加入以下补充和改进。1. 更大且更多样化的数据集:我们当前的 TSNBench 数据集覆盖三种拓扑类型中的 100 个 TC。在未来版本中,我们将纳入更大、更复杂且具有更高流数量的拓扑。随着模型性能提升,应将更复杂的开放式评估与复杂拓扑以及组合式 TSN 机制结合起来。2. 额外的调度机制:TSNBench 目前评估 CBS 和 CQF。未来版本应扩展到 TAS 和 ATS,以覆盖更广泛的 TSN 标准套件。3. 更新的 MCQA:我们的 MCQA 数据集是使用开源研究文档开发的。在未来工作中,我们将使用直接从 TSN 标准制定的 MCQA 来更新该数据集。4. 微调和领域适配模型。TSNBench 目前评估的是未经过任何 TSN 特定微调的通用 LLM。未来版本应对基于 TSN 标准和网络演算文献训练的领域适配模型进行基准测试。

编号 1-4、100 TCs、三种拓扑类型、CBS/CQF/TAS/ATS、MCQA、LLM、TSN 均保留;“TCs”未展开,原文未给出全称,保留缩写以避免误译;“combined TSN mechanisms”译为“组合式 TSN 机制”;无公式残缺。TCs 具体含义可能依赖前文,需注意上下文。

P049需人工复核

更大且更多样化的数据集:我们当前的 TSNBench 数据集覆盖三种拓扑类型中的 100 个 TC。在未来版本中,我们将纳入更大、更复杂且具有更高流数量的拓扑。随着模型性能提升,应将更复杂的开放式评估与复杂拓扑以及组合式 TSN 机制结合起来。

与 P048 中第 1 点内容一致;100 TCs、三种拓扑类型和“更高流数量”均保留;TCs 原文未展开,可能需要结合前文确认其准确含义。

P050已复核

额外的调度机制:TSNBench 目前评估 CBS 和 CQF。未来版本应扩展到 TAS 和 ATS,以覆盖更广泛的 TSN 标准套件。

CBS、CQF、TAS、ATS、TSN 标准套件均保留;逻辑为“当前覆盖有限,未来扩展机制范围”;未发现明显问题。

P051已复核

更新后的 MCQA:我们的 MCQA 数据集是使用开源研究文档开发的。在未来工作中,我们将使用直接从 TSN 标准制定出的 MCQA 来更新该数据集。

MCQA 缩写保留;“formulated directly from TSN standards”译为“直接从 TSN 标准制定出”较贴近原意。未发现明显问题。

P052已复核

微调模型和领域适配模型。TSNBench 目前评估的是通用 LLM,并未进行任何特定于 TSN 的微调。未来版本应当对基于 TSN 标准和网络演算文献训练的领域适配模型进行基准测试。

“network calculus”译为“网络演算”;“domain-adapted models”译为“领域适配模型”。未发现明显问题。

P053已复核

TSNBench 使实时系统社区和机器学习社区能够客观地衡量 LLM 在安全关键确定性网络中的管理和部署辅助方面的性能与就绪程度。通过突出 TSN 的关键方面,以及模型在 MCQA 与计算推理之间的性能差距,TSNBench 警示了模型能力不足这一问题,而这种能力不足可能导致错误配置和安全关键问题。该基准为改进面向确定性网络的 LLM 提供了具体方向。TSNBench 进一步突出了使用 LLM 的潜在收益,从而可实现 TSN 网络管理和部署的自动化。此外,由 NC 求解器计算并开源的真实值 WCD 值,为整个社区进一步评估不同基准测试数据集提供了可靠资源。

WCD、NC、MCQA 保留;“ground truth WCD values”译为“真实值 WCD 值”略显重复但保留术语完整性;“alerts the incompetence”原文表达不自然,译为“警示了模型能力不足”符合语义。未发现明显问题。

P054已复核

虽然 TSNBench 旨在推进关于 LLM 在 TSN 中熟练程度的研究,但我们承认以下潜在负面影响。

“proficiency”译为“熟练程度”;逻辑转折“While”已保留。未发现明显问题。

P055已复核

对模型输出的过度依赖:基于 TSNBench 提供的开放访问数据集训练的模型,可能会在 WCD 分析任务上达到较高准确率,这可能导致从业者在没有独立验证的情况下,直接在真实世界部署中部署此类模型。LLM 产生的任何 WCD 值或网络配置决策,都应当在真实世界部署之前,使用经过形式化验证的求解器和 NC 工具进行验证。

“open-access dataset”译为“开放访问数据集”;“formally verified solvers”译为“经过形式化验证的求解器”;WCD、NC 保留。未发现明显问题。

P056已复核

来自 MCQA 性能的虚假信心:我们的结果表明,强 MCQA 性能并不会迁移到开放式 WCD 估计。一名仅依据 MCQA 基准来评估 LLM 的从业者或系统工程师,可能会错误地得出该模型适合 TSN 配置任务的结论,从而在需要时序保证的系统中导致不安全部署。

“open-ended WCD estimation”译为“开放式 WCD 估计”;“timing guarantees”译为“时序保证”。未发现明显问题。

P057已复核

数据污染和基准过拟合:由于 TSNBench 作为开放访问数据集发布,未来模型可能会直接基于该基准问题进行训练,从而导致夸大的性能,而这种性能并不能反映真实的 TSN 推理能力。我们建议研究人员在测试用例中引入随机化,以防止结果中的偏差。研究人员在解释那些训练数据可能与 TSNBench 数据集重叠的模型结果时,应当保持谨慎。

“benchmark overfitting”译为“基准过拟合”;“inflated performance”译为“夸大的性能”;逻辑和因果关系完整。未发现明显问题。

P058已复核

数据集的误用:该数据集可用于训练模型来配置 TSN 网络。由于 TSN 应用具有安全关键性质,此类模型有可能被攻击者利用,以操纵网络配置、引入时序违例,或者在工业和汽车系统中故意造成截止期限错失。

“deadline misses”译为“截止期限错失”;“safety-critical nature”译为“安全关键性质”。未发现明显问题。

P059已复核

时间敏感网络(Time-Sensitive Networking,TSN)(Finn,2018)是一组对 IEEE 802.1 标准的修正和补充;自 2012 年提出以来,它已成为在以太网网络上实现确定性和实时通信的最相关技术之一。TSN 通过引入有界时延、低抖动和高可靠性机制来扩展标准以太网,使其适用于工业自动化、汽车系统和专业音视频网络等应用。图 8 展示了一个带有流的简单 TSN 网络。

“amendments and additions”译为“修正和补充”;“bounded latency, low jitter, and high reliability”指标完整;Finn 引用保留。未发现明显问题。

P060已复核

在 TSN 中,端站之间的通信基于以太网帧在由相互连接的以太网链路和 TSN 交换机构成的网络中的传输。这些交换机以及端站的输出端口实现了一种队列架构,该架构最多包含八个先进先出(First-In-First-Out,FIFO)队列,每个队列都与 IEEE 802.1Q(31)中定义的八个流量优先级之一相关联。TSN 并不仅限于有线领域。对确定性通信日益增长的需求已经扩展到无线领域,并使无线 TSN 网络受到显著关注。尽管 TSN 从根本上说是一种 IEEE 802.1 桥接以太网技术,但无线与 5G-TSN(Debnath 等,2023a)的集成需要额外的适配或转换功能,以及能够跨异构网络段保持确定性时延保证的时间同步机制。我们在图 9 中展示了一个 5G-TSN 系统,其中 TSN 发送方通过网络中的一个 TSN 交换机和 5G 系统,向无线接收节点发送混合关键性流量类型。TSN 中一些最常用的缩写见表 4。

IEEE 802.1Q(31)、FIFO、5G-TSN、Debnath 等引用均保留;“mixed criticality traffic types”译为“混合关键性流量类型”。图表上下文未提供,但本段自身无公式残缺。未发现明显问题。

P061已复核

帧会根据其优先级被分类到不同的流量类别,并分配到出口队列,传输选择通常由严格优先级机制控制。工业 TSN 流量通常被划分为等时流量、循环同步流量、循环异步流量、网络控制流量、告警与事件、配置与诊断以及尽力而为流量等流量类型(Ademaj et al., 2019)。这些流量类型需要不同的时序保证:安全关键的等时流量通常被映射为时间触发(TT)流量,需要有保证的时延和有界抖动,并且通常由时间触发机制处理,例如 Time-Aware Shaper(TAS)(Craciunas et al., 2016; Serna Oliver et al., 2018)。相比之下,需要有界端到端时延但对抖动控制要求不那么严格的循环同步或循环异步流量,通常被映射为 AVB 流流量,并且常由 Credit-Based Shaper(CBS)支持(Zhao et al., 2018)。TSN 还定义了 Asynchronous Traffic Shaping(ATS)(Specht and Samii, 2016; Debnath et al., 2023b; Nasrallah et al., 2019)、帧抢占(FP)(Debnath et al., 2024)以及 Cyclic Queuing and Forwarding(CQF)(Wang et al., 2023; Debnath et al., 2025a; Yan et al., 2020)等机制,用于在不同的流量和部署假设下提供确定性通信。

术语 TT、TAS、AVB、CBS、ATS、FP、CQF 已保留;引用年份与作者未改动;“strict priority”译为“严格优先级机制”合理;未发现明显问题。

P062需人工复核

这些机制调节帧在何时以及如何被传输,使网络能够提供有界时延、抖动以及受控带宽分配等保证。在 TSNBench 的 MCQA 数据集中,我们覆盖了不同 TSN 机制的基础知识,包括 TAS、CBS、ATS、CQF 和 CBS。这些 MCQA 在性质上是理论性的,覆盖对这些机制的基本理解,而不进入其数学或分析细节。相比之下,对于开放式机制题,我们评估模型执行数值分析、构建数学方程以及为网络中的流找到 WCD 值的能力。为此,我们选择了两种 TSN 机制:CBS 和 CQF。使用 CBS 机制的流的 WCD 值通过 NC 分析计算,而该分析在数学上较为复杂。因此,我们也评估 CQF 机制,将其作为一种更简单的机制。使用 CQF 机制的流的 WCD 值可以直接利用流的路由和周期时长来计算。CQF 和 CBS 的详细工作机制与架构在附录 9 和附录 10 中进行了详细描述。NC 理论以及数学方程在附录 8 中进一步解释。

原文列表中 “CBS” 出现两次,按要求保留,可能是原文笔误;WCD、NC、MCQA 等缩写已保留;“open-ended mechanisms”按上下文译为“开放式机制题”,存在轻微语境依赖;附录编号 8、9、10 未改动。

P063需人工复核

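Network Calculus (NC) is a theory, based on min-plus algebra, for computing worst-case bounds in communication networks. Its basic paradigm involves two operators: the convolution \(\otimes\),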

本段在“卷积 ⊗ \otimes”后结束,公式明显延续到下一段;术语 “min-plus algebra”译为“min-plus 代数”以保留专名;由于段落/公式被切分,需人工核对排版完整性。

P064需人工复核

\[ (f\otimes g)(t)=\inf_{0\leq s\leq t}\{f(t-s)+g(s)\}, \tag{6} \] and the deconvolution \(\oslash\),

公式符号、上下界、inf、编号(6)已保留;本段同时包含可读公式与 LaTeX 形式,按原文保留;句首承接上一段且句尾继续到下一段,属于公式切分上下文,需人工核对。

P065需人工复核

\[ (f\oslash g)(t)=\sup_{s\geq 0}\{f(t+s)-g(s)\}. \tag{7} \]

反卷积符号 ⊘、sup、条件 \(s ≥ 0\)、公式编号(7)已保留;本段为公式残片,需结合前后段确认排版和语义完整性。
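The two min-plus operators can be made concrete on sampled curves. The sketch below is an illustration under assumptions: curves are lists of values at integer sample times, the infimum and supremum become `min` and `max` over the samples, and the deconvolution is truncated to the sampled horizon, so it is only meaningful when the true supremum is reached inside that horizon.

```python
def min_plus_conv(f, g):
    """(f (x) g)[t] = min over 0 <= s <= t of f[t - s] + g[s], on sample indices."""
    n = min(len(f), len(g))
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(n)]


def min_plus_deconv(f, g):
    """(f (/) g)[t] = max over s >= 0 of f[t + s] - g[s], truncated to the sampled horizon."""
    n = min(len(f), len(g))
    return [max(f[t + s] - g[s] for s in range(n - t)) for t in range(n)]


# Two linear curves through the origin, sampled at t = 0, 1, 2, 3.
fast = [0, 2, 4, 6]  # slope 2
slow = [0, 1, 2, 3]  # slope 1
conv_result = min_plus_conv(fast, slow)
deconv_result = min_plus_deconv(slow, fast)
```

For two linear curves through the origin the convolution returns the slower one, which matches the intuition that the tighter constraint dominates.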

P066已复核

基于该代数,可以构造到达曲线和服务曲线,分别用于描述任意时间区间内的最大到达流量数据和最小服务能力。在混合 TSN/TAS+CBS 架构中,ET 流量的服务不仅受到带宽预留的约束,还受到高优先级 TT 流量的约束。我们采用最先进的网络演算模型(Zhao et al., 2021, 2024),以在 TSN/TAS+CBS 架构中为具有任意数量 SR 类的 ET 流保证截止期限。由于在我们的开放式 CBS 问题中没有任何 TAS 机制,因此我们使用不包含 TAS 机制的 TSN/TAS+CBS 架构,其中网络中的 AVB 流仅采用 CBS 机制。

arrival curve、service curve、ET、TT、SR、AVB、CBS 等术语处理一致;“deadline guarantees”译为“截止期限保证”合理;原文 “Since,” 语法略不自然但含义清楚;未发现明显问题。

P067已复核

如(Zhao et al., 2024)所述,服务曲线 \(\beta(t)\) 用于约束最小服务能力,满足 \(\mathcal{R}^{*}(t)\geq\left(\mathcal{R}\otimes\beta\right)(t)\)。(8)函数 \(\mathcal{R}(t)\)(相应地,\(\mathcal{R}^{*}(t)\))是输入(相应地,输出)累积函数,用于统计截至时间 \(t\) 到达服务器(相应地,从服务器离开)的该流总数据比特数。服务曲线的一个典型例子是速率-时延形式,\(\beta_{R,T}(t)=R[t-T]^{+}\)(9),其中服务速率为 \(R\),时延为 \(T\)。记号 \([x]^{+}\) 在 \(x\geq 0\) 时等于 \(x\),否则等于 0。

公式(8)(9)、\(\mathcal{R}\)、\(\mathcal{R}^{*}\)、\(\beta_{R,T}\)、\([x]^+\) 含义已保留;“departure from the server”译为“从服务器离开”修正为自然中文;未发现明显问题。

P068需人工复核

In the hybrid TSN/TAS+CBS architecture, for any SR Class \(M_i\) (\(i\in[1,N_{SR}]\)), the CBS service curve at output port \(h\) that accounts for the impact of TT traffic (Zhao et al., 2021) is \[ \beta^{h}_{M_{i}}(t)=idSl^{h}_{M_{i}}\left[t-\frac{\alpha_{TAS}^{h}(t)}{C}-\frac{c_{M_{i}}^{h,\max}}{idSl^{h}_{M_{i}}}\right]^{+}_{\uparrow}, \tag{10} \] where \(c_{M_{i}}^{h,\max}\) is the credit upper bound of SR Class \(M_i\), \[ c_{M_{i}}^{h,\max}=idSl^{h}_{M_{i}}\cdot\frac{\sum_{j=1}^{i-1}c_{M_{j}}^{h,\min}-l^{h,\max}_{>i}}{\sum_{j=1}^{i-1}idSl^{h}_{M_{j}}-C}, \tag{11} \] where \(l^{h,\max}_{>i}=\max_{j>i}\{l^{h,\max}_{M_{j}},l^{h,\max}_{BE}\}\) is the maximum frame size at \(h\) among classes with priority lower than Class \(M_i\), \(l^{h,\max}_{M_{j}}\) is the maximum frame size of Class \(M_j\) at \(h\), and \(c_{M_{i}}^{h,\min}\) is the credit lower bound of SR Class \(M_i\), \[ c_{M_{i}}^{h,\min}=sdSl^{h}_{M_{i}}\cdot\frac{l^{h,\max}_{M_{i}}}{C}. \tag{12} \] In Eq. (10), \(\alpha_{TAS}^{h}(t)\) is the arrival curve of the TT traffic scheduled by the GCL.

公式(10)(11)(12)及 \(idSl\)、\(sdSl\)、\(C\)、GCL 等符号/缩写已保留;原文称 \(l^{h,\max}_{M_j}\) 是 “Class \(M_i\)” 的最大帧大小,符号与文字可能不一致,已按原文翻译但需核对是否应为 Class \(M_j\);\([\,]^{+}_{\uparrow}\) 记号含义未在本段解释,需结合上下文确认。

P069需人工复核

到达曲线 \(\alpha(t)\) 用于约束流的到达过程,满足 \(\mathcal{R}(t)\leq\left(\mathcal{R}\otimes\alpha\right)(t)\)。(13)到达曲线的一个典型例子是突发-速率形式,

公式(13)和 \(\alpha(t)\)、\(\mathcal{R}(t)\)、\(\otimes\) 已保留;本段以“突发-速率形式,”结尾,公式延续到下一段,需人工核对段落切分。

P070需人工复核

\(\alpha(t)=b+\rho\cdot t\),(14)

公式(14)已保留;本段为公式片段,未给出 \(b\) 和 \(\rho\) 的解释,需结合后续段落确认完整性。

P071需人工复核

当 \(t > 0\) 时成立,而在其他情况下为 0,其中参数 \(b\) 表示最大突发容忍度,\(\rho\) 表示该流的长期速率。

原文开头疑似承接上一段公式;“for t > 0 t>0”存在重复识别,但含义明确。术语、数字、逻辑、公式/缩写未发现其他明显问题。

P072已复核

对于每个在其源端 ES \(h_{0}\) 处的 ET 流 \(f\),到达曲线可以建模为:

ES、ET 保留为缩写;“source ES \(h_0\)”译为“源端 ES \(h_0\)”符合上下文。未发现明显问题。

P073已复核

\[ \alpha_{f}^{h_{0}}(t)=b_{f}^{h_{0}}+\rho_{f}^{h_{0}}t, \tag{15} \] 其中 \(b_{f}^{h_{0}}=l_{f}\),且 \(\rho_{f}^{h_{0}}=l_{f}/P_{f}\)。流 \(f\) 在中间节点 \(h\) 处的到达曲线,是流 \(f\) 从服务器 \(h^{-}\) 离开时的输出到达曲线,

公式、编号、参数关系已保留;“server \(h^{-}\)”译为“服务器 \(h^{-}\)”。段落末尾以逗号结束,明显承接下一段公式,逻辑完整但依赖后文。未发现明显问题。

P074已复核

\[ \alpha_{f}^{h}(t)=\alpha_{f}^{h^{-}}\oslash\delta_{D_{f}^{h^{-}}}(t), \tag{16} \]

公式符号 \(\oslash\)、纯延迟函数 \(\delta\)、上标 \(h^{-}\)、公式编号均已保留。未发现明显问题。

P075已复核

其中,\(D_{f}^{h^{-}}\) 是流 \(f\) 在服务器 \(h^{-}\) 处排队的时延上界,而 \(\delta_{D}(t)\) 是纯延迟函数。

术语“latency upper bound”译为“时延上界”,“pure-delay function”译为“纯延迟函数”;公式符号与缩写未发现明显问题。
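Equations (15) and (16) can be followed numerically for a single flow. For a non-decreasing curve, deconvolution with the pure-delay function \(\delta_D\) shifts the curve left by \(D\), so a burst-rate curve \(b+\rho t\) leaves a hop as \((b+\rho D)+\rho t\). The sketch below assumes that single-flow view; the function names, units (bits and microseconds), and numeric values are hypothetical.

```python
def source_arrival_curve(frame_bits, period_us):
    """Return (burst b, rate rho) with b = l_f and rho = l_f / P_f, as in Eq. (15)."""
    return frame_bits, frame_bits / period_us


def delayed_arrival_curve(burst, rate, delay_bound_us):
    """Output curve after a hop with delay bound D: alpha(t + D) = (b + rho * D) + rho * t."""
    return burst + rate * delay_bound_us, rate


# Hypothetical flow: one 1500-byte frame (12000 bits) every 1000 us.
b0, rho0 = source_arrival_curve(frame_bits=12000, period_us=1000)
# Hypothetical per-hop delay bound of 50 us at the source output port.
b1, rho1 = delayed_arrival_curve(b0, rho0, delay_bound_us=50)
```

The long-term rate is unchanged from hop to hop while the burst grows by \(\rho D\), which is why delay bounds tend to loosen along longer routes.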

P076已复核

在 \(h\) 处,SR 类 \(M_{i}\) 的 ET 流的聚合到达曲线通过对各个流的到达曲线求和得到。它还结合了链路整形曲线和 CBS 整形曲线,以提高分析结果的紧致性。 \[ \alpha^{h}_{M_{i}}(t)= \sum_{h^{-}\in\mathcal{H}} \sum_{f\in\mathcal{F}_{M_{i}}^{h^{-},h}} \alpha^{h}_{f}(t) \wedge\sigma_{link}^{h^{-},h}(t) \wedge\sigma_{M_{i}}^{h^{-},h}(t), \tag{17} \]

SR、ET、CBS 缩写已保留;“tightness”译为“紧致性”。公式中的 \(\wedge\) 与两类整形曲线已保留。原文公式排版有多处空格识别噪声,但结构可辨。未发现明显问题。

P077需人工复核

其中,\(x\wedge y=\min\{x,y\}\),\(\sigma_{link}^{h^{-},h}(t)\) 是从前一输出 \(h^{-}\) 到当前输出端口 \(h\) 的链路整形曲线: \[ \sigma_{link}^{h^{-},h}(t)=Ct+l_{M_{i}}^{h^{-},h,\max}, \tag{18} \] 其考虑了从 \(h^{-}\) 到 \(h\) 的类 \(M_{i}\) 流的最大帧大小 \(l_{M_{i}}^{h^{-},h,\max}\) 所带来的分组化影响。\(\sigma_{M_{i}}^{h^{-},h}(t)\) 是从 \(h^{-}\) 到 \(h\) 的类 \(M_{i}\) 的 CBS 整形曲线: \[ \sigma_{M_{i}}^{h^{-},h}(t)= idSl^{h^{-}}_{M_{i}} \left[ t-\frac{\beta^{h^{-}}_{TAS}(t)}{C} +\frac{c_{M_{i}}^{h^{-},\max}-c_{M_{i}}^{h^{-},\min}}{idSl^{h^{-}}_{M_{i}}} \right] +l_{M_{i}}^{h^{-},h,\max}, \tag{19} \] \(\beta^{h}_{TAS}(t)\) 表示在输出端口 \(h\) 上提供给 TT 流量的最小服务。

公式较长且原文存在 OCR/排版空格噪声,已按 LaTeX 片段还原;\(idSl\)、\(c^{\max}\)、\(c^{\min}\)、\(\beta_{TAS}\)、\(C\) 等符号均保留。需注意原文中公式使用 \(\beta^{h^{-}}_{TAS}(t)\),随后解释写 \(\beta^{h}_{TAS}(t)\),上下标存在上下文差异,建议人工复核。

P078已复核

使用基于网络演算(NC)的总流分析(Total Flow Analysis, TFA),对于在 \(h\) 处的流 \(f\in\mathcal{F}_{M_{i}}^{h}\),其最坏情况时延上界 \(D_{f}^{h}\) 等于在 \(h\) 处聚合的所有具有相同优先级 \(M_{i}\) 的流的最坏情况时延上界 \(D_{M_{i}}^{h}\),即 \[ D_{f}^{h}=D_{M_{i}}^{h} =hDev(\alpha_{M_{i}}^{h},\beta_{M_{i}}^{h}) =\sup_{t\geq 0} \left\{ \inf \left\{ \tau\geq 0 \mid \alpha^{h}_{M_{i}}(t) \leq \beta^{h}_{M_{i}}(t+\tau) \right\} \right\}. \tag{20} \]

NC、TFA 缩写及全称已保留;\(hDev\)、\(\sup\)、\(\inf\)、\(\tau\)、不等式与公式编号已保留。原文公式末尾含换行符号“\\\\”,属于排版残留,译文未保留。未发现明显问题。
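The horizontal deviation in Eq. (20) can be checked numerically in a simplified setting. The sketch below is not the CBS analysis of Eqs. (10) and (17): it assumes a single burst-rate arrival curve \(b+\rho t\) and a rate-latency service curve \(R[t-T]^{+}\), for which the known closed form is \(T+b/R\) when \(\rho\leq R\). All names and numbers are hypothetical.

```python
def delay_at(t, burst, rate, service_rate, latency):
    """Smallest tau >= 0 with alpha(t) <= beta(t + tau), for alpha = b + rho*t, beta = R*[t - T]^+."""
    alpha_t = burst + rate * t  # at t = 0 this is the limit from the right, i.e. the burst b
    # beta reaches the level alpha_t at time T + alpha_t / R
    return max(0.0, latency + alpha_t / service_rate - t)


def h_dev(burst, rate, service_rate, latency, horizon, steps=1000):
    """Approximate sup over t in [0, horizon] of the per-instant delay on a uniform grid."""
    if rate > service_rate:
        raise ValueError("arrival rate exceeds service rate: the delay is unbounded")
    return max(
        delay_at(horizon * k / steps, burst, rate, service_rate, latency)
        for k in range(steps + 1)
    )


def end_to_end_bound(per_port_bounds):
    """TFA-style end-to-end bound: the sum of the per-port delay bounds along the route."""
    return sum(per_port_bounds)


# Hypothetical port: b = 12000 bits, rho = 12 bits/us, R = 1000 bits/us (1 Gbit/s), T = 10 us.
port_bound = h_dev(burst=12000.0, rate=12.0, service_rate=1000.0, latency=10.0, horizon=5000.0)
route_bound = end_to_end_bound([port_bound, port_bound, 30.0])
```

The numerical result agrees with \(T+b/R=10+12=22\) microseconds, since the per-instant delay is largest right after the burst arrives and decreases afterwards.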

P079已复核

其中,\(\alpha^{h}_{M_{i}}(t)\) 是式(17)中类 \(M_{i}\) 的聚合流到达曲线,而 \(\beta^{h}_{M_{i}}(t)\) 是式(10)中类 \(M_{i}\) 的服务曲线。随后,通过对流 \(f\) 沿其路径经过的各端口时延界进行求和,得到该流的最坏情况端到端时延上界。

式(17)、式(10)引用已保留;“per-port latency bounds”译为“各端口时延界”。未发现明显问题。

P080已复核

基于信用的整形器(Credit-Based Shaper, CBS)是一种 TSN 机制,旨在防止较低优先级流量饥饿,同时为较高优先级队列保证预留的一部分带宽,从而通过有界端到端时延提供可靠性。分配给使用 CBS 的队列的流量通常称为音视频桥接(Audio Video Bridging, AVB)流量。

CBS、TSN、AVB 缩写及全称已保留;“starvation”译为“饥饿”,“bounded end-to-end delays”译为“有界端到端时延”。未发现明显问题。

P081已复核

这里,我们基于(Bujosa Mateu, 2024)中的描述展开。在 CBS 中,每个 AVB 队列都关联一个信用值。当某个帧正在等待传输,或者当信用值为负时,该信用值会随时间增加;而在某个帧正在被传输期间,该信用值会降低。此外,如果信用值为正,并且没有 AVB 帧正在等待传输,则该信用值会立即重置为 0。信用值增加和降低的速率分别由参数 idleSlope 和 sendSlope 定义。每个实现 CBS 的队列都配置有自己的 idleSlope 和 sendSlope 值,这些值决定其被分配的带宽份额。特别地,为一个队列保留的带宽表示为式(21)。只有当其信用值为零或为正时,队列才有资格进行传输。

术语 CBS、AVB、idleSlope、sendSlope 已保留;“eligible for transmission”译为“有资格进行传输”符合调度语境;数字 0、引用和式(21)未遗漏。未发现明显问题。
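The credit rules in this paragraph can be written as a single update step. This is a minimal sketch of the behavior as described here, not a standards-complete CBS model: the function name is hypothetical, sendSlope is used by magnitude, and a recovering credit is capped at zero when no frame is waiting, which follows from the reset rule.

```python
def update_credit(credit, dt, idle_slope, send_slope, transmitting, frame_waiting):
    """Advance the credit of one AVB queue by dt under the rules described in the text."""
    if transmitting:
        # Credit decreases at sendSlope while a frame of this queue is on the wire.
        return credit - abs(send_slope) * dt
    if frame_waiting:
        # Credit increases at idleSlope while a frame waits for transmission.
        return credit + idle_slope * dt
    if credit < 0:
        # Negative credit recovers at idleSlope but does not rise above zero with nothing waiting.
        return min(credit + idle_slope * dt, 0.0)
    # Non-negative credit with no waiting frame is reset to zero.
    return 0.0


# Hypothetical slopes in bits per microsecond on a 1000 bits/us link.
IDLE, SEND = 250.0, -750.0
after_tx = update_credit(0.0, 12.0, IDLE, SEND, transmitting=True, frame_waiting=False)
after_wait = update_credit(after_tx, 12.0, IDLE, SEND, transmitting=False, frame_waiting=True)
```

A queue becomes eligible again only once the returned credit is zero or positive, which is the condition the example in the following paragraphs walks through at times T1 to T3.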

P082已复核

保留带宽的计算如下: \[ \mathrm{Reserved\;BW}=\frac{\textit{idleSlope}}{\textit{idleSlope}+\textit{sendSlope}}\cdot BW \tag{21} \]

公式符号 Reserved BW、idleSlope、sendSlope、BW 和编号(21)已保留;输入文本中存在重复的纯文本公式与 LaTeX 公式,译文采用规范公式呈现。需注意 CBS 标准语境中 sendSlope 通常可能为负值,原式照译未改。
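Equation (21) can be evaluated as in the sketch below. One assumption is made explicit: in the IEEE 802.1Q convention sendSlope equals idleSlope minus the port transmit rate and is therefore negative, so the code uses its magnitude in the denominator. The function name and the numeric values are hypothetical.

```python
def reserved_bandwidth(idle_slope, send_slope, link_bandwidth):
    """Reserved BW = idleSlope / (idleSlope + |sendSlope|) * BW, following Eq. (21)."""
    send_magnitude = abs(send_slope)  # assumption: sendSlope may be given as a negative rate
    if idle_slope <= 0 or send_magnitude <= 0:
        raise ValueError("slopes must be non-zero, with a positive idleSlope")
    return idle_slope / (idle_slope + send_magnitude) * link_bandwidth


# Hypothetical 1000 Mbit/s port with idleSlope = 250 Mbit/s, hence sendSlope = -750 Mbit/s.
reserved = reserved_bandwidth(idle_slope=250.0, send_slope=-750.0, link_bandwidth=1000.0)
```

When idleSlope and the magnitude of sendSlope sum to the link rate, the reserved bandwidth equals idleSlope itself, here a quarter of the link.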

P083已复核

考虑图 11 所示的示例,其中包括两个 AVB 队列和一个 Best Effort(BE)队列。帧 1 和帧 4 被分配到较高优先级的 AVB 队列,而帧 2 和帧 3 分别属于较低优先级的 AVB 队列和 BE 队列。

图 11、两个 AVB 队列、一个 BE 队列,以及帧 1、2、3、4 的归属关系均已保留;Best Effort 缩写 BE 已保留。未发现明显问题。

P084已复核

在时间 T0,两个 AVB 队列都有资格进行传输。由于采用严格优先级调度,较高优先级的 AVB 队列(优先级 2)被选中,并传输帧 1。在此次传输期间,该队列的信用值降低,而较低优先级 AVB 队列的信用值增加,因为它正在等待。

时间 T0、优先级 2、帧 1 和信用值变化方向均准确;“strict priority scheduling”译为“严格优先级调度”。未发现明显问题。

P085已复核

在时间 T1,较高优先级 AVB 队列已经累积了负信用值,因此不再有资格进行传输。结果,较低优先级 AVB 队列被选中,并传输帧 2,即使一个更高优先级的帧(帧 4)正在等待。在此期间,较低优先级队列的信用值降低,而较高优先级队列的信用值恢复。

时间 T1、负信用值、帧 2、帧 4 以及两个队列的信用值变化均已保留;因果关系“therefore / As a result”已表达。未发现明显问题。

P086已复核

到时间 T2 时,两个 AVB 队列都具有负信用值,使它们没有资格进行传输。因此,BE 队列被选中,并传输帧 3,尽管此时有一个更高优先级的 AVB 帧正在等待。

时间 T2、两个 AVB 队列负信用值、BE 队列、帧 3、较高优先级 AVB 帧等待均未遗漏;逻辑转折“despite”已表达。未发现明显问题。

P087已复核

最后,在时间 T3,较高优先级 AVB 队列的信用值已经恢复到零,使其再次有资格进行传输。因此,帧 4 被传输。

时间 T3、信用值恢复到零、再次 eligible、帧 4 均准确;逻辑关系清楚。未发现明显问题。
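
上面 T0 到 T3 的过程可以用一个很小的信用值更新函数来示意。这只是依据前文规则写出的草图:斜率数值(75 和 -25 bits/μs)、时间长度以及函数名 `update_credit`、`eligible` 都是假设,图 11 中的实际帧长和时刻并未在本节给出。

```python
def update_credit(credit, duration_us, idle_slope, send_slope,
                  transmitting, frames_waiting):
    """返回经过 duration_us 之后的信用值(单位:bits)。

    idle_slope 为正、send_slope 为负,单位均为 bits/us。
    """
    if transmitting:
        # 队列正在发送帧:信用值按 sendSlope 下降
        return credit + send_slope * duration_us
    if frames_waiting:
        # 有帧在等待:信用值按 idleSlope 上升
        return credit + idle_slope * duration_us
    if credit < 0:
        # 队列为空且信用值为负:回升,但最多回到 0(本示意的简化处理)
        return min(0.0, credit + idle_slope * duration_us)
    # 队列为空且信用值非负:重置为 0
    return 0.0


def eligible(credit):
    """信用值为零或为正时,队列才有资格传输。"""
    return credit >= 0


if __name__ == "__main__":
    c = update_credit(0.0, 100, 75, -25, transmitting=True, frames_waiting=False)
    print(c, eligible(c))      # 发送一帧后信用值为负,暂时失去资格
    c = update_credit(c, 20, 75, -25, transmitting=False, frames_waiting=True)
    print(c, eligible(c))      # 等待期间信用值回升
```

负信用值的队列即使优先级更高也不能发送,这正是示例中帧 2 和帧 3 能先于帧 4 传输的原因。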

P088已复核

循环排队与转发(Cyclic Queuing and Forwarding,CQF)(Debnath et al., 2025b)是一种 TSN 整形机制,它在整个网络中使用单一的周期时长,记为 \(T\)。\(T\) 是我们放置 TSN 流的最小调度单元。此外,\(T\) 定义了网络中流的端到端时延粒度。在 TSNBench 中,\(T\) 的单位为 \(\mu s\)。在 TSN 交换机中,网络中的每个出端口都有八个队列。TSN 流根据其优先级存储在队列中。在 CQF 中,对于每个出端口,会使用两个队列:一个偶数队列和一个奇数队列。图 14 展示了具有两个队列(偶数和奇数)的 CQF 基本工作图。如图 14 所示,CQF 通过采用两个队列来工作,例如用于 TT 流的 \(Q_{8}\) 和 \(Q_{7}\),并以乒乓方式运行它们,其中在第一个周期时隙(\(T_{1}\))中,\(Q_{7}\) 接收,\(Q_{8}\) 传输。在第二个周期时隙(\(T_{2}\))期间,\(Q_{8}\) 接收,\(Q_{7}\) 传输。为某个流选择或分配一个周期时隙,意味着为该流选择周期时隙编号(在超周期 \(H\) 内)以及队列。

CQF、TSN、TT、\(T\)、\(\mu s\)、\(Q_{8}\)、\(Q_{7}\)、\(T_{1}\)、\(T_{2}\)、超周期 \(H\) 均已保留;“T T”与“H H”应为 PDF 抽取导致的重复,已规范为 \(T\)、\(H\)。原文中“depending on its priority”指 TSN flows,存在单复数不一致,译文按语义处理为“其优先级”。未发现明显问题。
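
CQF 的乒乓切换可以用下面的草图表示:奇数编号的周期时隙中 \(Q_{7}\) 接收、\(Q_{8}\) 传输,偶数编号的时隙中角色互换。函数名 `slot_of`、`cqf_roles` 以及从 1 开始的时隙编号方式是为说明而作的假设。

```python
def slot_of(time_us, cycle_us):
    """返回时刻 time_us 所在的周期时隙编号(从 1 开始)。"""
    if time_us < 0 or cycle_us <= 0:
        raise ValueError("time_us 必须非负,cycle_us 必须为正")
    return int(time_us // cycle_us) + 1


def cqf_roles(slot_number, queue_a=7, queue_b=8):
    """返回 (接收队列, 传输队列)。

    时隙 T1 中 queue_a 接收、queue_b 传输;T2 中角色互换,依此交替。
    """
    if slot_number < 1:
        raise ValueError("时隙编号从 1 开始")
    if slot_number % 2 == 1:
        return queue_a, queue_b
    return queue_b, queue_a


if __name__ == "__main__":
    for t in (0, 60, 120):               # 假设周期时长 T = 50 us
        s = slot_of(t, 50)
        print(t, s, cqf_roles(s))
```

一帧在某个时隙被接收后,要到下一个时隙才会被传输,因此每经过一台交换机,时延大约增加一个周期时长。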

P089已复核

在 TSNBench 的 CQF 评估中,我们通过提示词将周期时长(\(T\))和特定于网络的时延作为输入提供给模型。

CQF evaluation、cycle duration \(T\)、network-specific delays、prompt、model 均已准确表达;“delays”译为“时延”符合网络语境。未发现明显问题。

P090需人工复核

WCD CQF:CQF 网络中 TT 流的最坏情况端到端时延量化如下: \[ \mathrm{Max\;Delay}=f_i.\phi+(\mathrm{SW_{num}}+1)\cdot\mathrm{T}+\xi \tag{22} \] 其中,\(f_i.\phi\) 是流 \(f_i\) 的偏移量,单位为 \(\mu s\);\(\mathrm{SW_{num}}\) 是 TT 流路由中的交换机总数;\(\mathrm{T}\) 是周期时长,单位为 \(\mu s\);\(\xi\) 表示特定于网络的时延:处理时延、传播时延和时间同步误差(sync error,\(\mathrm{sync_{error}}\))。

WCD CQF、Max Delay、\(f_i\)、\(\phi\)、\(\mathrm{SW_{num}}\)、\(\mathrm{T}\)、\(\xi\)、\(\mu s\)、sync error 和 \(\mathrm{sync_{error}}\) 均已保留;输入公式中写作 \(f_i.\phi\),解释文字写作 \(f_i\cdot\phi\)。由于该项被定义为流 \(f_i\) 的偏移量,点号应为属性记法而非乘号,解释文字已统一为 \(f_i.\phi\),建议人工对照原 PDF 确认。
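
式(22)可以直接写成一个函数。下面的示意使用本文后面真实 WCD 示例中的数值(偏移量 0、3 台交换机、\(T=50~\mu s\)、\(\xi=5~\mu s\));函数名 `cqf_max_delay_us` 和参数名是为说明而引入的。

```python
def cqf_max_delay_us(offset_us, sw_num, cycle_us, xi_us):
    """式(22):Max Delay = f_i.phi + (SW_num + 1) * T + xi,单位均为微秒。"""
    if sw_num < 0 or cycle_us <= 0:
        raise ValueError("sw_num 必须非负,cycle_us 必须为正")
    return offset_us + (sw_num + 1) * cycle_us + xi_us


if __name__ == "__main__":
    print(cqf_max_delay_us(offset_us=0, sw_num=3, cycle_us=50, xi_us=5))
```

由于周期时长 \(T\) 乘以交换机数加一,\(T\) 的取值通常主导整条路径的时延上界。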

P091已复核

为了在所有人工评审者之间保持相同的标准,我们使用以下规则来评估 MCQA 数据集。MCQA 数据集中的每个问题都有四个可能选项。

术语 MCQA 保留为缩写;“human reviewers”译为“人工评审者”,“options”译为“选项”。数字“四个”已保留。未发现明显问题。

P092已复核

1. 接受:i. 技术上正确。ii. 表述清晰且自包含。iii. 选项无歧义。iv. 解释准确且充分。v. 正确答案实际上确实是正确答案。2. 拒绝:i. 不正确或具有误导性。ii. 构造很差,已超出修订范围。iii. 与 TSN 无关。iv. 信息不完整。v. 过于依赖论文。vi. 重复问题。3. 修订:i. 在语法、清晰度或措辞方面存在轻微问题。ii. 选项需要改进。iii. 解释需要完善。4. 存疑:i. 仅针对特定论文,或对问题正确性不确定。ii. 解释看起来有疑问。iii. 需要进一步澄清。对于存疑的多项选择题,我们会阅读对应的研究论文并重新评估该问题。之后,决定可以是接受、拒绝或修订;如果它仍然存疑,我们会将其发送给另一位专家评审者,以便作出基于共识的小组决定。

保留 TSN 缩写;MCQA 上下文对应“多项选择题”。编号、子项 i-vi、四类判断均已保留。“Too paper-dependent”译为“过于依赖论文”可能也可理解为“过于依赖特定论文内容”,但语义未偏离。未发现明显问题。

P093已复核

接受:i. 技术上正确。ii. 表述清晰且自包含。iii. 选项无歧义。iv. 解释准确且充分。v. 正确答案实际上确实是正确答案。

该段与 P092 中“Accept”部分重复,疑似由列表抽取造成的重复段落;译文按输入逐段对应保留。术语、编号和逻辑未发现明显问题。

P094已复核

技术上正确。

该段是 P093 子项的拆分重复,可能来自列表结构抽取;译文忠实对应。未发现明显问题。

P095已复核

表述清晰且自包含。

“self-contained”译为“自包含”,符合技术语境;该段是列表子项拆分。未发现明显问题。

P096已复核

选项无歧义。

“Unambiguous options”译为“选项无歧义”,简洁准确;无数字、公式或缩写风险。未发现明显问题。

P097已复核

解释准确且充分。

“Accurate and sufficient explanation”译为“解释准确且充分”,术语和逻辑无风险。未发现明显问题。

P098已复核

正确答案实际上确实是正确答案。

原文表达带有强调性重复,译文保留了这种强调;无数字、公式或缩写风险。未发现明显问题。

P099已复核

拒绝:i. 不正确或具有误导性。ii. 构造很差,已超出修订范围。iii. 与 TSN 无关。iv. 信息不完整。v. 过于依赖论文。vi. 重复问题。

保留 TSN 缩写;六个拒绝条件均已对应翻译。“Poorly constructed beyond revision”译为“构造很差,已超出修订范围”准确表达不可通过修改挽回的含义。未发现明显问题。

P100已复核

不正确或具有误导性。

该段是 P099 子项的拆分重复;译文忠实对应。未发现明显问题。

P101已复核

构造很差,已超出修订范围。

“beyond revision” 表示超出可修订范围,即无法通过修改挽救;译法与 P099 保持一致。未发现数字、公式或缩写风险。

P102已复核

与 TSN 无关。

TSN 保留为 Time-Sensitive Networking 的缩写;未发现明显问题。

P103已复核

信息不完整。

原文为简短判定项,无数字、公式或缩写;未发现明显问题。

P104已复核

过于依赖论文。

“paper-dependent” 指题目或内容对特定论文依赖过强;未发现明显问题。

P105已复核

重复问题。

原文为分类标签,译为“重复问题”准确;未发现明显问题。

P106已复核

修订:i. 语法、清晰度或措辞方面存在轻微问题。ii. 选项需要改进。iii. 解释需要完善。

保留 i、ii、iii 的枚举结构;“Options” 按多项选择题语境译为“选项”;未发现明显问题。

P107已复核

语法、清晰度或措辞方面存在轻微问题。

与 P106 中第 i 项一致;未发现明显问题。

P108已复核

选项需要改进。

“Options” 译为“选项”符合题目评审语境;未发现明显问题。

P109已复核

解释需要完善。

“refinement” 译为“完善”保留了需进一步打磨的含义;未发现明显问题。

P110已复核

存疑:i. 仅针对特定论文,或对问题正确性不确定。ii. 解释看起来有疑问。iii. 需要进一步澄清。对于存疑的多项选择题,我们会阅读对应的研究论文并重新评估该问题。之后,决定可以是接受、拒绝或修订;如果它仍然存疑,我们会将其发送给另一位专家评审者,以便作出基于共识的小组决定。

“Doubtful” 译为“存疑”适合作为评审标签;“multiple-choice question” 译为“多项选择题”;accept/reject/revise 分别译为“接受、拒绝或修订”;“expert reviewer” 统一译为“专家评审者”。逻辑为先读论文并重新评估,再作决定,若仍存疑则转交另一位专家评审者形成共识;未发现明显问题。

P111需人工复核

仅针对特定论文,或对问题正确性不确定。

原句较短,是“存疑”类别下的第 i 项;“Paper-specific”译为“仅针对特定论文”。上下文可能缺失,需人工确认该短语在表格中的准确语义。

P112已复核

解释看起来有疑问。

未发现明显问题。

P113已复核

需要进一步澄清。

未发现明显问题。

P114已复核

审查数据集时遵循的关键原则:我们确保这些 MCQA 题目在技术上准确,并且与 TSN 基础知识保持一致。我们避免设置刁钻问题,并且更看重清晰而非复杂。同一套规则被提供给所有参与该数据集工作的专家评审者,他们同时担任人工评判者。审查之后,领域专家修订了 185 个问题,如表 5 所示。

MCQA 保留为缩写,语义为多项选择问答;TSN 保留。数字 185、表 5 已保留。逻辑关系“审查之后”明确。未发现明显问题。

P115已复核

我们在下方给出来自我们的 MCQA 数据集的三个代表性示例问题。

数字“三个”已保留;MCQA 缩写保留。未发现明显问题。

P116已复核

Q1(TSN 关键词):在 TSN 流量管理中,TAS 代表什么?A. 传输访问调度器。B. 流量分析系统。C. 时间感知整形器。D. 流量准入服务。正确答案:C

TAS 在 TSN 中通常对应 Time-Aware Shaper,译为“时间感知整形器”合理。选项 A-D 与正确答案 C 已保留。题目和选项来自样例题,格式较紧凑但信息完整。未发现明显问题。

P117已复核

Q2(研究论文):在线性拓扑的循环排队与转发(Cyclic Queuing and Forwarding, CQF)网络中,当每台交换机内经常排队着最大传输单元(Maximum Transmission Unit, MTU)大小的帧时,是什么根本性限制使可靠性帧复制与消除(Frame Replication and Elimination for Reliability, FRER)无法实现有效的容错?A. CQF 的乒乓队列切换会与 FRER 的帧消除机制产生时序冲突。B. EMI 会同等地破坏原始帧和复制帧,从而使空间冗余失效。C. FRER 无法检测由 EMI 引起的比特错误,因为它缺乏循环冗余校验(Cyclic Redundancy Check, CRC)验证能力。D. 线性拓扑无法提供 FRER 的空间冗余方法所需的不相交路径,因而不得不增加昂贵的硬件。正确答案:D

CQF、MTU、FRER、EMI、CRC 缩写均已保留并补充中文。正确答案 D 已保留。题干较长,限定条件“linear topology”“each switch”“MTU sized frames frequently queued”均已翻译。未发现明显问题。

P118已复核

Q3(研究论文):尽管时间感知整形器(Time Aware Shaper, TAS)能够提供有保证的端到端时延,是什么根本性挑战使其实现变得复杂?A. 需要将所有网络设备同步到一个共同的时间参考。B. 需要同时为每个流量类别维护单独的队列。C. 难以估计可变长度帧的最坏情况传输时间。D. 门控控制列表的综合,这是一个 NP-complete 问题。正确答案:D

TAS 术语已保留。NP-complete 保留英文术语,避免误译;也可译为“NP 完全问题”。正确答案 D 已保留。未发现明显问题。

P119已复核

对于 CBS 和 CQF 机制,WCD 计算使用了两种不同的方法。NC 用于计算 CBS 的 WCD,而解析式数学计算用于求出 CQF 机制的 WCD。由于这两种机制的工作方式不同,我们设计了针对每种机制定制的提示词。

CBS、CQF、WCD、NC 均保留为缩写;WCD 可能指 worst-case delay,NC 可能指 network calculus,但原文未展开,未擅自补充。逻辑关系“whereas”“Since”已体现。未发现明显问题。

P120已复核

角色:我们首先定义模型的角色:“你是一名专家级时间敏感网络(TSN)编排器。” 我们注入三个网络输入:(i)网络拓扑,(ii)TSN 流信息,以及(iii)流的路由。我们使用 prompt-as-program(Reynolds and McDonell, 2021)方法来分离网络拓扑、流信息和流路由。所有这些内容都以文本格式提供。然而,为了评估不同的拓扑、流和路由,我们将它们与提示词逻辑分离开来。这确保提示词在不同网络拓扑和参数之间保持相同。

引用 Reynolds and McDonell, 2021 已保留;三项输入编号与内容完整。prompt-as-program 保留英文术语以避免误译。逻辑转折“However”已翻译。未发现明显问题。

P121已复核

常量:为了正确计算 WCD,需要有关网络参数的信息。为防止模型自行假设这些数值,并使所有模型中的常量值保持一致,我们在提示词中提供这些信息。

术语 WCD 保留缩写;“network parameters”译为“网络参数”;逻辑为“提供常量以避免模型假设并保持一致”。未发现明显问题。

P122需人工复核

CBS 开放式问题的常量:Bandwidth = 100 Mbps,Propagation delay = 1 μs,Switching delay = 1 μs,Time synchronization error = 1 μs,网络中的交换机为 cut-through switches,IdleSlope = 75%。

原文存在字符间异常空格与重复公式文本,已按可识别含义合并;Bandwidth、Propagation delay、Switching delay、Time synchronization error、IdleSlope 等参数名保留英文以避免指标歧义;cut-through switches 可译为“直通式交换机”,但为保持术语精确暂保留英文。因原文抽取格式异常,需人工复核。

P123已复核

通过控制这些网络参数,我们直接缓解了关于数值的幻觉和假设。

“hallucinations and assumptions about numerical values”译为“关于数值的幻觉和假设”;因果关系清晰。未发现明显问题。

P124已复核

架构限制:TSN 支持多种会影响流的服务质量(QoS)和 WCD 的架构。提示词通过以下指令限制模型只能使用一种 TSN 机制。

QoS、WCD 缩写保留;“multiple architectures”译为“多种架构”;“through the following directive”译为“通过以下指令”。未发现明显问题。

P125已复核

对于 CBS 机制,我们使用:

CBS 保留缩写;该段为引出后续指令的短句。未发现明显问题。

P126已复核

TSN 机制:仅允许使用 Credit-Based Shaper(CBS,IEEE 802.1Qav);所有流均为 AVB Class A,PCP = 6,并且仅使用队列 6。

Credit-Based Shaper、CBS、IEEE 802.1Qav、AVB Class A、PCP、队列编号均保留;“Only ... is allowed”译为“仅允许使用”。未发现明显问题。

P127已复核

对于 CQF 机制,我们使用:

CQF 保留缩写;该段为引出后续指令的短句。未发现明显问题。

P128已复核

TSN 机制:仅允许使用 Cyclic Queuing and Forwarding(CQF,IEEE 802.1Qch);所有流均为 TT,PCP = 7,并且仅使用队列 7(奇数)和队列 6(偶数)。

Cyclic Queuing and Forwarding、CQF、IEEE 802.1Qch、TT、PCP、队列编号均保留;“odd/even”译为“奇数/偶数”。未发现明显问题。

P129已复核

我们的理由是,让模型选择 TSN 架构或机制属于另一个基准测试问题,在那类问题中,模型按其架构设计表现接受评估。在 TSNBench 中,我们的目标是在 TSN 领域对 LLM 进行基准测试。如果没有明确限制,模型可能会选择错误或不适当的机制,从而产生一种无法满足各流 QoS 需求的幻觉架构。该限制迫使模型在单一的解空间内作答。它进一步确保不同模型给出的 WCD 结果不受架构错误或机制选择歧义的影响,而只反映指定机制内部的计算和实现错误。

“architecture design performance”译为“架构设计表现”;“hallucinated architecture”译为“幻觉架构”;“architectural faults”译为“架构错误”;逻辑为先说明机制选择是另一类任务,再解释限制可排除架构或机制选择因素。未发现明显问题。

P130已复核

结构化输出:我们通过提示词指示模型严格以 JSON 格式提供输出(Yang 等,2026)。

Structured Output 译为“结构化输出”;JSON 保留;引用 Yang et al., 2026 译为“Yang 等,2026”。未发现明显问题。
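
下面是一个解析并校验模型 JSON 输出的草图。本节没有给出论文实际使用的 JSON 模式,因此这里的字段名 `wcd_us`、`confidence` 以及函数名 `parse_model_output` 都是假设的,仅用来说明结构化输出如何便于自动评分。

```python
import json


def parse_model_output(raw_text):
    """解析模型输出,返回 ({流编号: WCD(us)}, 置信度)。

    假设的输出格式:
    {"wcd_us": {"F0": 714.65, ...}, "confidence": 0.8}
    """
    data = json.loads(raw_text)           # 非法 JSON 会抛出 json.JSONDecodeError
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("置信度必须在 0.0 到 1.0 之间")
    wcd = {}
    for flow_id, value in data["wcd_us"].items():
        value = float(value)
        if value < 0:
            raise ValueError(f"{flow_id} 的 WCD 不能为负")
        wcd[flow_id] = value
    return wcd, confidence


if __name__ == "__main__":
    raw = '{"wcd_us": {"F0": 714.65, "F1": 821.79}, "confidence": 0.8}'
    print(parse_model_output(raw))
```

把格式校验放在评分之前,可以区分“输出格式不合规”和“数值算错”这两类失败。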

P131已复核

对于开放式问题,有三个可变条目:网络拓扑、流信息和流路由。我们使用 K-最短路径算法来确定各个流的路由。随后,这些路由被直接提供给模型,作为进一步评估的输入。

术语“K-shortest path algorithm”译为“K-最短路径算法”合适;“flows”按 TSN 语境译为“流”;逻辑为先确定路由再输入模型,未发现明显问题。
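
下面用一个小例子示意如何在小规模拓扑上枚举 K 条最短路径。本节没有说明论文采用的具体实现;这里使用的是按跳数的最佳优先搜索来枚举无环路径,并不是 Yen 算法,而且在大图上可能很慢。节点名和函数名 `k_shortest_paths` 都是假设的。

```python
import heapq


def k_shortest_paths(adjacency, source, target, k):
    """返回从 source 到 target 的至多 k 条无环路径,按跳数从少到多排列。

    adjacency: {节点: [相邻节点, ...]};每条链路的代价按 1 跳计算。
    """
    found = []
    heap = [(0, [source])]                 # (跳数, 当前路径)
    while heap and len(found) < k:
        hops, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            found.append(path)
            continue
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in path:       # 只保留无环路径
                heapq.heappush(heap, (hops + 1, path + [neighbor]))
    return found


if __name__ == "__main__":
    ring = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
    for p in k_shortest_paths(ring, "A", "D", 2):
        print(p)
```

由于堆按跳数出队,先到达目标的路径一定不比后到达的长;跳数相同时按节点名的字典序决定先后。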

P132已复核

所使用的网络拓扑:对于开放式问题,我们选择了三种不同的拓扑来评估模型:单交换机拓扑、中等网状拓扑和工业环形拓扑。图 15、图 16 和图 17 分别表示 TSNBench 中使用的单交换机拓扑、中等网状拓扑和环形拓扑。

“one-switch topology”“medium-mesh topology”“industrial ring topology”分别译为“单交换机拓扑”“中等网状拓扑”“工业环形拓扑”合理;图号 15、16、17 与对应顺序一致,未发现明显问题。

P133已复核

流参数:我们如下展示 TSNBench 中使用的流信息。

“Flow parameters”和“flow information”分别译为“流参数”和“流信息”符合上下文;该段引出后续数据,未发现明显问题。

P134需人工复核

流信息 TC1_flows.txt 0,node2_1,node5_2,2500,709,965 1,node5_4,node3_2,2500,610,825 2,node0_4,node0_1,1000,786,887 3,node2_3,node4_3,2500,1088,1233 4,node0_4,node3_3,1000,1015,488 5,node0_4,node0_1,2500,926,501...

该段主要为文件名与逗号分隔数据,保留原始节点名、数字和省略号更稳妥;但缺少列名解释,且末尾为截断数据“...”,表格上下文不完整。

P135已复核

真实 WCD 值 CBS 机制下所有开放式测试用例中各个流的真实 WCD 值,是使用一个已验证的 NC 工具计算得到的(Zhao et al., 2018; Debnath et al., 2025c; Gavriluţ and Pop, 2020)。对于 CQF 机制的 WCD,我们使用式 22 中给出的数学方程。

“Ground Truth WCD Values”译为“真实 WCD 值”;CBS、CQF、WCD、NC 均保留缩写;引用年份与标注保持一致;“Eq. 22”译为“式 22”合理,未发现明显问题。

P136已复核

我们在 TSNBench 上评估了开源和闭源的最先进大语言模型。表 6 给出了模型的详细列表及其模型编号和快照,以确保社区能够复现这些结果。

“state-of-the-art LLMs”译为“最先进大语言模型”;“model numbers and snapshots”译为“模型编号和快照”保持字面含义;表号 6 正确,未发现明显问题。

P137已复核

我们在两种不同配置下评估模型:(i)默认温度设置(0.7),以及(ii)温度设置为 0.0;MCQA 和开放式问题均采用这两种配置。在安全关键网络中,我们希望确保结果具有确定性,因此我们评估当温度设置为 0.0 时,LLM 是否能够给出一致的结果。对于不支持 temperature 参数的模型,我们使用默认温度进行评估。

温度数值 0.7 和 0.0 保持准确;MCQA、LLM、temperature 参数保留;原文 “As in safety-critical networks, ...” 为不完整的从句,译文将其与后一句合并为因果句。未发现影响技术含义的明显问题。

P138已复核

表 7 给出了在默认温度和温度设置为 0.0 时,模型在 MCQA 数据集上的准确率和平均一致性。平均一致性表示模型在三次运行中给出相同结果的能力。

“accuracy”译为“准确率”,“average consistency”译为“平均一致性”;“three runs”译为“三次运行”;表号 7 正确,未发现明显问题。
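
准确率和一致性可以按下面的方式粗略计算。本节没有给出“平均一致性”的精确公式,这里假设它是“三次运行答案完全相同的题目所占比例”,这只是一种可能的解读;示例中的答案数据和函数名 `accuracy`、`all_runs_agree_rate` 也是假设的。

```python
def accuracy(answers, key):
    """单次运行的准确率:答对的题目数除以总题数。"""
    if len(answers) != len(key):
        raise ValueError("答案数量与题目数量不一致")
    correct = sum(1 for a, k in zip(answers, key) if a == k)
    return correct / len(key)


def all_runs_agree_rate(runs):
    """各次运行答案完全相同的题目所占比例(假设的一致性定义)。"""
    per_question = list(zip(*runs))        # 每个元素是一道题在各次运行中的答案
    agree = [len(set(answers)) == 1 for answers in per_question]
    return sum(agree) / len(agree)


if __name__ == "__main__":
    key = ["C", "D", "D", "B"]             # 假设的标准答案
    runs = [
        ["C", "D", "D", "A"],              # 假设的第 1 次运行
        ["C", "D", "B", "A"],              # 假设的第 2 次运行
        ["C", "D", "D", "A"],              # 假设的第 3 次运行
    ]
    print([accuracy(r, key) for r in runs], all_runs_agree_rate(runs))
```

注意一致性高并不意味着准确率高:示例中第 4 题三次运行都给出相同的错误答案。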

P139已复核

模型的成本和延迟是研究社区的重要评估参数。在基准评估上花费大量资金,是研究团队面临的一个现实瓶颈。此外,并非所有模型都能够在本地进行评估。表 8 展示了 TSNBench 的 MCQA 和开放式问题的成本与延迟。评估 MCQA 相比评估开放式问题要便宜得多。

“cost and latency”译为“成本和延迟”;“research groups”译为“研究团队”;“relatively much cheaper”译为“要便宜得多”符合语义;表号 8 正确,未发现明显问题。

P140已复核

我们为开放式问题提供 MAE 和 MAPE 评估。一个示例计算如下:

MAE、MAPE 保留指标缩写;该段引出后续公式或计算示例,但当前段落未包含具体公式,因此本段自身未发现明显问题。

P141已复核

MAE 和 MAPE 计算示例:考虑一个在三个测试用例(TCs)上进行评估的模型。这三个 TC 可能具有不同的拓扑、不同的流和流参数,以及不同的路由。对于每个 TC,我们有表 9 中所示的真实值和预测的 WCD 值。真实值是针对 CBS 使用一个 NC 求解器计算得到的,并且针对 CQF 使用一个数学方程计算得到的。

术语 MAE、MAPE、TC、WCD、CBS、NC、CQF 均保留并翻译了上下文含义;数字“三个”、表 9 未遗漏;“ground-truth”译为“真实值”较合适。未发现明显问题。

P142需人工复核

按 TC 计算的 MAE:假设 TC1、TC2 和 TC3 分别包含三个、两个和三个流:TC1 = {f0, f1, f2},TC2 = {f0, f1},TC3 = {f0, f1, f2}。令 Γ(f0) 表示 TC1 中流 f0 的绝对误差,β(f0) 表示 LLM 给出的流 f0 的预测 WCD,Ω(f0) 表示流 f0 的真实值。Γ(f0) 的计算方式为:Γ(f0)=|β(f0)-Ω(f0)|。在给定示例中,对于 TC1,令 Γ(f0)=12、Γ(f1)=30、Γ(f2)=10。类似地,对于 TC2,Γ(f0)=8 且 Γ(f1)=45;对于 TC3,Γ(f0)=20、Γ(f1)=15、Γ(f2)=0。TC1、TC2 和 TC3 的 MAE 分别记为 MAE_TC1、MAE_TC2 和 MAE_TC3,计算如下:MAE_TC1=(12+30+10)/3=17.3 μs;MAE_TC2=(8+45)/2=26.5 μs;MAE_TC3=(20+15+0)/3=11.7 μs。

公式 Γ(f0)=|β(f0)-Ω(f0)|、三个 TC 的流集合、误差数值和 MAE 结果均已保留;单位 μs 保留。原文将流集合写作 {f1, f2, f3} ∈ TC1 等,而后续公式使用 f0、f1、f2;译文已将集合下标统一为从 f0 开始,并把“∈”改写为集合等式,建议人工对照原文确认。

P143已复核

对于每个模型,我们有 100 个测试用例,并且最终 MAE 是在所有测试用例上取平均值(在此示例中为 3 个测试用例),表示为:MAE=(17.3+26.5+11.7)/3=18.5 μs。记为 α(f0) 的按流计算的 MAPE 按如下方式计算:α(f0)=|β(f0)-Ω(f0)|/Ω(f0)×100。对于 TC1,我们按如下方式计算 MAPE:MAPE_TC1=(α(f0)+α(f1)+α(f2))/3=8.7%。类似地,TC2 和 TC3 的 MAPE 给出如下:MAPE_TC2=11.5%;MAPE_TC3=3.7%。每个模型的最终 MAPE 是在这 3 个测试用例上取平均值:MAPE=(8.7+11.5+3.7)/3=8.0%。在 TSNBench 中,无论网络中的流数量是多少,所有测试用例都对模型性能作出同等贡献。根据网络架构,所有流都同等关键,并且需要相同的优先对待。这确保了对于每个网络场景,所有流都被赋予相同权重。

100 个测试用例、示例中的 3 个测试用例、MAE/MAPE 公式与数值均已保留;α(f0)、β(f0)、Ω(f0) 符号未丢失。原文 “all test cases contributes” 有语法问题,但不影响含义;“all flows are weighted equally” 与前句“所有测试用例同等贡献”在层级上可能需要结合全文理解。未发现明显翻译问题。
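
上面的 MAE 和 MAPE 示例可以用几行代码复算。下面的草图先对每个测试用例内的流取平均,再对测试用例取平均,使每个测试用例的权重相同;函数名 `per_tc_mean`、`macro_average`、`flow_ape` 是为说明而引入的。

```python
def mean(values):
    """算术平均值。"""
    return sum(values) / len(values)


def per_tc_mean(values_by_tc):
    """对每个测试用例内部的逐流指标取平均。"""
    return {tc: mean(vals) for tc, vals in values_by_tc.items()}


def macro_average(per_tc):
    """对各测试用例的指标再取平均,每个测试用例权重相同。"""
    return mean(list(per_tc.values()))


def flow_ape(predicted, truth):
    """单条流的绝对百分比误差:|预测值 - 真实值| / 真实值 * 100。"""
    return abs(predicted - truth) * 100 / truth


if __name__ == "__main__":
    abs_errors = {"TC1": [12, 30, 10], "TC2": [8, 45], "TC3": [20, 15, 0]}
    mae_by_tc = per_tc_mean(abs_errors)
    print({tc: round(v, 1) for tc, v in mae_by_tc.items()})
    print(round(macro_average(mae_by_tc), 1))
    print(round(mean([8.7, 11.5, 3.7]), 1))    # 示例中的最终 MAPE
```

如果改为把所有流汇总后一次性取平均,流数较多的测试用例会占更大权重,这与文中“所有测试用例同等贡献”的做法不同。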

P144已复核

测试用例:TC1。TSN 机制:CBS。你是一名专家级时间敏感网络(TSN)编排器。你的任务是计算每条 TSN 流的最坏情况时延(WCD)。输入:网络拓扑(TC1_topo.txt);流信息(TC1_flows.txt);流的路由(TC1_route.txt)。常量:链路带宽 = 100 Mbps;传播时延 = 1 μs;交换时延 = 1 μs;时间同步误差 = 1 μs;网络中的交换机是直通式交换机。TSN 机制:只允许使用基于信用的整形器(CBS,IEEE 802.1Qav);所有流都是 AVB Class A,PCP = 6,并且只使用队列 6。任务:1. 使用给定的拓扑、流和流的路由,映射每个出口端口的队列,并收集从该端口经过的流集合。2. 对于每个出口端口,使用给定的 IdleSlope,然后计算 SendSlope。3. 对于每条流,根据其帧大小和周期性构造到达曲线。4. 对于每个端口,推导一个下界 CBS 服务曲线。5. 使用网络演算方法计算每条流的最坏情况时延(WCD),单位为微秒(μs)。6. 根据你的答案提供 0.0 到 1.0 之间的置信度分数。1.0 表示可由给定信息在零歧义的情况下通过数学或程序过程证明。0.0 表示零置信度。

文件名、常量数值、单位、IEEE 802.1Qav、AVB Class A、PCP=6、queue 6、IdleSlope、SendSlope、Network Calculus 均已保留或准确翻译。该段是任务提示,缺少实际拓扑/流/路由文件上下文,但翻译本身未发现明显问题。

P145已复核

使用给定的拓扑、流和流的路由,映射每个出口端口的队列,并收集从该端口经过的流集合。

“egress port” 译为“出口端口”,“traversing from that port” 译为“从该端口经过”保留了方向含义;未发现明显问题。

P146已复核

对于每个出口端口,使用给定的 IdleSlope,然后计算 SendSlope。

IdleSlope 和 SendSlope 为 CBS 相关术语,保留英文形式较稳妥;未发现明显问题。

P147已复核

对于每条流,根据其帧大小和周期性构造到达曲线。

“arrival curve” 译为“到达曲线”,“frame size and periodicity” 译为“帧大小和周期性”;未发现明显问题。

P148已复核

对于每个端口,推导一个下界 CBS 服务曲线。

“lower-bounded CBS service curve” 译为“下界 CBS 服务曲线”,保留了服务曲线具有下界约束的含义;未发现明显问题。

P149已复核

使用网络演算方法计算每条流的最坏情况时延(WCD),单位为微秒(μs)。

WCD、Network Calculus、单位 μs 均已保留;未发现明显问题。

P150已复核

根据你的答案提供 0.0 到 1.0 之间的置信度分数。1.0 表示可由给定信息在零歧义的情况下通过数学或程序过程证明。0.0 表示零置信度。

数值范围 0.0 到 1.0、端点含义均准确保留;未发现明显问题。

P151需人工复核

Grok 4.1 Fast(非推理)输出:F0: 1452.0, F1: 1124.0, F2: 678.0, F3: 1234.0, F4: 1567.0, F5: 689.0, F6: 456.0, F7: 789.0, F8: 567.0, F9: 589.0, F10: 678.0, F11: 645.0, F12: 1123.0, F13: 789.0, F14: 456.0, F15: 678.0, F16: 789.0, F17: 1567.0, F18: 456.0, F19: 1234.0。模型推理与解释输出:使用的参数:`send_slope`:`75% × 100 Mbps = 75 Mbps = 75000 bits/μs`;`frame_overhead`:24 字节(以太网头部 + FCS);`mtu`:1522 字节(最大帧大小);`queue`:6(PCP=6,AVB Class A);`cbs_credit_max`:`MTU × 8 = 12176` 比特。模型给出的 WCD 方程:`α(t)=L_max/Period+t/Period`(到达曲线);`β(t)=R_send × t - C`(CBS 服务曲线);`WCD=h × (prop_delay + switch_delay + sync_error) + max(α ∩ β)`。`SendSlope = IdleSlope × Bandwidth = 0.75 × 100 Mbps`;`R_send = SendSlope = 75 Mbps`;`C = credit_max = MTU × 8 bits`;`h = 来自路径的跳数`。每条流的 WCD:`WCD_Fi = Σ_{h∈path(Fi)}(d_prop^h + d_switch^h + d_sync^h) + max_{t≥0}(α_Fi(t) ∩ β_port(t))`。

数字、流编号和参数名均已保留。原文中 `75000 bits/μs` 与 `75 Mbps` 的单位换算存在明显风险,按原文保留;`α(t)` 方程疑似缺少乘法或括号,`max(α ∩ β)` 语义不明确,均需结合上下文判断。`send_slope`、CBS、WCD、PCP、AVB Class A 等术语已保留。

P152已复核

Grok 4.1 Fast(非推理)输出:

仅为输出标题,模型名和“非推理”含义已保留,未发现明显问题。

P153已复核

F0: 1452.0, F1: 1124.0, F2: 678.0, F3: 1234.0, F4: 1567.0, F5: 689.0, F6: 456.0, F7: 789.0, F8: 567.0, F9: 589.0, F10: 678.0, F11: 645.0, F12: 1123.0, F13: 789.0, F14: 456.0, F15: 678.0, F16: 789.0, F17: 1567.0, F18: 456.0, F19: 1234.0。

20 个流编号 F0-F19 及对应数值已逐项保留;原文未给出单位,译文未擅自补充单位,未发现明显问题。

P154需人工复核

模型第 2 次运行(Grok 4.1 Fast(非推理)):模型在第 2 次运行期间给出的 WCD 方程:帧大小:`L_i = payload_i + 24 bytes`;传输时间:`T_xi = 8L_i/R`;到达曲线:`α_i(t) = (L_i/p_i) · t`;CBS 服务曲线:`β(t)=min{α_hi(t), sendSlope · t + credit_max}`;最坏情况时延:`WCD_i = max_{h∈path} Σ(T_prop,h + T_switch,h + T_queue,h) + T_sync`;排队时延:`T_queue,h = L_i/(αC) + burst_interference`;发送斜率:`sendSlope = idleSlope · R - idleSlope · R = αR(1 - 2α)`;每条流的 WCD:`WCD_i = hop_count_i · (1 + 1) + max_queue_delay + 1`。专家解释:即使温度设置为 0.0,该模型在不同运行中也会使用不同方程。在分析第一次运行时,模型在评估被分析队列所经历的干扰/阻塞时犯了若干根本性错误,也就是在到达曲线和 CBS 服务曲线两方面都存在错误。首先,它假设了错误的最大突发大小,并且没有考虑沿路径后续跳处变化的输出到达曲线。其次,它假设了非零最大信用值;在该场景中,由于较低优先级、非 CBS 流量的非抢占帧,这一点是可能的。然而,模型将错误的最大信用值应用到了 CBS 服务曲线,并错误理解了相应速率(75%)之间的关系。此外,在最终 WCD 计算中,模型只考虑了被分析帧的到达曲线。这是不正确的:正确方法需要使用通过该队列传输的所有帧的聚合到达曲线。也不清楚 `max(α ∩ β)` 操作是否表示 `α` 与 `β` 之间的最大水平偏差。输出到达曲线没有被纳入后续跳的计算。另一方面,第二次运行表现出更严重的幻觉,给出了错误的到达曲线计算,没有考虑任何突发大小;还给出了错误的 CBS 服务曲线,考虑了错误的服务速率(sendSlope)以及相应速率与最大信用值之间的错误关系,并且同样也没有考虑输出到达曲线等。GPT-5 输出:F0: 714.65, F1: 821.79, F2: 171.48, F3: 687.17, F4: 907.26, F5: 171.48, F6: 183.9, F7: 453.91, F8: 611.08, F9: 181.3, F10: 611.08, F11: 321.35, F12: 1124.12, F13: 483.67, F14: 172.43, F15: 489.26, F16: 521.33, F17: 918.73, F18: 181.3, F19: 624.12。

原文抽取中,`payload_i + 24 bytes` 之后附带一份丢失首位数字的残缺副本 `payload_i + 4 bytes`,译文已删除该副本;`sendSlope = idleSlope · R - idleSlope · R = αR(1 - 2α)` 公式逻辑疑似不一致,属于被引用的模型输出或抽取残缺,按原样保留;`α_hi(t)`、`max(α ∩ β)`、`T_xi` 等符号可能存在识别或排版残缺。段落合并了第 2 次运行、专家解释和 GPT-5 输出,译文按原文顺序保留。

P155已复核

模型第 2 次运行(Grok 4.1 Fast(非推理)):

仅为小节标题,模型名、运行次数和“非推理”已保留,未发现明显问题。

P156需人工复核

模型在第 2 次运行期间给出的 WCD 方程:帧大小:`L_i = payload_i + 24 bytes`;传输时间:`T_xi = 8L_i/R`;到达曲线:`α_i(t) = (L_i/p_i) · t`;CBS 服务曲线:`β(t)=min{α_hi(t), sendSlope · t + credit_max}`;最坏情况时延:`WCD_i = max_{h∈path} Σ(T_prop,h + T_switch,h + T_queue,h) + T_sync`;排队时延:`T_queue,h = L_i/(αC) + burst_interference`;发送斜率:`sendSlope = idleSlope · R - idleSlope · R = αR(1 - 2α)`;每条流的 WCD:`WCD_i = hop_count_i · (1 + 1) + max_queue_delay + 1`。

原文抽取中,`payload_i + 24 bytes` 之后附带一份丢失首位数字的残缺副本 `payload_i + 4 bytes`,译文已删除该副本;多个公式疑似存在识别或数学逻辑问题,尤其是 `sendSlope` 等式、`α_hi(t)` 与 `max_{h∈path} Σ(...)` 的写法,均属于被引用的模型输出,按原样保留。术语 WCD、CBS、sendSlope、credit_max 已保留。

P157已复核

专家解释:

仅为标题,未发现明显问题。

P158需人工复核

即使温度设置为 0.0,该模型在不同运行中也会使用不同方程。在分析第一次运行时,模型在评估被分析队列所经历的干扰/阻塞时犯了若干根本性错误,也就是在到达曲线和 CBS 服务曲线两方面都存在错误。首先,它假设了错误的最大突发大小,并且没有考虑沿路径后续跳处变化的输出到达曲线。其次,它假设了非零最大信用值;在该场景中,由于较低优先级、非 CBS 流量的非抢占帧,这一点是可能的。然而,模型将错误的最大信用值应用到了 CBS 服务曲线,并错误理解了相应速率(75%)之间的关系。此外,在最终 WCD 计算中,模型只考虑了被分析帧的到达曲线。这是不正确的:正确方法需要使用通过该队列传输的所有帧的聚合到达曲线。也不清楚 `max(α ∩ β)` 操作是否表示 `α` 与 `β` 之间的最大水平偏差。输出到达曲线没有被纳入后续跳的计算。另一方面,第二次运行表现出更严重的幻觉,给出了错误的到达曲线计算,没有考虑任何突发大小;还给出了错误的 CBS 服务曲线,考虑了错误的服务速率(sendSlope)以及相应速率与最大信用值之间的错误关系,并且同样也没有考虑输出到达曲线等。

逻辑转折“First/Second/However/Furthermore/On the other hand”均已保留。`non-preemption frame of lower priority of non-CBS traffic` 译为“较低优先级、非 CBS 流量的非抢占帧”,但该短语结构较复杂,需结合 TSN 语境确认。`max(α ∩ β)` 语义不明确,已按原文保留。

P159已复核

GPT-5 输出:

仅为输出标题,模型名已保留,未发现明显问题。

P160已复核

F0: 714.65, F1: 821.79, F2: 171.48, F3: 687.17, F4: 907.26, F5: 171.48, F6: 183.9, F7: 453.91, F8: 611.08, F9: 181.3, F10: 611.08, F11: 321.35, F12: 1124.12, F13: 483.67, F14: 172.43, F15: 489.26, F16: 521.33, F17: 918.73, F18: 181.3, F19: 624.12。

20 个流编号 F0-F19 及对应数值已逐项保留;原文未给出单位,译文未擅自补充单位,未发现明显问题。

P161需人工复核

模型推理与解释输出:sendSlope:\(\mathrm{sendSlope}=-25~\mathrm{Mbps}\);到达曲线:\(b=\sum_{\forall F_i|\text{共享输出端口}}L_i~\mathrm{bits}\),\(r=\sum_{\forall F_i|\text{共享输出端口}}\frac{L_i}{p_i}~\mathrm{bits}/\mu\mathrm{s}\),\(\alpha(t)=b+rt\);\(T_{\mathrm{tech,total}}=T_{\mathrm{prop}}+T_{\mathrm{switching}}+T_{\mathrm{sync}}=hops\times1~\mu\mathrm{s}+(hops-1)\times1~\mu\mathrm{s}+(hops-1)\times1~\mu\mathrm{s}=13~\mu\mathrm{s}\);\(WCD_{\mathrm{total}}=WCD_{\mathrm{sum}}+T_{\mathrm{tech,total}}=701.65~\mu\mathrm{s}+13~\mu\mathrm{s}=714.65~\mu\mathrm{s}\)。专家解释:GPT 使用了正确的到达曲线和服务曲线。它的主要错误是未能在分析中包含输出端口(输出到达曲线)。Ministral 3 8B 输出:F0: 1250.0,F1: 1100.0,F2: 500.0,F3: 1500.0,F4: 600.0,F5: 1100.0,F6: 2500.0,F7: 1200.0,F8: 500.0,F9: 500.0,F10: 2000.0,F11: 500.0,F12: 1000.0,F13: 1000.0,F14: 2000.0,F15: 500.0,F16: 1500.0,F17: 1000.0,F18: 500.0,F19: 2000.0。模型推理与解释输出:带宽:\(R=100~\mathrm{Mbps}=12.5~\mathrm{MB/s}\);传播时延:\(T_{\mathrm{prop}}=1~\mu\mathrm{s}\) 每跳;交换时延:\(T_{\mathrm{switch}}=1~\mu\mathrm{s}\) 每个交换机;时间同步误差:\(T_{\mathrm{sync}}=1~\mu\mathrm{s}\);IdleSlope:\(\mathrm{IdleSlope}=0.75R=75~\mathrm{Mbps}=9.375~\mathrm{MB/s}\);直通交换:\(T_{\mathrm{buffer}}=0\);载荷开销:\(L_{\mathrm{overhead}}=18~\mathrm{B}+4~\mathrm{B}=22~\mathrm{B}\);最大帧大小:\(L_{\max}=1433~\mathrm{B}+22~\mathrm{B}=1455~\mathrm{B}\);最大传输时间:\(T_{\mathrm{tx,max}}=\frac{1455\times8}{100~\mathrm{Mbps}}=9.24~\mu\mathrm{s}\)。

原文抽取中,多数数值后附带一份丢失首位数字和小数点的残缺副本(例如 \(-25\) 被重复为 \(-5\),\(701.65\) 被重复为 \(0165\),\(9.375\) 被重复为 \(375\)),译文已删除这些残缺副本,仅保留完整数值,需对照原 PDF 核对。模型输出中的数值(例如 \(T_{\mathrm{tx,max}}=9.24~\mu\mathrm{s}\))属于被引用的模型输出,按原样保留。P161 还合并了后续的专家解释、模型输出和下一段推理内容,表格或抽取上下文可能缺失。

P162需人工复核

模型推理与解释输出:sendSlope:\(\mathrm{sendSlope}=-25~\mathrm{Mbps}\);到达曲线:\(b=\sum_{\forall F_i|\text{共享输出端口}}L_i~\mathrm{bits}\),\(r=\sum_{\forall F_i|\text{共享输出端口}}\frac{L_i}{p_i}~\mathrm{bits}/\mu\mathrm{s}\),\(\alpha(t)=b+rt\);\(T_{\mathrm{tech,total}}=T_{\mathrm{prop}}+T_{\mathrm{switching}}+T_{\mathrm{sync}}=hops\times1~\mu\mathrm{s}+(hops-1)\times1~\mu\mathrm{s}+(hops-1)\times1~\mu\mathrm{s}=13~\mu\mathrm{s}\);\(WCD_{\mathrm{total}}=WCD_{\mathrm{sum}}+T_{\mathrm{tech,total}}=701.65~\mu\mathrm{s}+13~\mu\mathrm{s}=714.65~\mu\mathrm{s}\)。

术语 sendSlope、到达曲线、\(T_{\mathrm{tech,total}}\)、\(WCD_{\mathrm{total}}\) 已保留。原文抽取中,每个数值后附带一份丢失首位数字和小数点的残缺副本(\(-25\) 被重复为 \(-5\),\(13\) 被重复为 \(3\),\(701.65+13=714.65\) 被重复为 \(0165+3=1465\)),译文已删除这些残缺副本,仅保留完整数值,需对照原 PDF 核对。

P163已复核

专家解释:

标题性短语,术语和结构无明显风险。

P164已复核

GPT 使用了正确的到达曲线和服务曲线。它的主要错误是未能在分析中包含输出端口(输出到达曲线)。

“arrival and service curves”译为“到达曲线和服务曲线”,“output port (output arrival curves)”译为“输出端口(输出到达曲线)”,逻辑完整,未发现明显问题。
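
GPT-5 输出中的聚合到达曲线 \(\alpha(t)=b+rt\) 可以按下面的方式构造,其中 \(b\) 是共享同一输出端口的各流帧长之和,\(r\) 是各流帧长除以周期之和。示例中的流参数和函数名 `aggregate_arrival_curve`、`alpha` 是假设的,这里也没有涉及后续跳所需的输出到达曲线。

```python
def aggregate_arrival_curve(flows):
    """flows: [(帧长 bits, 周期 us), ...],均为共享同一输出端口的流。

    返回 (b, r):突发量 b(bits)和长期速率 r(bits/us)。
    """
    b = sum(frame_bits for frame_bits, _ in flows)
    r = sum(frame_bits / period_us for frame_bits, period_us in flows)
    return b, r


def alpha(t_us, b, r):
    """聚合到达曲线 alpha(t) = b + r * t。"""
    return b + r * t_us


if __name__ == "__main__":
    flows = [(8000, 1000), (4000, 2000)]   # 假设的两条流
    b, r = aggregate_arrival_curve(flows)
    print(b, r, alpha(100, b, r))
```

使用聚合曲线而不是单条流的曲线,是因为同一队列中的其他流也会占用该队列获得的服务。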

P165已复核

Ministral 3 8B 输出:

模型名 Ministral 3 8B 保留,标题性短语无明显问题。

P166已复核

F0: 1250.0,F1: 1100.0,F2: 500.0,F3: 1500.0,F4: 600.0,F5: 1100.0,F6: 2500.0,F7: 1200.0,F8: 500.0,F9: 500.0,F10: 2000.0,F11: 500.0,F12: 1000.0,F13: 1000.0,F14: 2000.0,F15: 500.0,F16: 1500.0,F17: 1000.0,F18: 500.0,F19: 2000.0。

所有流编号 F0 至 F19 与数值均逐项保留。原文未给出单位,译文未擅自补充单位,未发现明显问题。

P167需人工复核

SendSlope:\(\mathrm{SendSlope}=\mathrm{IdleSlope}(1-\mathrm{IdleSlope})=9.375~\mathrm{MB/s}\times0.25=2.34375~\mathrm{MB/s}\);突发大小:\(B_{\mathrm{burst},i}=\mathrm{SendSlope}\cdot p_i\);CBS 服务曲线:\(S(t)=\mathrm{SendSlope}\left(t-\frac{B_{\mathrm{burst}}}{\mathrm{SendSlope}}\right)\),且 \(t\geq\frac{B_{\mathrm{burst}}}{\mathrm{SendSlope}}\)。WCD 计算:流 \(F_x\) 的到达曲线:\(R(t)=\min\{B_x,S_xt\}\);\(B_x=L_{\mathrm{payload},x}+L_{\mathrm{overhead}}\);\(S_x=\frac{8L_{\mathrm{payload},x}}{p_x}\)。每个端口的服务曲线:\(S(t)=\mathrm{SendSlope}\left(t-\frac{B_{\mathrm{burst}}}{\mathrm{SendSlope}}\right)\),且 \(t\geq\frac{B_{\mathrm{burst}}}{\mathrm{SendSlope}}\)。\(WCD=\max\left(0,\frac{B_{\mathrm{burst}}-\mathrm{SendSlope}\cdot Period}{\mathrm{SendSlope}}\right)+T_{\mathrm{prop}}+T_{\mathrm{switch}}+T_{\mathrm{sync}}\)。专家解释:即使它是一个小模型,该智能体也给出了很差的结果。与在 Grok 4.1 Fast (Non-Reasoning) 中观察到的情况类似,它未能考虑流的到达曲线,并且使用了错误的 CBS 服务曲线。此外,它只分析单个队列,而不是计算端到端路径上每个设备输出端口处的时延。而且,它引入了单位不一致,例如将以字节表示的帧大小除以以 \(\mathrm{bits}/\mu\mathrm{s}\) 给出的链路速率,这会导致错误结果。

术语 SendSlope、IdleSlope、Burst size、CBS service curve、WCD、arrival curve 已保留或准确翻译。原文抽取中,\(9.375~\mathrm{MB/s}\times0.25=2.34375~\mathrm{MB/s}\) 之后附带一份丢失首位数字和小数点的残缺副本(\(375~\mathrm{MB/s}\times25=34375~\mathrm{MB/s}\)),译文已删除该副本;\(\mathrm{IdleSlope}(1-\mathrm{IdleSlope})\) 在量纲上可疑,属于被引用的模型输出,按原样保留;\(\mathrm{bits}/\mu\mathrm{s}\) 原文写作 “bits/ μ \mu s”,存在抽取噪声。段落还合并了模型推理和专家解释,可能来自相邻块拼接。

P168需人工复核

SendSlope:\(\mathrm{SendSlope}=\mathrm{IdleSlope}(1-\mathrm{IdleSlope})=9.375~\mathrm{MB/s}\times0.25=2.34375~\mathrm{MB/s}\);突发大小:\(B_{\mathrm{burst},i}=\mathrm{SendSlope}\cdot p_i\);CBS 服务曲线:\(S(t)=\mathrm{SendSlope}\left(t-\frac{B_{\mathrm{burst}}}{\mathrm{SendSlope}}\right)\),且 \(t\geq\frac{B_{\mathrm{burst}}}{\mathrm{SendSlope}}\)。WCD 计算:流 \(F_x\) 的到达曲线:\(R(t)=\min\{B_x,S_xt\}\);\(B_x=L_{\mathrm{payload},x}+L_{\mathrm{overhead}}\);\(S_x=\frac{8L_{\mathrm{payload},x}}{p_x}\)。每个端口的服务曲线:\(S(t)=\mathrm{SendSlope}\left(t-\frac{B_{\mathrm{burst}}}{\mathrm{SendSlope}}\right)\),且 \(t\geq\frac{B_{\mathrm{burst}}}{\mathrm{SendSlope}}\)。\(WCD=\max\left(0,\frac{B_{\mathrm{burst}}-\mathrm{SendSlope}\cdot Period}{\mathrm{SendSlope}}\right)+T_{\mathrm{prop}}+T_{\mathrm{switch}}+T_{\mathrm{sync}}\)。

公式和缩写已尽量保留。原文抽取中,\(9.375~\mathrm{MB/s}\times0.25=2.34375~\mathrm{MB/s}\) 之后附带一份丢失首位数字和小数点的残缺副本(\(375~\mathrm{MB/s}\times25=34375~\mathrm{MB/s}\)),译文已删除该副本;SendSlope 计算式量纲可疑,属于被引用的模型输出,按原样保留;CBS 服务曲线和 WCD 公式可能需要结合表格或上下文确认。

P169已复核

专家解释:

标题性短语,术语和结构无明显风险。

P170需人工复核

即使它是一个小模型,该智能体也给出了很差的结果。与在 Grok 4.1 Fast (Non-Reasoning) 中观察到的情况类似,它未能考虑流的到达曲线,并且使用了错误的 CBS 服务曲线。此外,它只分析单个队列,而不是计算端到端路径上每个设备输出端口处的时延。而且,它引入了单位不一致,例如将以字节表示的帧大小除以以 \(\mathrm{bits}/\mu s\) 给出的链路速率,这会导致错误结果。

“small model”译为“小模型”,“agent”译为“智能体”,“arrival curves of flows”译为“流的到达曲线”,“wrong CBS service curve”按原意译为“错误的 CBS 服务曲线”。原文 “bits/ μ \mu s” 存在格式噪声,译为 \(\mathrm{bits}/\mu s\),需确认单位排版。
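
专家提到的单位不一致可以通过显式换算来避免:先把字节换成比特,再除以以 bits/μs 表示的链路速率,其中 1 Mbps 恰好等于 1 bit/μs。下面的函数名 `transmission_time_us` 是为说明而引入的;示例取 1003 字节的帧和 100 Mbps 的链路。

```python
def transmission_time_us(frame_bytes, link_rate_mbps):
    """返回一帧的传输时间(微秒)。

    1 Mbps = 10^6 bits/s = 1 bit/us,因此 Mbps 数值可直接当作 bits/us 使用。
    """
    if frame_bytes < 0 or link_rate_mbps <= 0:
        raise ValueError("帧长必须非负,链路速率必须为正")
    frame_bits = frame_bytes * 8           # 字节换算为比特
    bits_per_us = link_rate_mbps           # Mbps 换算为 bits/us
    return frame_bits / bits_per_us


if __name__ == "__main__":
    print(transmission_time_us(1003, 100))
```

如果漏掉乘以 8 这一步,得到的结果会比正确值小 8 倍。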

P171已复核

测试用例:TC1。TSN 机制:CQF。你是一名专业的时间敏感网络(Time-Sensitive Networking, TSN)编排器。你的任务是计算每条 TSN 流的最坏情况时延(worst case delay, WCD)。输入:网络拓扑(TC1_topo.txt)、流信息(TC1_flows.txt)、流的路由(TC1_route.txt)。常量:链路带宽 = 100 Mbps;传播时延 = 1 μs;交换时延 = 1 μs;时间同步误差 = 1 μs;网络中的交换机是直通式交换机。TSN 机制:仅允许循环排队与转发(Cyclic Queuing and Forwarding, CQF, IEEE 802.1Qch);所有流均为 TT,PCP = 7,仅使用队列 7(奇数)和队列 6(偶数)。任务:1. 使用给定的拓扑、流以及流路由,映射每个出口端口的队列,并收集经过该端口的流集合。2. 对整个网络,使用给定的周期时长并计算超周期(Hypercycle)。3. 对每条流,将该流从发送节点出发的偏移量或开始时间设为 0。4. 计算每条流以微秒(μs)为单位的最坏情况时延(WCD)。5. 根据你的答案提供 0.0 到 1.0 之间的置信度分数。1.0 表示可由给定信息以数学方式或程序方式证明且不存在歧义。0.0 表示零置信度。

术语 TSN、CQF、IEEE 802.1Qch、TT、PCP、WCD、Hypercycle 均已保留并翻译;数字、单位、队列编号 7/6、置信度范围 0.0-1.0 未遗漏。输入文本中 `μ \mu s`、文件名存在 OCR/LaTeX 混排,译文按可识别含义规范为 μs 和文件名。

P172已复核

使用给定的拓扑、流以及流路由,映射每个出口端口的队列,并收集经过该端口的流集合。

“egress port”译为“出口端口”,“queues”译为“队列”;动作“map”和“collect”均已保留。未发现明显问题。

P173已复核

对整个网络,使用给定的周期时长并计算超周期(Hypercycle)。

“cycle duration”译为“周期时长”,“Hypercycle”保留术语并译为“超周期”。未发现明显问题。

P174已复核

对每条流,将该流从发送节点出发的偏移量或开始时间设为 0。

数值 0 已保留;“offset or the start time”译为“偏移量或开始时间”,逻辑完整。未发现明显问题。

P175已复核

计算每条流以微秒(μs)为单位的最坏情况时延(WCD)。

WCD 缩写、单位 μs 均已保留。未发现明显问题。

P176已复核

根据你的答案提供 0.0 到 1.0 之间的置信度分数。1.0 表示可由给定信息以数学方式或程序方式证明且不存在歧义。0.0 表示零置信度。

置信度范围 0.0-1.0、两个端点含义均已准确保留。未发现明显问题。

P177需人工复核

Claude Sonnet 的输出:F0:257.72,F1:206.8,F2:105.096,F3:218.704,F4:253.904,F5:104.008,F6:103.808,F7:153.52,F8:119.408,F9:119.752,F10:122.896,F11:118.464,F12:222.04,F13:156.552,F14:103.808,F15:156.576,F16:155.68,F17:256.616,F18:101.696,F19:208.328。模型推理与解释输出:周期时长:\(50~\mu\mathrm{s}\);所有流周期与周期时长的最小公倍数:\(\mathrm{LCM}(1000,2500,5000,50)=5000~\mu\mathrm{s}\);超周期:\(5000~\mu\mathrm{s}\)。WCD 方程:\(WCD=\sum_{\mathrm{hops}}(T_{\mathrm{trans}}+T_{\mathrm{prop}}+T_{\mathrm{sw}}+T_{\mathrm{sync}})+T_{\mathrm{buffering}}\),其中 \(T_{\mathrm{trans}}=\frac{(\mathrm{Payload}+\mathrm{Ethernet\_overhead})\cdot 8}{R}\);\(T_{\mathrm{prop}}=1~\mu\mathrm{s}\)(每条链路);\(T_{\mathrm{sw}}=1~\mu\mathrm{s}\)(每台交换机);\(T_{\mathrm{sync}}=1~\mu\mathrm{s}\)(每跳);\(T_{\mathrm{buffering}}=(hop\_count-1)\cdot T_{\mathrm{cycle}}\)。在 CQF 中,如果帧正好在传输窗口关闭之后到达,则它们在每个中间交换机处最多可能等待一个周期时长。对于采用 CQF 的直通式转发,最坏情况会额外增加 \((N-1)*cycle\_duration\),其中 \(N\) 是跳数。跳数 = 5;Payload_bytes = 965 bytes;Frame_size_bytes = 1003 bytes;Transmission_delay = 80.24 μs;总传播时延 = 5 μs;总交换时延 = 4 μs;总同步误差 = 5 μs;CQF 缓冲 = 200 μs;WCD = 257.72 μs。真实 WCD:\(f_1\) 的路由跳数:4,\(T=50~\mu\mathrm{s}\),\(WCD_{f_1}=f_1.\phi+(SW_{\mathrm{num}}+1)T+\xi=(3+1)\times 50+5=205~\mu\mathrm{s}\)。专家解释:该模型了解 CQF 的工作机制。然而,该模型把周期时长与各流周期一起纳入计算,从而计算出了错误的超周期。此外,模型给出的 CQF WCD 解释是错误的。

原文抽取中,每个数值或公式后附带一份丢失首位数字和小数点的残缺副本(例如 `50 μs` 被重复为 `0 μs`,`LCM(1000,2500,5000,50)` 被重复为 `LCM(000,500,000,0)`,`5000 μs` 被重复为 `000 μs`,`257.72` 被重复为 `5772`),译文已删除这些残缺副本,仅保留完整数值。偏移项按式(22)的属性记法写作 \(f_1.\phi\);\(SW_{\mathrm{num}}\)、\(\xi\) 依赖上下文,且 `Ground truth WCD::` 原文有双冒号。需人工复核。

P178已复核

Claude Sonnet 的输出:

模型名 Claude Sonnet 已保留;冒号结构保留。未发现明显问题。

P179已复核

F0:257.72,F1:206.8,F2:105.096,F3:218.704,F4:253.904,F5:104.008,F6:103.808,F7:153.52,F8:119.408,F9:119.752,F10:122.896,F11:118.464,F12:222.04,F13:156.552,F14:103.808,F15:156.576,F16:155.68,F17:256.616,F18:101.696,F19:208.328。

F0-F19 的编号和值逐项保留,未添加单位,因为原文未显式给出单位。未发现明显问题。

P180需人工复核

模型推理与解释输出:周期时长:\(50~\mu\mathrm{s}\);所有流周期与周期时长的最小公倍数:\(\mathrm{LCM}(1000,2500,5000,50)=5000~\mu\mathrm{s}\);超周期:\(5000~\mu\mathrm{s}\)。WCD 方程:\(WCD=\sum_{\mathrm{hops}}(T_{\mathrm{trans}}+T_{\mathrm{prop}}+T_{\mathrm{sw}}+T_{\mathrm{sync}})+T_{\mathrm{buffering}}\),其中 \(T_{\mathrm{trans}}=\frac{(\mathrm{Payload}+\mathrm{Ethernet\_overhead})\cdot 8}{R}\);\(T_{\mathrm{prop}}=1~\mu\mathrm{s}\)(每条链路);\(T_{\mathrm{sw}}=1~\mu\mathrm{s}\)(每台交换机);\(T_{\mathrm{sync}}=1~\mu\mathrm{s}\)(每跳);\(T_{\mathrm{buffering}}=(hop\_count-1)\cdot T_{\mathrm{cycle}}\)。在 CQF 中,如果帧正好在传输窗口关闭之后到达,则它们在每个中间交换机处最多可能等待一个周期时长。对于采用 CQF 的直通式转发,最坏情况会额外增加 \((N-1)*cycle\_duration\),其中 \(N\) 是跳数。跳数 = 5;Payload_bytes = 965 bytes;Frame_size_bytes = 1003 bytes;Transmission_delay = 80.24 μs;总传播时延 = 5 μs;总交换时延 = 4 μs;总同步误差 = 5 μs;CQF 缓冲 = 200 μs;WCD = 257.72 μs。

原文抽取中,每个数值或公式后附带一份丢失首位数字和小数点的残缺副本(例如 `50 μs` 被重复为 `0 μs`,`LCM(1000,2500,5000,50)` 被重复为 `LCM(000,500,000,0)`,`5000 μs` 被重复为 `000 μs`),`965 bytes/1003 bytes/80.24 μs/200 μs/257.72 μs` 的 LaTeX 副本同样有丢位。译文已删除这些残缺副本,仅保留完整数值,需对照原 PDF 核对。

P181已复核

真实标注 WCD:

WCD 保留为缩写;原文有双冒号,译文按语义处理为标题冒号。未发现明显问题。

P182需人工复核

\(f_1\) 的路由跳数:4,\(T=50~\mu\mathrm{s}\),\(WCD_{f_1}=f_1.\phi+(SW_{\mathrm{num}}+1)T+\xi=(3+1)\times 50+5=205~\mu\mathrm{s}\)。

原文公式存在明显 OCR/抽取错误:同一句中出现 \(T=50~\mu s\) 与 \(T=0~\mu s\),以及 \((3+1)\times 50+5\) 与 \((3+1)\times 0+5\)、\(205~\mu s\) 与类似 \(05~\mu s\) 的残缺结果。译文依据可自洽计算保留为 205 μs,但需人工核对原 PDF。

P183已复核

专家解释:

标题性短语,术语无风险。未发现明显问题。

P184已复核

该模型知道 CQF 的工作机制。然而,该模型在计算 Hypercycle 时错误地将周期时长与各个周期一起纳入计算。此外,该模型给出的 CQF WCD 解释是错误的。

CQF、Hypercycle、WCD 保留为术语缩写;“cycle duration along with the periods” 译为“周期时长与各个周期一起纳入计算”,逻辑上表达了错误包含关系。未发现明显问题。
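
按照专家的解释,超周期的计算不应把周期时长 \(T\) 一并纳入。下面的草图假设超周期是所有流周期的最小公倍数,这是调度文献中的常见定义,但本节并未给出论文自己的定义;函数名 `hypercycle_us` 是为说明而引入的。

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """两个正整数的最小公倍数。"""
    return a * b // gcd(a, b)


def hypercycle_us(flow_periods_us):
    """返回所有流周期(微秒,正整数)的最小公倍数。"""
    periods = list(flow_periods_us)
    if not periods or any(p <= 0 for p in periods):
        raise ValueError("需要至少一个正的流周期")
    return reduce(lcm, periods)


if __name__ == "__main__":
    print(hypercycle_us([1000, 2500, 5000]))
```

超周期确定之后,一个周期时隙的选择就对应超周期内的一个时隙编号,时隙总数为超周期除以 \(T\)。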

P185需人工复核

GPT-4o 输出:F0:260.0,F1:210.0,F2:60.0,F3:210.0,F4:260.0,F5:60.0,F6:60.0,F7:110.0,F8:60.0,F9:60.0,F10:60.0,F11:60.0,F12:210.0,F13:110.0,F14:60.0,F15:110.0,F16:110.0,F17:260.0,F18:60.0,F19:210.0。模型推理与解释输出:Hypercycle = \(100~\mu\mathrm{s}\)。\(WCD=(hop\_count\times(cycle\_duration+T_{\mathrm{switch}}+T_{\mathrm{prop}}+T_{\mathrm{sync}}))+\frac{\mathrm{Payload}}{\mathrm{Bandwidth}}\)。其中详细说明了每条流的路径、周期时长、Hypercycle 和 CQF 参数,以及用于 WCD 计算的方程。没有发现缺失输入,但基于典型网络配置作出了一些假设。专家解释:模型计算出的 Hypercycle 是错误的,此外,模型使用的 WCD 方程多出了一个错误组成项 \(\frac{\mathrm{Payload}}{\mathrm{Bandwidth}}\)。最有意思的发现之一是,所有模型在不同运行和不同测试用例中都给出了非常不同的 WCD 方程。Ministral 3 8B 输出:F0:1160.0,F1:1020.0,F2:100.0,F3:1160.0,F4:1260.0,F5:100.0,F6:100.0,F7:100.0,F8:100.0,F9:100.0,F10:100.0,F11:100.0,F12:1160.0,F13:100.0,F14:100.0,F15:100.0,F16:100.0,F17:1260.0,F18:100.0,F19:1160.0。模型推理与解释输出:Hypercycle:\(H=10000~\mu\mathrm{s}\),最大帧大小 = 1526 bytes = 1500 + 26 bytes,最大传输时间 = \(12.208~\mu\mathrm{s}\),最大排队时延 = \(48~\mu\mathrm{s}\),\(WCD=h(T_{\mathrm{prop}}+T_{\mathrm{switch}}+T_{\mathrm{sync}})+T_{\mathrm{tx,max}}\left\lceil\frac{\mathrm{payload}}{\mathrm{max\_frame\_size}}\right\rceil+T_{\mathrm{queue,max}}(h-1)\)。

原文包含多处公式抽取残缺,例如 Hypercycle 同时显示为 100 μs 与类似 00 μs,Ministral 部分的 10000 μs、1526 bytes、12.208 μs、48 μs 也伴随残缺显示。译文按较完整的数学/文本片段处理。该段似乎合并了多个输出、模型解释和专家解释,结构可能来自表格或跨页抽取,需核对上下文。

P186已复核

GPT-4o 输出:

模型名保留;标题性短语。未发现明显问题。

P187已复核

F0:260.0,F1:210.0,F2:60.0,F3:210.0,F4:260.0,F5:60.0,F6:60.0,F7:110.0,F8:60.0,F9:60.0,F10:60.0,F11:60.0,F12:210.0,F13:110.0,F14:60.0,F15:110.0,F16:110.0,F17:260.0,F18:60.0,F19:210.0。

数值和流编号逐项保留;未发现明显问题。

P188需人工复核

模型推理与解释输出:Hypercycle = \(100~\mu\mathrm{s}\)。\(WCD=(hop\_count\times(cycle\_duration+T_{\mathrm{switch}}+T_{\mathrm{prop}}+T_{\mathrm{sync}}))+\frac{\mathrm{Payload}}{\mathrm{Bandwidth}}\)。其中详细说明了每条流的路径、周期时长、Hypercycle 和 CQF 参数,以及用于 WCD 计算的方程。没有发现缺失输入,但基于典型网络配置作出了一些假设。

原文 Hypercycle 公式处存在 “100 μs” 与残缺 “00 μs” 并存的抽取问题;WCD 公式保留原有结构和错误项 \(\frac{\mathrm{Payload}}{\mathrm{Bandwidth}}\)。需人工核对 Hypercycle 数字。

P189已复核

专家解释:

标题性短语,术语无风险。未发现明显问题。

P190已复核

模型计算出的 Hypercycle 是错误的,此外,模型使用的 WCD 方程多出了一个错误组成项 \(\frac{\mathrm{Payload}}{\mathrm{Bandwidth}}\)。最有意思的发现之一是,所有模型在不同运行和不同测试用例中都给出了非常不同的 WCD 方程。

Hypercycle、WCD 保留为术语;\(\frac{\mathrm{Payload}}{\mathrm{Bandwidth}}\) 保留为公式项;“finding” 原文单数搭配不规范,译文按语义处理。未发现明显问题。

P191已复核

Ministral 3 8B 输出:

模型名 Ministral 3 8B 保留不译;该段为输出标签,无数字或公式风险。未发现明显问题。

P192需人工复核

F0:1160.0,F1:1020.0,F2:100.0,F3:1160.0,F4:1260.0,F5:100.0,F6:100.0,F7:100.0,F8:100.0,F9:100.0,F10:100.0,F11:100.0,F12:1160.0,F13:100.0,F14:100.0,F15:100.0,F16:100.0,F17:1260.0,F18:100.0,F19:1160.0。模型推理与解释输出:Hypercycle:\(H=10000~\mu\mathrm{s}\);最大帧大小 = 1526 字节 = 1500 + 26 字节;最大传输时间 = \(12.208~\mu\mathrm{s}\);最大排队时延 = \(48~\mu\mathrm{s}\);\(WCD=h(T_{\mathrm{prop}}+T_{\mathrm{switch}}+T_{\mathrm{sync}})+T_{\mathrm{tx,max}}\left\lceil\frac{\mathrm{payload}}{\mathrm{max\_frame\_size}}\right\rceil+T_{\mathrm{queue,max}}(h-1)\)。

已保留所有流编号、数值、单位和公式符号。原文抽取时,每个数值后都附带一份首位数字缺失的重复片段(如 10000 后的 0000、1526 后的 526、12.208 后的 2208、48 后的 8);此处已删除这些残缺重复,仅保留完整数值。完整数值仍建议对照原论文版面核对。

P193需人工复核

跳数:\(h=5\);传播 + 交换 + 同步 = \(5\times(1+1+1)=15~\mu\mathrm{s}\);传输时间 = \(12.208\times\left\lceil\frac{2500}{1526}\right\rceil=12.208\times 2=24.416~\mu\mathrm{s}\);排队时延 = \(48\times(5-1)=192~\mu\mathrm{s}\);总计 = \(15+24.416+192=231.416~\mu\mathrm{s}\);调整后的最坏情况 = \(1160~\mu\mathrm{s}\)。专家解释:模型给出的 WCD 方程是错误的。尽管该模型考虑了路径中的跳数和每一跳累积的时延,并计算出了跳数,但它遗漏了 WCD 方程中最关键的部分,即周期持续时间。此外,模型的 WCD 方程中有两个组成项,即 \(T_{\mathrm{tx,max}}\left\lceil\frac{\mathrm{payload}}{\mathrm{max\_frame\_size}}\right\rceil\) 和 \(T_{\mathrm{queue,max}}(h-1)\),完全是幻觉生成的。该模型的 WCD 值偏大,主要就是由这两项造成的。

已保留跳数、计算式、WCD、payload、max_frame_size、\(T_{\mathrm{tx,max}}\)、\(T_{\mathrm{queue,max}}\) 等术语和公式。原文抽取时,每个算式后都附带一份首位数字缺失的重复片段(如 15 后的 5、24.416 后的 4416、192 后的 92、231.416 后的 31416、1160 后的 160);此处已删除这些残缺重复。保留的完整算式逐项自洽(\(15+24.416+192=231.416\)),但从 \(231.416~\mu\mathrm{s}\) 到“调整后的最坏情况” \(1160~\mu\mathrm{s}\) 的调整依据在本段中没有给出,仍需人工核对原论文版面。

P194需人工复核

跳数:\(h=5\);传播 + 交换 + 同步 = \(5\times(1+1+1)=15~\mu\mathrm{s}\);传输时间 = \(12.208\times\left\lceil\frac{2500}{1526}\right\rceil=12.208\times 2=24.416~\mu\mathrm{s}\);排队时延 = \(48\times(5-1)=192~\mu\mathrm{s}\);总计 = \(15+24.416+192=231.416~\mu\mathrm{s}\);调整后的最坏情况 = \(1160~\mu\mathrm{s}\)。

该段与 P193 的模型计算部分重复,已逐项保留。原文抽取时,每个算式后都附带一份首位数字缺失的重复片段(如 15 后的 5、24.416 后的 4416、192 后的 92、231.416 后的 31416、1160 后的 160);此处已删除这些残缺重复,仅保留完整算式,仍需人工核对原论文版面。

P195已复核

专家解释:

该段为小标题,术语 Expert Explanation 译为“专家解释”符合上下文。未发现明显问题。

P196已复核

模型给出的 WCD 方程是错误的。尽管该模型考虑了路径中的跳数和每一跳累积的时延,并计算出了跳数,但它遗漏了 WCD 方程中最关键的部分,即周期持续时间。此外,模型的 WCD 方程中有两个组成项,即 \(T_{\mathrm{tx,max}}\left\lceil\frac{\mathrm{payload}}{\mathrm{max\_frame\_size}}\right\rceil\) 和 \(T_{\mathrm{queue,max}}(h-1)\),完全是幻觉生成的。该模型的 WCD 值偏大,主要就是由这两项造成的。

WCD、跳数、周期持续时间、payload、max_frame_size 等术语已保留或准确翻译。原文第二句语法略不完整,“Even though ... However ...”存在衔接问题,译文保留其转折关系;公式部分可读但来自抽取文本,仍有轻微上下文缺失风险。

P197已复核

为了理解 WCD 计算失败的性质,我们识别出在不同模型和机制中观察到的五种不同失败模式。

WCD 保留为缩写;“models and mechanisms”译为“模型和机制”,符合 TSNBench 上下文。未发现明显问题。

P198已复核

模型对所有流都返回 \(WCD = 0\),生成了结构上有效的 JSON 响应,但没有任何计算内容。这种失败模式影响了 CBS 上的 GPT-4o 和 DeepSeek-V3.2(Non-thinking),以及 CBS 和 CQF 的所有测试用例中的 Llama 3.2 1B。这表明这些模型能够识别输出格式要求,但无法参与底层 NC 计算,也无法进行 WCD 计算背后的任何推理。

WCD、JSON、CBS、CQF、NC、模型名和 Non-thinking 均保留;“all test cases”译为“所有测试用例”。未发现明显问题。

P199已复核

在给定 TC 中,模型为少于 80% 的流生成有效 WCD 值,从而导致覆盖不完整。这影响了 CBS 上的 Mistral Large 3 和 CBS 上的 Llama 3.3,表明这些模型在大型拓扑中会丢失对流索引的跟踪。

80%、TC、WCD、CBS、模型名均保留;“fewer than 80%”准确译为“少于 80%”。未发现明显问题。

P200已复核

由于上下文窗口限制或 API 超时,模型无法处理完整的开放式提示。这影响了 Qwen3 8B(所有 TC 均发生 API 超时)和 Llama 3.2 1B(超出上下文限制),确认小模型在结构上不适合 TSN 开放式评估。

API、TC、Qwen3 8B、Llama 3.2 1B、TSN 均保留;“open-ended prompt/evaluation”译为“开放式提示/开放式评估”。未发现明显问题。

P201已复核

对于所有开放式测试用例,无论网络拓扑或流数量如何,该模型都返回空响应。这种失败模式仅影响 DeepSeek-V3.2 (Thinking):在所有被评估的拓扑(单交换机、中等网格和环形配置)和所有流上,无论针对 CBS 机制还是 CQF 机制,它都不产生任何输出,既没有 WCD 值,也没有中间推理。

术语方面,open-ended test cases 译为“开放式测试用例”,network topology 译为“网络拓扑”,flow count 译为“流数量”,WCD、CBS、CQF 保留缩写,符合技术论文语境;数字方面未涉及具体数值;逻辑方面准确保留了“regardless of”和“exclusively affects”的限定关系,以及“neither...nor...”的否定并列;公式/缩写未发现残缺。未发现明显问题。

英文原文
P001Block 4

Recent advances in large language models (LLMs) across different domains such as engineering (Jackson et al., 2025; Guo et al., 2025), medicine (Xie et al., 2025; Liu et al., 2023; Li et al., 2024), clinical practice (Kweon et al., 2024), computer networking (Sharma and Yegneswaran, 2023), telecommunications (Maatouk et al., 2026; Ferrag et al., 2026; Oluwaseyi et al., 2025; Gajjar et al., 2025), and automation (Shen et al., 2024) have shown groundbreaking performance in assisting engineers, practitioners, researchers (Huang et al., 2023; Sun et al., 2024), and doctors in solving real-world problems. System engineers are increasingly using LLMs to design and configure networks (Wang et al., 2024a), generate code, and analyze network logs. With this, they are entering new territory: safety-critical application domains such as autonomous vehicles, aerospace (Fiori et al., 2024; Sanchez-Garrido et al., 2021), defense (Elliott, 2023), and industrial communication (Zhang et al., 2024). In these contexts, the accuracy, reliability, and consistency of LLMs become far more than leaderboard metrics, as they become engineering requirements.

P002Block 5

Time-Sensitive Networking (TSN) (31), standardized by the IEEE 802.1 Working Group (WG), is a layer-2 Ethernet technology that provides deterministic communication guarantees for safety-critical applications. TSN deployments typically separate traffic based on timing criticality. Safety-critical periodic communication with guaranteed latency and bounded jitter is categorized as time-triggered (TT) (Ademaj et al., 2019) traffic and is served using the IEEE 802.1Qbv timed-gate mechanism. TT transmissions are controlled by a Gate-Control List (GCL), computed offline using exact methods such as SMT-based synthesis (Craciunas et al., 2016) or heuristic approaches (Pop et al., 2016; Gavriluţ et al., 2018; Bujosa et al., 2022). In contrast, periodic or sporadic communication requiring bounded end-to-end latency but less stringent jitter control is classified as Audio Video Bridging (AVB) stream traffic (Böhm and Wermser, 2021; Bruckner et al., 2019). Consequently, Worst-Case Delay (WCD) estimation errors of tens or hundreds of microseconds are significant, as they can consume timing margins, violate deadlines, or lead to infeasible TSN configurations. In mission-critical deployments, such errors can have severe consequences. A misconfigured TSN network can cause, for example, a robotic arm to miss a critical assembly step, a brake system to fail on a highway, an aircraft control system to respond incorrectly, a defense mechanism to collapse, or a spacecraft to miss a vital signal. These failures may result from sub-millisecond timing violations caused by a single misconfiguration. These risks highlight the importance of accurate analysis and configuration in TSN systems, especially as LLMs are increasingly integrated into network management workflows. Therefore, their domain proficiency must be rigorously evaluated. However, to the best of our knowledge, no existing benchmark evaluates LLM proficiency in TSN.

P003Block 6

To fill this gap, we introduce TSNBench, the first benchmark for evaluating LLM proficiency in TSN, comprising two complementary evaluation components. The first is a 939-question expert-validated multiple-choice question and answer (MCQA) dataset, generated from 83 peer-reviewed research papers using three LLMs from distinct model families and rigorously reviewed by five domain experts, each with over eight years of TSN research experience. The second is a set of open-ended questions requiring multi-step WCD computation for two widely deployed TSN mechanisms, namely Credit-Based Shaper (CBS) (32) and Cyclic Queuing and Forwarding (CQF) (33; Yan et al., 2020), across varying network topologies and traffic flows, with ground truth computed using a verified Network Calculus (NC) solver (Zhao et al., 2018) for CBS and closed-form mathematical upper bounds for CQF (Wang et al., 2023). These open-ended WCD questions are intended as a closed-book stress test of standalone model capability, rather than as a deployment workflow for free-text LLM timing outputs. Detailed background on TSN, NC, CBS, and CQF is provided in Appendices 7, 8, 9, and 10, respectively.

P004Block 7

While general-purpose benchmarks such as MMLU (Hendrycks et al., 2021) and MMLU-Pro (Wang et al., 2024b) evaluate broad subject knowledge spanning elementary mathematics, history, and law, they are fundamentally unsuited for safety-critical domain-specific evaluation. Answering a multiple-choice question about elementary school history is categorically different from answering TSN terminology questions and correctly computing a WCD under NC constraints for a given network topology. Without a benchmark that captures this distinction, there is no principled way to measure LLM progress in deterministic networking domains. TSNBench is designed precisely to expose this gap.

P005Block 8

We evaluate 16 LLMs comprising open-source and closed-source models, as well as general-purpose and reasoning-specialized architectures. Our results reveal a striking dissociation, where models achieve 67% to 95% accuracy on MCQA yet fail substantially on open-ended WCD computation. The best-performing model, GPT-5, achieves a Mean Absolute Percentage Error (MAPE) of 36.2% on CBS, while most models exceed 80%. This is concerning in a domain where timing violations of tens of microseconds, even 1% of a 1000 μs deadline, may cause system failures.

P006Block 9

Our key contributions are:
1. First expert-validated TSN benchmark: TSNBench evaluates LLM knowledge of TSN mechanisms through 939 expert-validated MCQs derived from peer-reviewed TSN literature.
2. Open-ended timing-analysis tasks: TSNBench includes open-ended WCD computation tasks for CBS and CQF with ground truth computed using a verified NC solver for CBS and closed-form mathematical bounds for CQF.
3. Evaluation across 16 LLMs: We evaluate both open-source and closed-source models, including general-purpose and reasoning-specialized models, and show that high MCQA accuracy does not reliably predict accurate WCD computation.

P007Block 10

First expert-validated TSN benchmark: TSNBench evaluates LLM knowledge of TSN mechanisms through 939 expert-validated MCQs derived from peer-reviewed TSN literature.

P008Block 11

Open-ended timing-analysis tasks: TSNBench includes open-ended WCD computation tasks for CBS and CQF with ground truth computed using a verified NC solver for CBS and closed-form mathematical bounds for CQF.

P009Block 12

Evaluation across 16 LLMs: We evaluate both open-source and closed-source models, including general-purpose and reasoning-specialized models, and show that high MCQA accuracy does not reliably predict accurate WCD computation.

P010Block 13

In summary, TSNBench provides the research community with the first rigorous evaluation resource for LLM proficiency in TSN, offering valuable insights to both the real-time networking community exploring LLM-assisted TSN management and the machine learning community seeking to understand the limits of LLMs in safety-critical, computationally demanding domains.

P011Block 16

Benchmarking and datasets are essential for measuring LLM progress and identifying key gaps and limitations (Hendrycks et al., 2021; Wang et al., 2024b). General knowledge benchmarks such as MMLU (Hendrycks et al., 2021) and MMLU-Pro (Wang et al., 2024b) evaluate broad subject knowledge including elementary mathematics, history, computer science, and law, using multiple-choice questions. Domain-specific benchmarks have extended this paradigm to medicine (Xie et al., 2025; Liu et al., 2023; Li et al., 2024), clinical practice (Kweon et al., 2024), law (Guha et al., 2023), code generation (Hua et al., 2025; Huang et al., 2024), and scientific research (Sun et al., 2024). While these benchmarks have driven significant progress, they are not designed to evaluate safety-critical networking tasks. Most rely on multiple-choice evaluation, and none assess whether a model can perform the multi-step computational reasoning required in safety-critical networking domains. TSNBench addresses this gap by introducing MCQA and open-ended WCD computation questions with ground truth verified by state-of-the-art NC solvers, providing an evaluation of TSN that no existing general benchmark captures.

P012Block 18

In the last few years, several benchmarks have evaluated LLM proficiency in networking and telecommunications domains. TeleQnA (Maatouk et al., 2026) presents an MCQ dataset for telecommunications, generated from research documents and 3GPP standards and validated by domain experts. 6G-Bench (Ferrag et al., 2026) presents an MCQ-based dataset for 6G networks containing 3,722 difficult questions validated through automated filtering and expert human review. Beyond question-answering benchmarks, NetConfEval (Wang et al., 2024a) evaluates LLMs on network configuration tasks and demonstrates that LLMs can simplify and automate complex network management tasks.

P013Block 20

The application of LLMs to TSN management and orchestration is still at a very early stage, with only limited initial studies available. Windmann et al. (2025) explored the use of LLMs for configuring hybrid 5G/TSN networks by assisting users with manual configuration tasks and suggesting configurations in a 5G-TSN network. However, this work remains preliminary and does not provide experimental results. Overall, prior work does not provide a systematic benchmark or rigorous evaluation of LLM proficiency across TSN mechanisms, nor does it assess computational reasoning capabilities for WCD analysis. TSNBench fills this gap by providing the first structured benchmark covering both declarative TSN knowledge through MCQA and computational reasoning through open-ended WCD evaluation.

P014Block 22

Unlike established domains such as medicine (Xie et al., 2025), 5G (Oluwaseyi et al., 2025; Maatouk et al., 2026), general human knowledge (Phan et al., 2026; Hendrycks et al., 2021; Wang et al., 2024b), coding (Hua et al., 2025; Huang et al., 2024), and law (Guha et al., 2023), no open-source TSN dataset exists for LLM evaluation (Zhang et al., 2024; Peng et al., 2023; Zanbouri et al., 2025; Adil et al., 2026). As highlighted by Liu et al. (2023), the data source determines the reliability of a dataset, and generating a high-quality dataset is a crucial prerequisite for meaningful benchmarking. We describe the TSNBench construction pipeline below, with full details provided in Appendix 11.

P015Block 24

Published research papers and standards are among the most reliable sources for building domain-specific datasets (Liu et al., 2023). Since TSN knowledge originates primarily from peer-reviewed research and IEEE 802.1 TSN standards, we curate a collection of open-access research documents as our source corpus. To avoid copyright issues and exclude papers with incorrect results or flawed methodologies, we include only published open-access papers. For papers not available in open-access form, we use arXiv versions that have been published or accepted, excluding unpublished preprints with unverified results. Where possible, we also collect author manuscript versions with proper attribution. To ensure quality, we prioritize highly cited papers from reputable venues while accounting for publication timeline, as recent papers naturally have fewer citations. In total, we collect 83 research papers covering a broad range of TSN mechanisms, including Time-Aware Shaper (TAS), CBS, CQF, NC-based schedulability analysis, performance evaluation, hardware experiments, combined shapers such as TAS+CBS (Zhao et al., 2022), and Multi-CQF (Alexandris et al., 2022). Detailed background on TSN, related work, and its mechanisms is given in Appendix 7.

P016Block 26

TSN employs specialized vocabulary, similar to other communication domains (Andrews et al., 2014; Saad et al., 2020; Ma et al., 2019). A successful LLM that understands TSN should be able to reason correctly about TSN terminology. A model that cannot differentiate between TAS and CBS, or cannot correctly expand TSN-specific acronyms, cannot be considered proficient in TSN. To capture this dimension, we extract keywords and acronyms widely used in TSN literature and use them to guide MCQA generation. All terms are extracted from the 83 research documents using Claude Sonnet 4, as shown in Table 1, and stored in JSON format. Each document is preprocessed to remove non-relevant content, including author names, affiliations, figures, tables, URLs, and pseudocode. The model is instructed to extract only terms defined within the document, without relying on pretrained knowledge, and to provide each term’s acronym, full form, and one-to-two-sentence definition from the source. The extracted set is then reviewed by domain experts to resolve duplicates, retaining the longer definition in cases of conflict. Figure 1 illustrates this pipeline.
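
The duplicate-resolution rule described above (retain the longer definition when two extractions of the same term conflict) can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: in the paper this step is performed by domain experts, and the field names `acronym`, `full_form`, and `definition`, the function `merge_terms`, and the sample definitions are all assumptions made for the example.

```python
import json


def merge_terms(entries):
    """Merge extracted glossary entries, keeping the longer definition per acronym.

    `entries` is a list of dicts with the hypothetical keys
    "acronym", "full_form", and "definition".
    """
    merged = {}
    for entry in entries:
        key = entry["acronym"].strip().upper()  # match acronyms case-insensitively
        kept = merged.get(key)
        if kept is None or len(entry["definition"]) > len(kept["definition"]):
            merged[key] = entry
    return [merged[k] for k in sorted(merged)]


# Placeholder entries (illustrative text, not taken from the TSNBench keyword file).
extracted = [
    {"acronym": "CBS", "full_form": "Credit-Based Shaper",
     "definition": "A traffic shaper."},
    {"acronym": "cbs", "full_form": "Credit-Based Shaper",
     "definition": "A shaper that uses credits to regulate queue transmission."},
    {"acronym": "CQF", "full_form": "Cyclic Queuing and Forwarding",
     "definition": "A mechanism that forwards frames in fixed cycles."},
]

glossary = merge_terms(extracted)
print(json.dumps(glossary, indent=2))
```

Keying on the upper-cased acronym is one simple way to detect duplicates; a real review would also need to handle acronyms that expand differently in different papers.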

P017Block 29

To optimize time and reduce manual effort, we use an LLM-based approach to generate MCQAs from research documents. The keyword file is provided alongside the research documents as additional input, serving as an independent source to complement research paper content during generation. We use three models from distinct families, namely Claude Sonnet 4, GPT-4o mini, and Llama 3.1 70B, as shown in Table 1. These models are deliberately selected to ensure diverse styles and reasoning capabilities, thereby reducing generative bias. The same system prompt is used for all models, and each research paper is assigned to exactly one model in a round-robin manner. Non-relevant sections, such as author information, affiliations, references, URLs, figures, tables, and pseudocode, are removed from each document before generation.
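
The round-robin assignment of papers to generator models can be sketched as below. The model names come from the text; the rotation order, the placeholder paper IDs, and the function name `assign_round_robin` are assumptions, so the per-model counts this produces are not necessarily those reported in Table 1.

```python
from itertools import cycle

GENERATORS = ["Claude Sonnet 4", "GPT-4o mini", "Llama 3.1 70B"]


def assign_round_robin(paper_ids, models=GENERATORS):
    """Assign each paper to exactly one generator model, cycling through the models in order."""
    return {paper: model for paper, model in zip(paper_ids, cycle(models))}


papers = [f"paper_{i:02d}" for i in range(1, 84)]  # 83 placeholder paper IDs
assignment = assign_round_robin(papers)

# Count how many papers each model receives under this rotation.
counts = {m: sum(1 for v in assignment.values() if v == m) for m in GENERATORS}
print(counts)
```

Under a strict rotation, 83 papers split across three models as 28, 28, and 27, which keeps any single model's style from dominating the question pool.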

P018Block 31

LLM-generated MCQAs cannot be used directly for benchmarking, as they may contain incorrectly formulated questions, incomplete options, or vague and incorrect answer choices. To address positional bias introduced by the generating model, answer options are shuffled randomly prior to human expert review, with the correct answer label updated to reflect the new ordering.
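
Shuffling the options while keeping the answer key correct is easy to get wrong, so a small sketch may help. It assumes each question stores its options as a label-to-text mapping plus a correct label; that layout, the function name `shuffle_options`, and the sample question are hypothetical and not taken from the TSNBench dataset.

```python
import random


def shuffle_options(options, correct_label, rng):
    """Shuffle answer options and return (new_options, new_correct_label).

    `options` maps labels such as "A".."D" to option text. Positions are
    shuffled by index so that duplicate option texts cannot confuse the relabeling.
    """
    labels = sorted(options)
    order = list(range(len(labels)))  # order[new_pos] == old_pos after shuffling
    rng.shuffle(order)
    new_options = {labels[new_pos]: options[labels[old_pos]]
                   for new_pos, old_pos in enumerate(order)}
    new_correct = labels[order.index(labels.index(correct_label))]
    return new_options, new_correct


# Illustrative question: which mechanism serves TT traffic (IEEE 802.1Qbv, per the text).
opts = {"A": "IEEE 802.1Qbv", "B": "IEEE 802.1Qav", "C": "IEEE 802.1AS", "D": "IEEE 802.1CB"}
rng = random.Random(0)  # seeded so the shuffle is reproducible
new_opts, new_label = shuffle_options(opts, "A", rng)
print(new_opts, new_label)
```

Passing an explicit `random.Random` instance keeps the shuffle reproducible, which matters if reviewers need to trace a question back to its generated form.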

P019Block 33

Given the safety-critical nature of TSN, rigorous human validation is essential. We engage five TSN domain experts: three senior professors with more than 15 years of research experience and two postdoctoral researchers with more than 8 years of expertise. Each question is independently evaluated with four outcomes: (i) accept - correct and clear; (ii) revise - requires modification for clarity or correctness; (iii) reject - the question is incorrect, misleading, or irrelevant; or (iv) doubtful - the expert is uncertain and passes it to remaining reviewers for consensus. Questions without consensus are discarded. Full review criteria are provided in Appendix 11.1 and Table 5. Table 2 summarizes the dataset statistics and Figure 2 illustrates the full pipeline.

P020Block 35

While MCQA evaluates declarative TSN knowledge, open-ended questions assess whether LLMs can perform the multi-step mathematical reasoning required in real TSN deployments. We evaluate WCD computation, as WCD is a central key performance indicator (KPI) in TSN network design and directly determines whether a network meets its stringent timing requirements. We select two TSN mechanisms for this evaluation: CBS and CQF. CBS is widely deployed for audio-video traffic and requires NC-based analysis, making it mathematically demanding. CQF is a more recently standardized TSN mechanism whose WCD can be computed from a closed-form equation given routing and cycle duration (\(\mathrm{T}\)), providing a complementary evaluation that isolates formula application from NC complexity. Together, these two mechanisms span a meaningful range of WCD computation difficulty. Ground truth WCD values are computed using a verified state-of-the-art NC tool (Zhao et al., 2018) for CBS and a closed-form mathematical upper bound for CQF. We release all ground truth WCD values alongside the questions to support future open-source community evaluations. Each open-ended question is formulated by domain experts, as shown in Figure 3, and comprises three components: network topology, flow information, and flow routing. In TSNBench, three topologies are used to cover a broad range of scenarios: (i) one-switch topology (Figure 15), (ii) medium-mesh topology (Figure 16), and (iii) ring topology, representing industrial networks (Figure 17). Each topology consists of end nodes and switches connected via Ethernet links, with unicast traffic flows transmitted from a sender to a single receiver. Flows consist of Ethernet frames whose maximum payload is bounded by the Maximum Transmission Unit (MTU). Further topology, flow, and routing details are provided in Appendix 11.4.

P021Block 37

For both MCQA and open-ended evaluations, each prompt defines the model’s role as a TSN expert. For MCQA, we use zero-shot prompting with no in-context examples, representing a conservative approach that measures inherent TSN proficiency, ensuring that the output performance reflects the model’s domain knowledge rather than in-context pattern matching. For open-ended questions, we also use a zero-shot setting, providing no example WCD calculations or NC or CQF equations, ensuring the model independently recalls and applies the correct computational methodology. For both question types, the model is asked to provide a confidence score alongside its answer.

P022Block 38

The open-ended prompt comprises three variable components: network topology, flow parameters, and pre-computed shortest path routes. The same prompt template is used across all 100 open-ended evaluation instances per mechanism, with only these three components varying. Fixed network constants are maintained throughout to ensure comparability across models and instances. A detailed discussion of the open-ended prompt design is provided in Appendix 11.3.

P023Block 40

For the MCQA dataset, performance is measured as the percentage of questions answered correctly, reported as accuracy. For the open-ended questions, we evaluate the computational reasoning capability of each model by comparing its predicted WCD values against ground truth values. For CBS, ground truth WCD values are derived using NC-based Total Flow Analysis (TFA). Specifically, the worst-case delay upper bound \(D_{f}^{h}\) for flow \(f\in\mathcal{F}_{M_{i}}^{h}\) at \(h\) equals the worst-case delay upper bound \(D_{M_{i}}^{h}\) for all flows with the same priority \(M_{i}\) aggregating at \(h\), \[D_{f}^{h}=D_{M_{i}}^{h}=hDev(\alpha_{M_{i}}^{h},\beta_{M_{i}}^{h})=\sup_{t\geq 0}\left\{\inf\left\{\tau\geq 0\mid\alpha_{M_{i}}^{h}(t)\leq\beta_{M_{i}}^{h}(t+\tau)\right\}\right\},\tag{1}\] where \(\alpha_{M_{i}}^{h}(t)\) represents the arrival curve of aggregate flows of priority \(M_{i}\) passing through \(h\), and \(\beta_{M_{i}}^{h}(t)\) represents the service curve for these corresponding flows. The end-to-end WCD for a flow is obtained by summing per-port delay bounds along its route. Full NC methodology and proofs are provided in Appendix 8.
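
For intuition about Eq. (1), consider the textbook special case of a token-bucket arrival curve \(\alpha(t)=b+rt\) and a rate-latency service curve \(\beta(t)=R\cdot\max(0,t-T)\) with \(r\leq R\), for which the horizontal deviation reduces to \(T+b/R\). The sketch below evaluates that closed form per port and sums the results along a route. It is an illustration under these assumed curve shapes, not the TFA solver used for the ground truth: the CBS curves in the paper's appendix are more involved, the burst at each port is supplied directly rather than propagated from upstream, and all names and numeric values are hypothetical.

```python
import math


def hdev_token_bucket(burst_bits, rate_bps, service_rate_bps, latency_s):
    """Horizontal deviation between alpha(t) = b + r*t and beta(t) = R*max(0, t - T).

    Returns T + b/R when r <= R; otherwise the delay is unbounded.
    """
    if rate_bps > service_rate_bps:
        return math.inf
    return latency_s + burst_bits / service_rate_bps


def end_to_end_bound(ports):
    """Sum the per-port delay bounds along a route."""
    return sum(hdev_token_bucket(*port) for port in ports)


# Hypothetical two-port route:
# (burst [bit], arrival rate [bit/s], service rate [bit/s], latency [s]).
route = [
    (12_000, 10e6, 100e6, 20e-6),
    (24_000, 20e6, 100e6, 20e-6),
]
bound_us = end_to_end_bound(route) * 1e6
print(f"end-to-end bound: {bound_us:.1f} us")
```

Even in this simplified setting the bound depends on four parameters per port, which hints at why recalling and applying the full NC methodology without examples is hard for a model.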

P024Block 41

For CQF, the worst-case end-to-end delay is given by the closed-form expression \[\mathrm{WCD}=f_{i}.\phi+(\mathrm{SW_{num}}+1)\cdot\mathrm{T}+\xi,\tag{2}\] where \(f_{i}.\phi\) is the flow offset at the source node in μs, \(\mathrm{SW_{num}}\) is the number of switches along the flow route, \(\mathrm{T}\) is the cycle duration in μs, and \(\xi\) denotes the network-specific delays including processing delay, propagation delay, switching delay, and time synchronization error. The derivation and proof of this bound are provided in Appendix 10.
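
Eq. (2) translates directly into code. The sketch below is a plain transcription of the formula; the function name, argument names, and the example values are assumptions chosen for illustration and do not correspond to any TSNBench test case.

```python
def cqf_wcd_us(offset_us, num_switches, cycle_us, xi_us):
    """Closed-form CQF bound of Eq. (2): offset + (SW_num + 1) * T + xi, in microseconds."""
    if num_switches < 0 or cycle_us <= 0:
        raise ValueError("num_switches must be >= 0 and cycle_us must be > 0")
    return offset_us + (num_switches + 1) * cycle_us + xi_us


# Hypothetical flow: zero offset, 3 switches on the route,
# a 50 us cycle, and 10 us of network-specific delay.
wcd = cqf_wcd_us(offset_us=0.0, num_switches=3, cycle_us=50.0, xi_us=10.0)
print(wcd)  # 210.0
```

The bound grows linearly with the number of switches and is dominated by the cycle duration, so a response that omits the cycle term cannot match the ground truth for any route.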

P025Block 43

We evaluate 16 state-of-the-art LLMs spanning open-source and closed-source models across general-purpose and reasoning-specialized architectures. Table 6 in Appendix 12 provides the full list of models with their model IDs and organizations. All models are accessed via their respective official vendor APIs with no fine-tuning applied: GPT (OpenAI API), DeepSeek (DeepSeek API), Mistral (Mistral AI API), Claude (Anthropic API), Gemini (Google AI API), Grok (xAI API), and Llama and Qwen (Hugging Face inference router). All client-side operations, including prompt construction, API handling, response parsing, and metric computation, are performed on a standard workstation. To assess repeatability and stochasticity, each MCQA and open-ended question is evaluated three times under two temperature settings: deterministic (T = 0.0) and stochastic (T = 0.7). Since TSN is widely used in safety-critical domains, deterministic responses are essential, as non-determinism would undermine the reliability of LLM-based TSN reasoning. For models that do not expose a temperature parameter, evaluations use the vendor default configuration, as noted in Table 5. Full cost and latency details are provided in Appendix 12, Table 8.

P026Block 46

Since the MCQs were generated using models from families included in the evaluation, as shown in Table 1, contamination is a potential concern. We therefore separate the evaluated models into generator families (Claude, GPT, Llama) and non-generator families (all remaining models) and compare their average MCQA accuracy. Generator-family models achieve an average accuracy of 88.8%, whereas non-generator-family models achieve 91.0%. The generator-family models do not perform better than the non-generator-family models, so we do not observe evidence of a systematic advantage. This analysis does not rule out all possible contamination pathways, but it addresses this specific concern. The open-ended timing tasks are less likely to be affected because their topology, flow, and routing inputs were constructed specifically for TSNBench.

P027Block 47

Evaluation Metrics: Model performance on the MCQA dataset is measured using accuracy, defined as the percentage of correctly answered questions out of 939, averaged across three runs. We additionally report Expected Calibration Error (ECE) (Pavlovic, 2025) and Brier score (Hoessly, 2026) to evaluate the alignment between the model’s expressed confidence and its actual correctness. Calibration is particularly critical in safety-critical domains such as TSN, where high-confidence incorrect answers may lead to misleading configuration decisions, deadline violations, or network instability in industrial and automotive systems. We therefore also evaluate the Confidently Wrong (CW) rate to determine the fraction of incorrect answers where the model expresses high confidence (\(\geq 0.8\)). All calibration metrics are computed on the full 939-MCQA dataset across three runs per model.
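
The three calibration metrics can be sketched as follows. The Brier score and CW rate follow the definitions in the text; for ECE the sketch uses the common equal-width binning with 10 bins, which is an assumption, since the binning used in the paper is not stated here. Function names and the toy data are hypothetical.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 correctness outcome."""
    return sum((c - float(k)) ** 2 for c, k in zip(confidences, correct)) / len(correct)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, k in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, k))
    total = len(correct)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, k in members if k) / len(members)
            ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


def confidently_wrong_rate(confidences, correct, threshold=0.8):
    """Fraction of incorrect answers whose confidence is at or above the threshold."""
    wrong = [c for c, k in zip(confidences, correct) if not k]
    if not wrong:
        return 0.0
    return sum(1 for c in wrong if c >= threshold) / len(wrong)


# Toy data (not from the paper): stated confidence and correctness for five answers.
conf = [0.95, 0.90, 0.85, 0.60, 0.99]
hit = [True, True, False, False, True]
brier = brier_score(conf, hit)
ece = expected_calibration_error(conf, hit)
cw = confidently_wrong_rate(conf, hit)
print(brier, ece, cw)
```

Because the CW rate conditions only on incorrect answers, it can be high even when ECE is low, which is the distinction between aggregate calibration and safety-relevant behavior that the results discussion draws.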

P028Block 48

Results and Discussion: Table 5 reports accuracy, average (avg.) consistency, calibration, and average latency for all 16 models. The top performers are Claude Sonnet 4.5 (95.3%) and GPT-5 (95.0%), with Claude Sonnet 4.5 also achieving the lowest Brier score (0.0429), indicating strong accuracy and calibration. Llama 3.2 1B achieves the lowest accuracy (67.4%), consistent with its substantially smaller parameter count compared with the other models.

P029Block 49

A notable finding emerges from the reasoning models. Despite their stronger general reasoning capabilities, o3, GPT-5, and DeepSeek-V3.2 (Thinking) do not outperform the best non-reasoning models on MCQA, all scoring below Claude Sonnet 4.5. This suggests that TSN MCQA performance is primarily driven by domain knowledge rather than general reasoning, and that reasoning-specialized architectures offer limited advantage on declarative knowledge retrieval tasks.

P030Block 50

The calibration results reveal key differences across models. While most models are well-calibrated (ECE < 0.06), o3 has the highest ECE (0.1874) despite 94.7% accuracy, yet achieves the lowest CW rate (3.4%), rarely assigning high confidence to incorrect answers (refer to Figure 5). In contrast, many non-reasoning models have CW rates of 100%, assigning high confidence to incorrect answers. Mistral Medium 3.1 has the highest average confidence (0.9779) while maintaining 92.1% accuracy. All models have zero refusal rate, indicating that the MCQA dataset does not trigger response refusals.

P031Block 52

Figure 6 presents the reliability plot for all 16 evaluated models on the MCQA dataset.

P032Block 53

Each diagram shows the observed accuracy against the model’s expressed confidence, binned across the confidence range. A perfectly calibrated model would fall on the gray dashed diagonal line. This means the model’s confidence would perfectly align with its actual accuracy. The red shaded region indicates overconfidence, meaning the model’s confidence exceeds its actual accuracy. The green shaded region indicates underconfidence, meaning the model is more accurate than its expressed confidence suggests.

P033Block 54

In safety-critical TSN deployments, overconfidence is significantly more dangerous than underconfidence. A model that is incorrect but expresses high confidence may mislead a network engineer with an erroneous WCD estimate or misconfigured scheduling parameters. By contrast, an underconfident model that expresses uncertainty on correct answers prompts additional verification.

P034Block 55

The majority of the evaluated models sit in the high-confidence region (0.8 to 1.0) regardless of their actual accuracy. This indicates that the models tend to exhibit overconfidence.

P035Block 56

Grok 4.1 Fast (NR), Mistral Medium 3.1, Mistral Large 3, and Ministral 3 8B achieve CW rates of 100%, meaning all incorrect answers fall in the high-confidence range. This represents the most critical calibration behavior for TSN deployment. GPT-4o, Gemini 2.5 Flash, Llama 3.2 1B, and Qwen3 8B similarly exhibit CW rates exceeding 95%. A notable exception is o3, which is the only model that falls predominantly in the green underconfident zone, with a CW rate of just 3.4%. Despite having the highest ECE (0.1874) among all evaluated models, o3 is the safest among the evaluated models from a calibration perspective, as it rarely expresses high confidence on incorrect MCQA answers. This highlights an important distinction between aggregate calibration metrics and safety-relevant calibration behavior. DeepSeek-V3.2 (NT) achieves the lowest ECE (0.0105), suggesting strong overall calibration, yet maintains a CW rate of 96.4%, demonstrating that a low ECE does not guarantee safe and realistic confidence behavior.

P036Block 58

Evaluation Metrics: For the open-ended questions, we report two widely used metrics: Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE), computed per test case (TC). Each TC consists of \(n\) flows, denoted \(f_{i}\) where \(i=1,\ldots,n\). For each flow \(f_{i}\), \(\hat{y}_{{TC}_{x},f_{i}}\) denotes the WCD predicted by the model for \({TC}_{x}\) and \(y_{{TC}_{x},f_{i}}\) denotes the ground truth WCD of flow \(f_{i}\) for TC number \(x\), computed using a verified NC solver for CBS and using Eq. 22 for CQF. The MAE for each TC is defined as: \[\text{MAE}_{{TC}_{x}}=\frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_{{TC}_{x},f_{i}}-y_{{TC}_{x},f_{i}}\right|,\tag{3}\] where \(x\) denotes the TC index and \(x\in\{1,\ldots,100\}\). The MAPE for each TC is defined as: \[\text{MAPE}_{{TC}_{x}}=\frac{1}{n}\sum_{i=1}^{n}\frac{\left|\hat{y}_{{TC}_{x},f_{i}}-y_{{TC}_{x},f_{i}}\right|}{y_{{TC}_{x},f_{i}}}\times 100.\tag{4}\] The overall MAE and MAPE for a model are obtained by averaging across all 100 TCs: \[\text{MAE}=\frac{1}{100}\sum_{x=1}^{100}\text{MAE}_{{TC}_{x}},\qquad\text{MAPE}=\frac{1}{100}\sum_{x=1}^{100}\text{MAPE}_{{TC}_{x}}.\tag{5}\] We additionally report the median MAE across TCs as a robust measure against outlier TCs. Further examples and details on the evaluation metrics are provided in Appendix 13 and Table 9.
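
Eqs. (3) to (5) and the median MAE can be computed as in the sketch below. It assumes predictions and ground truth are stored per test case as dictionaries keyed by flow ID and that every ground-truth flow has a prediction; how the paper scores missing flows is not specified here, so that case is deliberately left out. The function names and toy values are hypothetical.

```python
from statistics import mean, median


def tc_mae(pred, truth):
    """Eq. (3): mean absolute error over the flows of one test case."""
    return mean(abs(pred[f] - truth[f]) for f in truth)


def tc_mape(pred, truth):
    """Eq. (4): mean absolute percentage error over the flows of one test case."""
    return mean(abs(pred[f] - truth[f]) / truth[f] for f in truth) * 100


def overall(per_tc_values):
    """Eq. (5): average a per-TC metric across all test cases."""
    return mean(per_tc_values)


# Toy data (not from the paper): two test cases, WCD values in microseconds.
truth = [{"F0": 200.0, "F1": 100.0}, {"F0": 400.0}]
pred = [{"F0": 260.0, "F1": 60.0}, {"F0": 300.0}]

maes = [tc_mae(p, t) for p, t in zip(pred, truth)]
mapes = [tc_mape(p, t) for p, t in zip(pred, truth)]
print(overall(maes), median(maes), overall(mapes))
```

Averaging per TC first, then across TCs, gives every test case equal weight regardless of how many flows it contains; the median across TCs is less sensitive to a single test case with very large errors.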

P037Block 59

Results and Discussion: Table 3 presents the WCD computation results for both CBS and CQF across all 100 TCs. The central finding is a striking dissociation between MCQA accuracy and computational reasoning performance. Models that achieve above 90% accuracy on MCQA still fail substantially on open-ended WCD computation, with the best-performing model, GPT-5, achieving a median MAE of 92.4 μs on CBS, which is concerning because industrial TSN traffic can have strict timing requirements (Ekrad et al., 2025). Detailed per-TC results are provided in Appendix 13.

P038Block 60

For CBS, most models produce large errors, with many exceeding 200 μs MAE and 70% MAPE. Several models exhibit distinct failure modes. On CBS, Llama 3.2 1B responds to fewer than 50 of the evaluated TCs, returning all-zero WCD values for a few TCs and partially incorrect values for others, with incomplete flow coverage in all responses. Grok 4.1 Fast (Reasoning) returns truncated JSON, providing flow profile metadata but no WCD values, suggesting that the model hit an output length limit. DeepSeek-V3.2 (Thinking) returns empty responses for more than 70 TCs across both mechanisms. The NC-based computation required for CBS is mathematically demanding, and the zero-shot setting reveals that most models cannot independently recall or correctly apply the full NC methodology. Among models that produce valid CBS responses, GPT-5 achieves the best performance (MAE 150.2 μs, MAPE 36.2%). Notably, OpenAI reasoning models and Grok 4.1 Fast perform better on CBS than non-reasoning models, with GPT-5 achieving substantially lower MAE than all non-reasoning models, suggesting that multi-step mathematical reasoning capability provides an advantage for NC-based WCD computation even when it does not improve MCQA accuracy.

P039Block 61

For CQF, performance is more varied, with median MAE ranging from 1.2 μs (GPT-4o) to 1,046 μs (Ministral 3 8B), and MAPE ranging from 41.8% (Mistral Large 3) to 1705.5% (Ministral 3 8B). GPT-4o achieves the lowest median MAE on CQF (1.2 μs, MAPE 61.9%) despite failing completely on CBS, suggesting it can correctly apply the CQF closed-form equation. Mistral Large 3 achieves the lowest MAPE on CQF (41.8%), indicating the most accurate relative WCD estimation across all evaluated models. Llama 3.2 1B exhibits the most severe hallucination failure, fabricating up to 1,013 flows (flow 0–1012) instead of predicting WCD for the actual flows (fewer than 30 flows per TC), and returning WCD = 0 for all. Qwen3 8B fails to produce any response for either CBS or CQF due to repeated API timeouts. Ministral 3 8B, despite being a small model, produces valid responses for both CBS and CQF but with large errors (MAPE 25498.1% for CBS and 1705.5% for CQF), demonstrating that context handling is necessary but not sufficient for correct WCD computation.

P040Block 62

Comparison across MCQA and open-ended questions: Figure 7 illustrates the performance differences between models across two evaluation types, MCQA and open-ended questions. The right-hand figure shows the MAE for a one-switch topology across different models, while the left-hand figure presents the MCQA accuracy. The MCQA accuracy remains high, above 80%, for all models except Llama 3.2 1B. However, the MAE is still significant for TSN flows with deadlines in the range of 1000 to 5000 μs. Figure 18 further presents the performance differences between models for MCQA and open-ended questions in a ring topology.

P041Block 64

We present TSNBench, the first benchmark for evaluating LLM proficiency in Time-Sensitive Networking (TSN), comprising 939 expert-validated multiple-choice questions (MCQs) and 100 open-ended questions per mechanism for Credit-Based Shaper (CBS) and Cyclic Queuing and Forwarding (CQF). The ground truth WCD values are computed using a verified Network Calculus (NC) solver for CBS and closed-form mathematical upper bounds for CQF. We evaluate 16 LLMs and find that models achieve 67–95% MCQA accuracy yet fail substantially on open-ended WCD computation, with the best model (GPT-5) still showing a Mean Absolute Percentage Error (MAPE) of 36.2% on CBS. Although CBS is the older and more extensively researched mechanism, models cannot correctly apply NC, whereas CQF, with its simpler closed-form equation, is handled more successfully, indicating that WCD computation performance is governed by mathematical complexity rather than mechanism maturity. TSNBench demonstrates that MCQ benchmarks substantially overestimate LLM capability in safety-critical domains.

P042Block 65

Limitations and Future Directions: TSNBench has three primary limitations. First, the MCQA dataset is generated from open-access research papers, limiting coverage of certain mechanisms. Second, the open-ended evaluation covers only CBS and CQF. Extending to TAS is a natural next step, though its NP-hard gate control list (GCL) synthesis problem poses additional challenges beyond CBS and CQF. Third, the open-ended tasks evaluate standalone zero-shot model behavior under a closed-book prompt and should not be interpreted as a recommended deployment workflow for safety-critical TSN systems. Evaluating LLMs in settings where they produce checkable artifacts verified by deterministic analysis tools, and evaluating whether providing NC equations in the prompt improves WCD computation accuracy, are important directions for future versions of TSNBench.

P043Block 69

While TSNBench fills a significant research gap and takes a step toward evaluating the TSN capabilities of LLMs, it has several limitations:

P044Block 70

Dataset scope: The open-ended questions in TSNBench currently cover only CBS and CQF. Evaluating other TSN mechanisms is necessary to cover the full set of TSN mechanisms.

P045Block 71

Prompt Design: TSNBench does not provide the model with any mathematical equations as input, neither for the NC-based WCD calculation for CBS nor for the upper-bound delay calculation for CQF.

P046Block 72

MCQA Scope: The MCQs are developed solely from published research papers; the IEEE standards are not used to generate them. Resolving the licensing issue and adding MCQs derived from the IEEE 802.1 standards would strengthen the MCQA dataset.

P047Block 73

Topology coverage: The open-ended questions in TSNBench currently cover three topologies: one-switch, medium-mesh, and ring. Covering more diverse topologies and flow parameters would give a more comprehensive evaluation.

P048Block 75

To address the limitations of TSNBench, we propose the following additions and improvements in the future version of TSNBench.
1. Larger and more diverse dataset: Our current TSNBench dataset covers 100 TCs across three topology types. In future versions, we will include larger and more complex topologies with higher flow counts. As model performance improves, more complex open-ended evaluations should be integrated with complex topologies and combined TSN mechanisms.
2. Additional scheduling mechanisms: TSNBench currently evaluates CBS and CQF. Future versions should extend to TAS and ATS to cover a broader range of the TSN standard suite.
3. Updated MCQA: Our MCQA dataset was developed using open-source research documents. In future work, we will update the dataset with MCQAs formulated directly from TSN standards.
4. Fine-tuned and domain-adapted models: TSNBench currently evaluates general-purpose LLMs without any TSN-specific fine-tuning. Future versions should benchmark domain-adapted models trained on TSN standards and network calculus literature.

P053Block 81

TSNBench enables the real-time systems community and the machine learning community to objectively measure LLM performance and readiness for management and deployment assistance in safety-critical deterministic networks. By highlighting the critical aspects of TSN and the gap in model performance between MCQA and computational reasoning, TSNBench draws attention to model shortcomings that may lead to misconfigurations and safety-critical failures. This benchmark provides a concrete direction for improving LLMs for deterministic networking. TSNBench further highlights the potential benefits of using LLMs to automate the management and deployment of TSN networks. Moreover, the open-sourced ground truth WCD values computed by NC solvers provide a reliable resource for the community to further evaluate different benchmarking datasets.

P054Block 83

While TSNBench is intended to advance research on LLM proficiency in TSN, we acknowledge the following potential negative impacts.

P055Block 84

Overreliance on model outputs: Models trained on the open-access dataset provided by TSNBench may achieve high accuracy on WCD analysis tasks, which could lead practitioners to use such models directly in real-world deployments without independent verification. Any WCD values or network configuration decisions produced by an LLM should be verified using formally verified solvers and NC tools before real-world deployment.

P056Block 85

False confidence from MCQA performance: Our results demonstrate that strong MCQA performance does not transfer to open-ended WCD estimation. A practitioner or system engineer who evaluates an LLM solely on MCQA benchmarks may incorrectly conclude that the model is suitable for TSN configuration tasks, leading to unsafe deployments in systems where timing guarantees are required.

P057Block 86

Data contamination and benchmark overfitting: As TSNBench is released as an open-access dataset, future models may be trained directly on the benchmark questions, leading to inflated performance that does not reflect genuine TSN reasoning capability. We recommend that researchers introduce randomization in the test cases to prevent bias in results. Researchers should be cautious when interpreting results from models whose training data may overlap with the TSNBench dataset.

P058Block 87

Misuse of the dataset: The dataset can be used to train models to configure TSN networks. Owing to the safety-critical nature of TSN applications, such models could potentially be exploited by attackers to manipulate network configurations, introduce timing violations, or deliberately cause deadline misses in industrial and automotive systems.

P059Block 89

Time-Sensitive Networking (TSN) (Finn, 2018) is a set of amendments and additions to the IEEE 802.1 standards that, since its inception in 2012, has become one of the most relevant technologies for enabling deterministic and real-time communications over Ethernet networks. TSN extends standard Ethernet by introducing mechanisms for bounded latency, low jitter, and high reliability, making it suitable for applications such as industrial automation, automotive systems, and professional audio-video networks. Figure 8 showcases a simple TSN network with flows.

P060Block 90

In TSN, communication between end-stations is based on the transmission of Ethernet frames across a network of interconnected Ethernet links and TSN switches. These switches, as well as the output ports of end-stations, implement a queuing architecture with up to eight First-In-First-Out (FIFO) queues, each associated with one of the eight traffic priorities defined in IEEE 802.1Q (31). TSN is not limited to the wired domain: the growing need for deterministic communication has extended to the wireless domain, driving significant interest in wireless TSN networks. Although TSN is fundamentally an IEEE 802.1 bridged Ethernet technology, wireless and 5G-TSN (Debnath et al., 2023a) integration requires additional adaptation or translation functions, together with time-synchronization mechanisms that preserve deterministic latency guarantees across heterogeneous network segments. We showcase a 5G-TSN system in Figure 9, where TSN senders transmit mixed-criticality traffic to wireless receiver nodes over a TSN switch and a 5G system. Some of the most commonly used abbreviations in TSN are given in Table 4.

P061Block 91

Frames are classified into traffic classes and assigned to egress queues based on their priority, with transmission selection typically governed by strict priority. Industrial TSN traffic is commonly categorized into traffic types such as isochronous traffic, cyclic-synchronous traffic, cyclic-asynchronous traffic, network-control traffic, alarms and events, configuration and diagnostics, and best-effort traffic (Ademaj et al., 2019). These traffic types require different timing guarantees: safety-critical isochronous traffic is typically mapped to time-triggered (TT) traffic, requiring guaranteed latency and bounded jitter, and is commonly handled by time-triggered mechanisms such as the Time-Aware Shaper (TAS) (Craciunas et al., 2016; Serna Oliver et al., 2018). In contrast, cyclic-synchronous or cyclic-asynchronous traffic that requires bounded end-to-end latency but less stringent jitter control is commonly mapped to AVB stream traffic and is often supported by the Credit-Based Shaper (CBS) (Zhao et al., 2018). TSN also defines mechanisms such as Asynchronous Traffic Shaping (ATS) (Specht and Samii, 2016; Debnath et al., 2023b; Nasrallah et al., 2019), Frame preemption (FP) (Debnath et al., 2024), and Cyclic Queuing and Forwarding (CQF) (Wang et al., 2023; Debnath et al., 2025a; Yan et al., 2020) to provide deterministic communication under different traffic and deployment assumptions.

P062Block 92

These mechanisms regulate when and how frames are transmitted, allowing the network to provide guarantees such as bounded delay, bounded jitter, and controlled bandwidth allocation. In the MCQA dataset of TSNBench, we cover the basics of different TSN mechanisms, including TAS, CBS, ATS, and CQF. The MCQAs are theoretical in nature and cover the basic understanding of the mechanisms without going into their mathematical or analytical details. In contrast, for the open-ended questions, we evaluate the capability of the models to perform numerical analysis, formulate mathematical equations, and find the WCD values for the flows in the network. For this, we selected two TSN mechanisms: CBS and CQF. The WCD values of flows using the CBS mechanism are calculated using NC analysis, which is mathematically complex. Therefore, we also evaluate CQF as a simpler mechanism: the WCD values of flows using CQF can be calculated directly from the route of the flow and the cycle duration. The working mechanisms and architectures of CQF and CBS are described in detail in Appendices 9 and 10. The theory of NC is further explained, along with the mathematical equations, in Appendix 8.

P063Block 94

Network Calculus (NC) is a theory for calculating worst-case bounds in communication networks based on min-plus algebra. Its basic paradigm involves two operators: convolution $\otimes$,

P064Block 95

$$(f \otimes g)(t) = \inf_{0\leq s\leq t}\{f(t-s)+g(s)\}, \tag{6}$$
and deconvolution $\oslash$,

P065Block 96

$$(f \oslash g)(t) = \sup_{s\geq 0}\{f(t+s)-g(s)\}. \tag{7}$$

P066Block 97

Based on this algebra, the arrival curve and the service curve are constructed to describe the maximum arriving traffic and the minimum service capability over any time interval, respectively. In the hybrid TSN/TAS+CBS architecture, the service for ET traffic is constrained not only by the bandwidth reservation, but also by high-priority TT traffic. We adopt the state-of-the-art network calculus model (Zhao et al., 2021, 2024) to ensure deadline guarantees for ET flows with an arbitrary number of SR classes in the TSN/TAS+CBS architecture. Since our open-ended CBS questions do not involve TAS, we use this model without the TAS component, applying only the CBS mechanism to the AVB flows in the network.

P067Block 98

As described in (Zhao et al., 2024), the service curve $\beta(t)$ constrains the minimum service capability, satisfying
$$\mathcal{R}^{*}(t)\geq\left(\mathcal{R}\otimes\beta\right)(t). \tag{8}$$
The function $\mathcal{R}(t)$ (resp. $\mathcal{R}^{*}(t)$) is the input (resp. output) cumulative function counting the total data bits of the flow that arrive at (resp. depart from) the server up to time $t$. A typical example of a service curve is the rate-latency form,
$$\beta_{R,T}(t)=R[t-T]^{+} \tag{9}$$
with the service rate $R$ and latency $T$. The notation $[x]^{+}$ equals $x$ if $x\geq 0$, and 0 otherwise.
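The operators in Eqs. (6) and (7) and the rate-latency curve in Eq. (9) can be explored numerically. The following is a minimal sketch that samples curves on an integer time grid; the function names are illustrative, and the deconvolution is truncated to the sampled horizon, so it only approximates the supremum over all $s \geq 0$.

```python
from typing import List


def rate_latency(rate: float, latency: int, horizon: int) -> List[float]:
    """Samples of beta_{R,T}(t) = R * [t - T]^+ at t = 0, 1, ..., horizon - 1."""
    return [rate * max(t - latency, 0) for t in range(horizon)]


def min_plus_conv(f: List[float], g: List[float]) -> List[float]:
    """(f (x) g)(t) = min over 0 <= s <= t of f(t - s) + g(s), on the grid."""
    n = min(len(f), len(g))
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(n)]


def min_plus_deconv(f: List[float], g: List[float]) -> List[float]:
    """(f (/) g)(t) = max over s >= 0 of f(t + s) - g(s), truncated to the grid."""
    n = min(len(f), len(g))
    return [max(f[t + s] - g[s] for s in range(n - t)) for t in range(n)]


# Two rate-latency servers in sequence, convolved into one equivalent curve.
beta1 = rate_latency(rate=3, latency=2, horizon=20)
beta2 = rate_latency(rate=5, latency=4, horizon=20)
print(min_plus_conv(beta1, beta2))
```

With integer parameters, the convolution of the two sampled curves coincides on the grid with a rate-latency curve of rate $\min(3, 5)$ and latency $2 + 4$, the well-known concatenation property of rate-latency servers.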

P068Block 99

In the hybrid TSN/TAS+CBS architecture, the CBS service curve (Zhao et al., 2021) for the arbitrary SR Class $M_{i}$ ($i\in[1,N_{SR}]$) with the impact of TT traffic at the output port $h$ is,
$$\beta^{h}_{M_{i}}(t)=idSl^{h}_{M_{i}}\left[t-\frac{\alpha_{TAS}^{h}(t)}{C}-\frac{c_{M_{i}}^{h,\max}}{idSl^{h}_{M_{i}}}\right]^{+}_{\uparrow}, \tag{10}$$
where $c_{M_{i}}^{h,\max}$ is the credit upper bound for SR Class $M_{i}$,
$$c_{M_{i}}^{h,\max}=idSl^{h}_{M_{i}}\cdot\frac{\sum_{j=1}^{i-1}c_{M_{j}}^{h,\min}-l^{h,\max}_{>i}}{\sum_{j=1}^{i-1}idSl^{h}_{M_{j}}-C}, \tag{11}$$
where $l^{h,\max}_{>i}=\max_{j>i}\{l^{h,\max}_{M_{j}},l^{h,\max}_{BE}\}$ is the maximum frame size with priority lower than Class $M_{i}$ at $h$, $l^{h,\max}_{M_{j}}$ is the maximum frame size of Class $M_{j}$ at $h$, and $c_{M_{i}}^{h,\min}$ is the lower credit bound of SR Class $M_{i}$,
$$c_{M_{i}}^{h,\min}=sdSl^{h}_{M_{i}}\cdot\frac{l^{h,\max}_{M_{i}}}{C}. \tag{12}$$
$\alpha_{TAS}^{h}(t)$ in Eq. (10) is the arrival curve of TT traffic scheduled by GCL.

P069Block 100

The arrival curve $\alpha(t)$ constrains the arrival process of the flow, satisfying
$$\mathcal{R}(t)\leq\left(\mathcal{R}\otimes\alpha\right)(t). \tag{13}$$
A typical example of an arrival curve is the burst-rate form,

P070Block 101

$$\alpha(t)=b+\rho\cdot t, \tag{14}$$

P071Block 102

for $t>0$ and 0 otherwise, with the parameters $b$ as the maximum burst tolerance and $\rho$ as the long-term rate of the flow.

P072Block 103

For each ET flow $f$ at its source ES $h_{0}$, the arrival curve can be modeled as,

P073Block 104

$$\alpha_{f}^{h_{0}}(t)=b_{f}^{h_{0}}+\rho_{f}^{h_{0}}t, \tag{15}$$
where $b_{f}^{h_{0}}=l_{f}$, and $\rho_{f}^{h_{0}}=l_{f}/P_{f}$. The arrival curve of flow $f$ at intermediate node $h$ is the output arrival curve of $f$ departing from the server $h^{-}$,

P074Block 105

$$\alpha_{f}^{h}(t)=\alpha_{f}^{h^{-}}\oslash\delta_{D_{f}^{h^{-}}}(t), \tag{16}$$

P075Block 106

where $D_{f}^{h^{-}}$ is the latency upper bound of flow $f$ queuing at server $h^{-}$, and $\delta_{D}(t)$ is the pure-delay function.
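As a worked illustration of Eq. (16), the deconvolution with a pure delay can be expanded for the burst-rate form of Eq. (14). This sketch assumes the usual definition of the pure-delay function, $\delta_{D}(s)=0$ for $s\leq D$ and $+\infty$ otherwise (the text above does not spell it out), a non-negative rate $\rho$, and $t>0$:

```latex
\begin{aligned}
(\alpha \oslash \delta_{D})(t)
  &= \sup_{s \geq 0}\{\alpha(t+s) - \delta_{D}(s)\} \\
  &= \sup_{0 \leq s \leq D} \alpha(t+s) \\
  &= \alpha(t+D) = (b + \rho D) + \rho\, t .
\end{aligned}
```

Under these assumptions the output curve keeps the rate $\rho$, while the burst grows from $b$ to $b+\rho D$, so a larger delay bound at one hop directly inflates the burst seen at the next hop.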

P076Block 107

The aggregate arrival curve for ET flows of SR Class $M_{i}$ at $h$ is obtained by summing the arrival curves of individual flows. It also incorporates the link shaping curve and the CBS shaping curve to improve the tightness of the analysis results.
$$\alpha^{h}_{M_{i}}(t)=\sum_{h^{-}\in\mathcal{H}}\sum_{f\in\mathcal{F}_{M_{i}}^{h^{-},h}}\alpha^{h}_{f}(t)\wedge\sigma_{link}^{h^{-},h}(t)\wedge\sigma_{M_{i}}^{h^{-},h}(t), \tag{17}$$

P077Block 108

where $x\wedge y=\min\{x,y\}$, and $\sigma_{link}^{h^{-},h}(t)$ is the link shaping curve from the preceding output $h^{-}$ to the current output port $h$:
$$\sigma_{link}^{h^{-},h}(t)=Ct+l_{M_{i}}^{h^{-},h,\max}, \tag{18}$$
considering the packetization impact of the maximum frame size $l_{M_{i}}^{h^{-},h,\max}$ of flows with Class $M_{i}$ from $h^{-}$ to $h$. $\sigma_{M_{i}}^{h^{-},h}(t)$ is the CBS shaping curve of Class $M_{i}$ from $h^{-}$ to $h$:
$$\sigma_{M_{i}}^{h^{-},h}(t)=idSl^{h^{-}}_{M_{i}}\left[t-\frac{\beta^{h^{-}}_{TAS}(t)}{C}+\frac{c_{M_{i}}^{h^{-},\max}-c_{M_{i}}^{h^{-},\min}}{idSl^{h^{-}}_{M_{i}}}\right]+l_{M_{i}}^{h^{-},h,\max}, \tag{19}$$
$\beta^{h}_{TAS}(t)$ represents the minimum service supplied to TT traffic on the output port $h$.

P078Block 109

With NC-based Total Flow Analysis (TFA), the worst-case delay upper bound $D_{f}^{h}$ for flow $f\in\mathcal{F}_{M_{i}}^{h}$ at $h$ equals the worst-case delay upper bound $D_{M_{i}}^{h}$ for all flows with the same priority $M_{i}$ aggregating at $h$,
$$D_{f}^{h}=D_{M_{i}}^{h}=hDev(\alpha_{M_{i}}^{h},\beta_{M_{i}}^{h})=\sup_{t\geq 0}\left\{\inf\left\{\tau\geq 0\mid\alpha^{h}_{M_{i}}(t)\leq\beta^{h}_{M_{i}}(t+\tau)\right\}\right\} \tag{20}$$

P079Block 110

where $\alpha^{h}_{M_{i}}(t)$ is the arrival curve of aggregate flows of Class $M_{i}$ from Eq. (17), and $\beta^{h}_{M_{i}}(t)$ is the service curve for Class $M_{i}$ from Eq. (10). The upper bound of the worst-case end-to-end delay for the flow $f$ is then obtained by summing the per-port latency bounds along its route.
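The horizontal deviation in Eq. (20) and the per-port summation can be illustrated on a deliberately simplified case: a single flow with the burst-rate arrival curve of Eq. (14) crossing servers with the plain rate-latency service curve of Eq. (9). This sketch does not implement the CBS service curve of Eq. (10) or the aggregate curve of Eq. (17); the function names and numbers are illustrative assumptions, with units of bits and μs.

```python
from typing import Callable, Iterable, List, Tuple


def hdev_closed_form(b: float, rho: float, R: float, T: float) -> float:
    """hDev of alpha(t) = b + rho*t against beta(t) = R*[t - T]^+.

    The classical result is T + b/R when rho <= R; otherwise no finite bound.
    """
    return T + b / R if rho <= R else float("inf")


def hdev_sampled(alpha: Callable[[float], float], R: float, T: float,
                 times: Iterable[float]) -> float:
    """Eq. (20) evaluated on sample times.

    For each t, the smallest tau with beta(t + tau) >= alpha(t) is
    T + alpha(t)/R - t (clamped at 0); the bound is the maximum over t.
    """
    return max(max(0.0, T + alpha(t) / R - t) for t in times)


def end_to_end_bound(b: float, rho: float,
                     hops: List[Tuple[float, float]]) -> float:
    """Sum per-hop bounds along a route of (R, T) servers.

    After each hop the burst grows to b + rho*D, following Eq. (16).
    """
    total = 0.0
    for R, T in hops:
        d = hdev_closed_form(b, rho, R, T)
        total += d
        b += rho * d
    return total


# One 12000-bit frame every 1000 us (rho = 12 bits/us) over two hops,
# each serving at 75 bits/us with a 10 us latency (hypothetical values).
print(hdev_closed_form(12000, 12, 75, 10))
print(end_to_end_bound(12000, 12, [(75, 10), (75, 10)]))
```

The second hop yields a larger bound than the first even though both servers are identical, because the burst entering it has grown by $\rho D$.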

P080Block 112

Credit-Based Shaper (CBS) is a TSN mechanism designed to prevent starvation of lower-priority traffic while guaranteeing a reserved portion of bandwidth for higher-priority queues, thereby providing reliability through bounded end-to-end delays. Traffic assigned to queues using CBS is typically referred to as Audio Video Bridging (AVB) traffic.

P081Block 113

Here, we build on the description from (Bujosa Mateu, 2024). In CBS, each AVB queue is associated with a credit value. This credit increases over time when a frame is waiting to be transmitted or when the credit is negative, and decreases while a frame is being transmitted. Moreover, if the credit is positive and there are no AVB frames waiting to be transmitted, the credit is immediately reset to 0. The rates at which credit is increased and decreased are defined by the parameters idleSlope and sendSlope, respectively. Each queue implementing CBS is configured with its own idleSlope and sendSlope values, which determine its allocated bandwidth share. In particular, the bandwidth reserved for a queue is expressed as Eq. (21). A queue is eligible for transmission only when its credit is zero or positive.

P082Block 114

$$\mathrm{Reserved\;BW}=\frac{\textit{idleSlope}}{\textit{idleSlope}+\textit{sendSlope}}\cdot BW \tag{21}$$

P083Block 115

Consider the example illustrated in Figure 11, which includes two AVB queues and one Best Effort (BE) queue. Frames 1 and 4 are assigned to the higher-priority AVB queue, while frames 2 and 3 belong to the lower-priority AVB queue and the BE queue, respectively.

P084Block 116

At time T0, both AVB queues are eligible for transmission. Due to strict priority scheduling, the higher-priority AVB queue (priority 2) is selected, and frame 1 is transmitted. During this transmission, its credit decreases, while the credit of the lower-priority AVB queue increases because it is waiting.

P085Block 117

At time T1, the higher-priority AVB queue has accumulated negative credit and is therefore no longer eligible for transmission. As a result, the lower-priority AVB queue is selected, and frame 2 is transmitted, even though a higher-priority frame (frame 4) is waiting. During this time, the lower-priority queue’s credit decreases, while the higher-priority queue’s credit recovers.

P086Block 118

By time T2, both AVB queues have negative credit, making them ineligible for transmission. Consequently, the BE queue is selected, and frame 3 is transmitted, despite the presence of a higher-priority AVB frame waiting.

P087Block 119

Finally, at time T3, the credit of the higher-priority AVB queue has recovered to zero, making it eligible again. Therefore, frame 4 is transmitted.
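The credit dynamics in this walkthrough can be checked with simple arithmetic. The sketch below assumes the common IEEE 802.1Q convention that sendSlope equals idleSlope minus the link rate (so it is negative), and it reads the sendSlope in Eq. (21) as a magnitude; the 100 Mbps link and 75% idleSlope match the constants used for the CBS questions, while the 1500-byte frame is a made-up example. Rates are in bits per μs and credit is in bits.

```python
LINK_RATE = 100.0                    # bits per microsecond (100 Mbps)
IDLE_SLOPE = 75.0                    # bits per microsecond (75% of the link)
SEND_SLOPE = IDLE_SLOPE - LINK_RATE  # negative: credit drains while sending


def reserved_bandwidth(idle_slope: float, send_slope_magnitude: float,
                       bandwidth: float) -> float:
    """Eq. (21), with sendSlope taken as a positive magnitude (assumption)."""
    return idle_slope / (idle_slope + send_slope_magnitude) * bandwidth


def credit_after_transmission(credit: float, frame_bits: float) -> float:
    """Credit once a frame of the given size has been fully transmitted."""
    tx_time = frame_bits / LINK_RATE
    return credit + SEND_SLOPE * tx_time


def recovery_time(credit: float) -> float:
    """Time for a negative credit to climb back to zero at idleSlope."""
    return 0.0 if credit >= 0 else -credit / IDLE_SLOPE


frame_bits = 1500 * 8                # hypothetical 1500-byte frame
credit = credit_after_transmission(0.0, frame_bits)
print(reserved_bandwidth(IDLE_SLOPE, abs(SEND_SLOPE), LINK_RATE))
print(credit, recovery_time(credit))
```

Under these assumptions, the queue that sent frame 1 must wait for its credit to recover before it becomes eligible again, which is the gap between T1 and T3 in which frames 2 and 3 are served.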

P088Block 121

Cyclic Queuing and Forwarding (CQF) (Debnath et al., 2025b) is a TSN shaping mechanism that uses a single cycle duration, denoted $T$, across the entire network. $T$ is the minimum scheduling unit into which TSN flows are placed, and it defines the granularity of the end-to-end delay of the flows in the network. In TSNBench, $T$ is expressed in μs. In a TSN switch, every egress port has eight queues, and TSN flows are stored in these queues according to their priority. In CQF, two queues are used for each egress port: an even queue and an odd queue. Figure 14 shows the basic working diagram of CQF with two queues (even and odd). As shown in Figure 14, CQF employs two queues, say $Q_{8}$ and $Q_{7}$, for TT flows and operates them in a ping-pong manner: $Q_{7}$ receives and $Q_{8}$ transmits during the first cycle slot ($T_{1}$), and during the second cycle slot ($T_{2}$), $Q_{8}$ receives and $Q_{7}$ transmits. Selecting or allotting a cycle slot for a flow means selecting the cycle slot number (within the hyperperiod $H$) and the queue for the flow.
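The ping-pong operation can be written down as a tiny sketch. It assumes cycle slots are numbered from 1 and that time 0 marks the start of the first slot; the function names are illustrative.

```python
from typing import Dict


def cycle_slot_at(time_us: float, cycle_us: float) -> int:
    """1-based index of the cycle slot that contains the given time."""
    return int(time_us // cycle_us) + 1


def cqf_queue_roles(cycle_slot: int) -> Dict[str, str]:
    """Queue roles for a slot: Q7 receives in odd slots, Q8 in even slots."""
    if cycle_slot % 2 == 1:
        return {"receive": "Q7", "transmit": "Q8"}
    return {"receive": "Q8", "transmit": "Q7"}


# With a hypothetical 250 us cycle, a frame arriving at t = 300 us lands in
# slot 2, is stored in Q8, and is sent when Q8 transmits in slot 3.
slot = cycle_slot_at(300.0, 250.0)
print(slot, cqf_queue_roles(slot), cqf_queue_roles(slot + 1))
```

A frame received in one slot is therefore forwarded in the next one, which is why the delay grows in whole multiples of the cycle duration per hop.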

P089Block 122

In the CQF evaluation of TSNBench, we provide the cycle duration ($\mathrm{T}$) and network-specific delays to the model as input through the prompt.

P090Block 123

WCD CQF: The worst-case end-to-end delay of the TT flows in the CQF network is quantified as follows:
$$\mathrm{Max\;Delay}=f_{i}.\phi+(\mathrm{SW_{num}}+1)\cdot\mathrm{T}+\xi, \tag{22}$$
where $f_{i}.\phi$ is the offset of the flow $f_{i}$ in μs, $\mathrm{SW_{num}}$ is the total number of switches in the route of the TT flow, $\mathrm{T}$ is the cycle duration in μs, and $\xi$ denotes the network-specific delays: processing delay, propagation delay, and time synchronization error ($\mathrm{sync_{error}}$).
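Eq. (22) translates directly into code. In this minimal sketch the function and parameter names are illustrative, and $\xi$ is passed in as a single pre-summed value because the text does not state how its components are aggregated along the route.

```python
def cqf_max_delay(offset_us: float, sw_num: int, cycle_us: float,
                  xi_us: float) -> float:
    """Worst-case end-to-end delay of a TT flow under CQF, per Eq. (22).

    offset_us: flow offset f_i.phi in microseconds
    sw_num:    number of switches on the flow's route
    cycle_us:  CQF cycle duration T in microseconds
    xi_us:     network-specific delays, supplied as one total (assumption)
    """
    return offset_us + (sw_num + 1) * cycle_us + xi_us


# Hypothetical flow: zero offset, 3 switches, 250 us cycle, xi = 3 us.
print(cqf_max_delay(offset_us=0, sw_num=3, cycle_us=250, xi_us=3))
```

The bound depends only on the hop count and the cycle duration, not on competing traffic, which is what makes CQF the simpler of the two open-ended tasks.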

P091Block 126

To maintain the same standards across all human reviewers, we use the following rules to evaluate the MCQA dataset. There are four possible options for every question in the MCQA dataset.

P092Block 127

1. Accept:
   i. Technically correct.
   ii. Clearly worded and self-contained.
   iii. Unambiguous options.
   iv. Accurate and sufficient explanation.
   v. The correct answer is actually the correct answer.
2. Reject:
   i. Incorrect or misleading.
   ii. Poorly constructed beyond revision.
   iii. Irrelevant to TSN.
   iv. Incomplete information.
   v. Too paper-dependent.
   vi. Duplicate questions.
3. Revise:
   i. Minor issues in grammar, clarity, or wording.
   ii. Options need improvement.
   iii. Explanation needs refinement.
4. Doubtful:
   i. Paper-specific or uncertain about the correctness of the question.
   ii. Explanation seems questionable.
   iii. Needs further clarification.
   For a doubtful multiple-choice question, we read the research paper and re-evaluate the question. Afterward, the decision can be accept, reject, or revise; if it is still doubtful, we send it to another expert reviewer for a consensus-based group decision.

P114Block 149

Key principles followed while reviewing the dataset: We ensured that the MCQAs were technically accurate and aligned with TSN fundamentals. We avoided tricky questions and preferred clarity over complexity. The same set of rules was given to all expert reviewers who worked on this dataset and served as human judges. After the review, 185 questions were revised by the domain experts, as shown below in Table 5.

P115Block 151

We present three representative sample questions from our MCQA dataset below.

P116Block 152

Q1 (TSN Keyword): What does TAS stand for in TSN traffic management?
A. Transmission Access Scheduler
B. Traffic Analysis System
C. Time-Aware Shaper
D. Traffic Admission Service
Correct Answer: C

P117Block 153

Q2 (Research Paper): In a Cyclic Queuing and Forwarding (CQF) network what fundamental limitation would prevent effective fault tolerance using Frame Replication and Elimination for Reliability (FRER) in a linear topology where each switch has maximum transmission unit (MTU) sized frames frequently queued?
A. CQF’s ping-pong queue switching would create timing conflicts with FRER’s frame elimination mechanism.
B. EMI interference would corrupt both original and replicated frames equally, making spatial redundancy ineffective.
C. FRER cannot detect bit errors caused by EMI since it lacks Cyclic Redundancy Check (CRC) verification capabilities.
D. Linear topologies cannot provide the disjoint paths required for FRER’s spatial redundancy approach, forcing expensive hardware additions.
Correct Answer: D

P118Block 154

Q3 (Research Paper): What fundamental challenge makes the Time Aware Shaper (TAS) implementation complex despite its ability to provide guaranteed end-to-end delays?
A. The requirement to synchronize all network devices to a common time reference.
B. The need to maintain separate queues for each traffic class simultaneously.
C. The difficulty in estimating worst-case transmission times for variable-length frames.
D. The synthesis of the gate control list, which is an NP-complete problem.
Correct Answer: D

P119Block 156

For the CBS and CQF mechanisms, two different approaches are used for WCD calculation. NC is used to calculate the CBS WCD, whereas an analytical mathematical calculation is used to find the WCD for the CQF mechanism. Since these two mechanisms work differently, we design prompts tailored to each mechanism.

P120Block 157

Role: We start by defining the role of the model: “You are an expert Time-Sensitive Networking (TSN) orchestrator.” We then inject three network inputs, all provided in text format: (i) the network topology, (ii) the TSN flow information, and (iii) the routes of the flows. Following the prompt-as-program approach (Reynolds and McDonell, 2021), we keep these inputs separate from the prompt logic, so that different topologies, flows, and routes can be evaluated while the prompt itself remains the same across network topologies and parameters.
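A minimal sketch of this separation, assuming a plain string template. The names `PROMPT_LOGIC` and `build_prompt`, the placeholder names, and the placeholder input strings are hypothetical and are not taken from the TSNBench implementation:

```python
from string import Template

# Fixed prompt logic; only the three text inputs vary between test cases.
PROMPT_LOGIC = Template(
    "You are an expert Time-Sensitive Networking (TSN) orchestrator.\n"
    "Network Topology:\n$topology\n"
    "Flow Information:\n$flows\n"
    "Routing of the Flow:\n$routes\n"
)


def build_prompt(topology: str, flows: str, routes: str) -> str:
    """Fill the fixed template with the text inputs of one test case."""
    return PROMPT_LOGIC.substitute(topology=topology, flows=flows, routes=routes)


# Hypothetical inputs: the real files (e.g. TC1_topo.txt) are not reproduced here.
prompt_tc1 = build_prompt(
    topology="<topology text>",
    flows="0,node2_1,node5_2,2500,709,965",
    routes="<route text>",
)
```

Swapping in another test case changes only the three arguments, so every model sees the same instructions.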

P121Block 158

Constants: To correctly calculate the WCD, information about the network parameters is required. To prevent the model from assuming these values and to keep the constant values consistent across all models, we provide this information in the prompt.

P122Block 159

Constants for CBS open-ended questions: Bandwidth = 100 Mbps, Propagation delay = 1 μs, Switching delay = 1 μs, Time synchronization error = 1 μs, the switches of the network are cut-through switches, and IdleSlope = 75%.

P123Block 160

By controlling these network parameters, we directly mitigate hallucinations and assumptions about numerical values.
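As an illustration, the constants can be held in one shared structure from which dependent quantities are derived rather than guessed. The sketch below is an assumption about how such a structure might look (`CBS_CONSTANTS` and `cbs_slopes` are hypothetical names). It relies on the IEEE 802.1Qav relation sendSlope = idleSlope − port transmit rate:

```python
# Constants taken from the CBS prompt; the dictionary layout is hypothetical.
CBS_CONSTANTS = {
    "bandwidth_mbps": 100.0,
    "propagation_delay_us": 1.0,
    "switching_delay_us": 1.0,
    "time_sync_error_us": 1.0,
    "idle_slope_fraction": 0.75,
}


def cbs_slopes(constants):
    """Return (idleSlope, sendSlope) in Mbps for one egress port."""
    rate = constants["bandwidth_mbps"]
    idle_slope = constants["idle_slope_fraction"] * rate
    send_slope = idle_slope - rate  # negative: credit drains while transmitting
    return idle_slope, send_slope


idle_slope, send_slope = cbs_slopes(CBS_CONSTANTS)
print(idle_slope, send_slope)  # 75.0 -25.0
```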

P124Block 161

Architecture Restriction: TSN supports multiple architectures that affect the Quality of Service (QoS) and the WCD of the flows. The prompt restricts the model to using only one TSN mechanism through the following directive.

P125Block 162

For the CBS mechanism, we use:

P126Block 163

TSN Mechanism: Only Credit-Based Shaper (CBS, IEEE 802.1Qav) is allowed; All flows are AVB Class A, PCP = 6, using queue 6 only.

P127Block 164

For the CQF mechanism, we use:

P128Block 165

TSN Mechanism: Only Cyclic Queuing and Forwarding (CQF, IEEE 802.1Qch) is allowed; All flows are TT, PCP = 7, using queue 7 (odd) and 6 (even) only.

P129Block 166

Our reasoning is that letting the model select the TSN architecture or mechanism is a separate benchmarking problem, where the model is evaluated on architecture design performance. In TSNBench, our goal is to benchmark LLMs in TSN. Without an explicit restriction, the model may select an incorrect or inappropriate mechanism, producing a hallucinated architecture that does not satisfy the QoS requirements of the flows. This restriction forces the model to use a single solution space. It further ensures that errors in the WCDs reported by different models stem from calculation and implementation errors within the specified mechanism, rather than from architectural faults or ambiguity in mechanism selection.

P130Block 167

Structured Output: We instruct the model through the prompt to provide the output strictly in JSON format (Yang et al., 2026).
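A strict JSON reply can be checked mechanically before scoring. The exact schema is not given in this section, so the field names `wcd_us` and `confidence` below are hypothetical, as is the function name `parse_wcd_response`. This is a sketch of the idea, not the TSNBench parser:

```python
import json


def parse_wcd_response(raw: str) -> dict:
    """Parse a model reply and validate it against a hypothetical schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError for non-JSON replies
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0.0, 1.0]")
    # Normalise every per-flow WCD (in microseconds) to float.
    wcd_us = {flow: float(value) for flow, value in data["wcd_us"].items()}
    return {"wcd_us": wcd_us, "confidence": confidence}


# Hypothetical reply in the assumed schema.
reply = '{"wcd_us": {"F0": 714.65, "F1": 821.79}, "confidence": 0.8}'
parsed = parse_wcd_response(reply)
```

Replies that are not valid JSON, or that omit a field, fail loudly instead of being silently scored.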

P131Block 169

For the open-ended questions, there are three variable entries: network topology, flow information, and flow routing. We use the K-shortest path algorithm to determine the routes of the flows. The routes are then directly provided to the models as input for further evaluation.
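The section does not state which K-shortest-path variant is used. The sketch below is a simple best-first enumeration of loop-free paths by hop count (not Yen's algorithm), adequate only for small topologies. The node names and the adjacency map are hypothetical:

```python
import heapq


def k_shortest_paths(adjacency, source, target, k):
    """Return up to k loop-free paths, fewest hops first.

    Partial paths are expanded in order of hop count, so complete paths
    reach the target in non-decreasing length.
    """
    heap = [(0, [source])]
    found = []
    while heap and len(found) < k:
        hops, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            found.append(path)
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in path:  # keep paths loop-free
                heapq.heappush(heap, (hops + 1, path + [neighbor]))
    return found


# Hypothetical three-switch topology with two end systems.
adjacency = {
    "es1": ["sw1"],
    "sw1": ["es1", "sw2", "sw3"],
    "sw2": ["sw1", "sw3", "es2"],
    "sw3": ["sw1", "sw2", "es2"],
    "es2": ["sw2", "sw3"],
}
paths = k_shortest_paths(adjacency, "es1", "es2", k=3)
```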

P132Block 170

Network Topologies Used: For the open-ended questions, we selected three different topologies to evaluate the models: a one-switch topology, a medium-mesh topology, and an industrial ring topology. Figures 15, 16, and 17 represent the one-switch, medium-mesh, and ring topologies used in TSNBench, respectively.

P133Block 171

Flow parameters: We show the flow information used in TSNBench as follows.

P134Block 172

Flow Information (TC1_flows.txt):
0,node2_1,node5_2,2500,709,965
1,node5_4,node3_2,2500,610,825
2,node0_4,node0_1,1000,786,887
3,node2_3,node4_3,2500,1088,1233
4,node0_4,node3_3,1000,1015,488
5,node0_4,node0_1,2500,926,501
...

P135Block 173

Ground Truth WCD Values The ground-truth WCD values of the flows for all open-ended test cases for the CBS mechanism are calculated using a verified NC tool (Zhao et al., 2018; Debnath et al., 2025c; Gavriluţ and Pop, 2020). For the WCD of the CQF mechanism, we use the mathematical equation given in Eq. 22.

P136Block 175

We evaluate both open-source and closed-source state-of-the-art LLMs on TSNBench. A detailed list of the models, along with their model numbers and snapshots, is given in Table 6. This ensures that the results are reproducible by the community.

P137Block 177

We evaluate the models under two configurations, for both MCQA and open-ended questions: (i) the default temperature setting (0.7) and (ii) temperature set to 0.0. Because safety-critical networks require deterministic results, the second configuration tests whether LLMs can provide consistent results when the temperature is set to 0.0. For models that do not support the temperature parameter, we use the default temperature for evaluation.

P138Block 178

Table 7 provides the accuracy and average consistency of the models for the MCQA dataset under the default temperature and temperature set to 0.0. Average consistency represents the ability of the model to provide the same results across three runs.
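The text defines average consistency only informally. One plausible reading, shown here as an assumption rather than the paper's exact formula, scores a question as consistent when all three runs give the same answer and averages that over the questions. The answers below are made up for illustration:

```python
def average_consistency(runs):
    """Fraction of questions answered identically in every run.

    `runs` holds one answer list per run, aligned by question index.
    """
    per_question = list(zip(*runs))
    consistent = sum(1 for answers in per_question if len(set(answers)) == 1)
    return consistent / len(per_question)


def accuracy(answers, answer_key):
    """Fraction of answers that match the answer key."""
    return sum(a == k for a, k in zip(answers, answer_key)) / len(answer_key)


# Hypothetical answers to four MCQA items over three runs.
runs = [
    ["C", "D", "D", "A"],
    ["C", "D", "B", "A"],
    ["C", "D", "D", "A"],
]
print(average_consistency(runs))  # 0.75
```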

P139Block 180

The cost and latency of a model are important evaluation parameters for the research community. Spending a large amount of money on benchmark evaluation is a real bottleneck for research groups, and not all models can be evaluated locally. Table 8 presents the cost and latency of evaluating the TSNBench MCQA and open-ended questions. Evaluating MCQA is considerably cheaper than evaluating open-ended questions.

P140Block 182

We provide MAE and MAPE evaluations for the open-ended questions. A sample calculation is given as follows:

P141Block 183

MAE and MAPE calculation example: Consider a model evaluated on three test cases (TCs). These three TCs may have different topologies, different flows and flow parameters, and different routes. For each TC, we have the ground-truth and predicted WCD values shown in Table 9. The ground truth is calculated using an NC solver for CBS and a mathematical equation for CQF.

P142Block 184

Per-TC MAE: Suppose TC1, TC2, and TC3 contain three, two, and three flows, respectively:

$$TC1 = \{f_{0}, f_{1}, f_{2}\}; \quad TC2 = \{f_{0}, f_{1}\}; \quad TC3 = \{f_{0}, f_{1}, f_{2}\}$$

Let $\Gamma(f_{0})$ denote the absolute error of flow $f_{0}$ in TC1, $\beta(f_{0})$ the WCD of flow $f_{0}$ predicted by the LLM, and $\Omega(f_{0})$ the ground-truth WCD of flow $f_{0}$. We calculate $\Gamma(f_{0})$ as follows:

$$\Gamma(f_{0}) = |\beta(f_{0}) - \Omega(f_{0})|$$

In the given example, let $\Gamma(f_{0}) = 12$, $\Gamma(f_{1}) = 30$, and $\Gamma(f_{2}) = 10$ for TC1. Similarly, for TC2, $\Gamma(f_{0}) = 8$ and $\Gamma(f_{1}) = 45$, and for TC3, $\Gamma(f_{0}) = 20$, $\Gamma(f_{1}) = 15$, and $\Gamma(f_{2}) = 0$. We calculate the MAE for TC1, TC2, and TC3, represented as $\text{MAE}_{\text{TC1}}$, $\text{MAE}_{\text{TC2}}$, and $\text{MAE}_{\text{TC3}}$, as follows:

$$\text{MAE}_{\text{TC1}} = (12 + 30 + 10)/3 = 17.3~\mu\text{s}$$
$$\text{MAE}_{\text{TC2}} = (8 + 45)/2 = 26.5~\mu\text{s}$$
$$\text{MAE}_{\text{TC3}} = (20 + 15 + 0)/3 = 11.7~\mu\text{s}$$

P143Block 185

For every model, we have 100 test cases, and the final MAE is averaged across all test cases (three in this example):

$$\text{MAE} = (17.3 + 26.5 + 11.7)/3 = 18.5~\mu\text{s}$$

The per-flow absolute percentage error, denoted as $\alpha(f_{0})$, is calculated as follows:

$$\alpha(f_{0}) = \frac{|\beta(f_{0}) - \Omega(f_{0})|}{\Omega(f_{0})} \times 100$$

For TC1, we calculate the MAPE as follows:

$$\text{MAPE}_{\text{TC1}} = \frac{\alpha(f_{0}) + \alpha(f_{1}) + \alpha(f_{2})}{3} = 8.7\%$$

Similarly, the MAPE for TC2 and TC3 is given as follows:

$$\text{MAPE}_{\text{TC2}} = 11.5\%, \qquad \text{MAPE}_{\text{TC3}} = 3.7\%$$

The final MAPE for each model is averaged across the three test cases:

$$\text{MAPE} = (8.7 + 11.5 + 3.7)/3 = 8.0\%$$

In TSNBench, all test cases contribute equally to the model's score, irrespective of the number of flows in the network. Within a test case, all flows are equally critical under the given network architecture and are therefore weighted equally.
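The aggregation in this example can be reproduced in a few lines. The sketch below uses the absolute errors quoted in the example. The helper names are hypothetical, and the pooled variant is included only to contrast it with the per-test-case averaging that the text describes:

```python
def mean(values):
    return sum(values) / len(values)


def ape_percent(predicted_us, truth_us):
    """Absolute percentage error of one flow, in percent."""
    return abs(predicted_us - truth_us) / truth_us * 100


# Absolute errors per flow (microseconds), as given in the example.
abs_errors_us = {"TC1": [12, 30, 10], "TC2": [8, 45], "TC3": [20, 15, 0]}

# Step 1: average within each test case. Step 2: average across test cases.
mae_per_tc = {tc: mean(errors) for tc, errors in abs_errors_us.items()}
final_mae = mean(list(mae_per_tc.values()))

# Contrast: pooling all eight flows would weight larger test cases more heavily.
pooled_mae = mean([e for errors in abs_errors_us.values() for e in errors])

print(round(final_mae, 1), pooled_mae)  # 18.5 17.5
```

The MAPE follows the same two-step averaging, with `ape_percent` applied per flow in place of the absolute error.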

P144Block 186

Test Case: TC1
TSN mechanism: CBS
You are an expert Time-Sensitive Networking (TSN) orchestrator. Your task is to calculate the worst case delay (WCD) for each TSN flow.
Input:
Network Topology (TC1_topo.txt)
Flow Information (TC1_flows.txt)
Routing of the Flow (TC1_route.txt)
Constant: Bandwidth link = 100 Mbps; Propagation delay = 1 μs; Switching delay = 1 μs; Time synchronization error = 1 μs; The switches of the network are cut-through switches.
TSN Mechanism: Only Credit-Based Shaper (CBS, IEEE 802.1Qav) is allowed; All flows are AVB Class A, PCP = 6, using queue 6 only.
Task:
1. Map each egress port’s queues and collect the set of flows traversing from that port, using the given topology, flows, and route of the flow.
2. For each egress port, use the given IdleSlope and then compute the SendSlope.
3. For each flow, construct an arrival curve from its frame size and periodicity.
4. For each port, derive a lower-bounded CBS service curve.
5. Calculate the worst case delay (WCD) in microseconds (μs) for each flow using Network Calculus method.
6. Provide the confidence score between 0.0 and 1.0 from your answers. 1.0 means mathematically or procedurally provable from given info with zero ambiguity. 0.0 means zero confidence.

P151Block 193

Grok 4.1 Fast (Non-Reasoning) output:
F0: 1452.0, F1: 1124.0, F2: 678.0, F3: 1234.0, F4: 1567.0, F5: 689.0, F6: 456.0, F7: 789.0, F8: 567.0, F9: 589.0, F10: 678.0, F11: 645.0, F12: 1123.0, F13: 789.0, F14: 456.0, F15: 678.0, F16: 789.0, F17: 1567.0, F18: 456.0, F19: 1234.0
Model Reasoning and Explanation Output:
Parameters used:
$\mathrm{send\_slope}$: $75\% \times 100~\mathrm{Mbps} = 75~\mathrm{Mbps} = 75000~\mathrm{bits}/\mu\mathrm{s}$
$\mathrm{frame\_overhead}$: 24 bytes (Ethernet header + FCS)
$\mathrm{mtu}$: 1522 bytes (max frame size)
$\mathrm{queue}$: 6 (PCP=6, AVB Class A)
$\mathrm{cbs\_credit\_max}$: $\mathrm{MTU} \times 8 = 12176$ bits
WCD equation given by the model:
$$\alpha(t) = \frac{L_{\max}}{\mathrm{Period}} + \frac{t}{\mathrm{Period}} \quad \text{(arrival curve)}$$
$$\beta(t) = R_{\mathrm{send}} \times t - C \quad \text{(CBS service curve)}$$
$$WCD = h \times \left(\mathrm{prop\_delay} + \mathrm{switch\_delay} + \mathrm{sync\_error}\right) + \max(\alpha \cap \beta)$$
$$\mathrm{SendSlope} = \mathrm{IdleSlope} \times \mathrm{Bandwidth} = 0.75 \times 100~\mathrm{Mbps}$$
$$R_{\mathrm{send}} = \mathrm{SendSlope} = 75~\mathrm{Mbps}$$
$$C = \mathrm{credit}_{\max} = \mathrm{MTU} \times 8~\mathrm{bits}$$
$$h = \text{hop count from path}$$
Per-flow WCD:
$$WCD_{F_{i}} = \sum_{h \in \mathrm{path}(F_{i})} \left(d_{\mathrm{prop}}^{h} + d_{\mathrm{switch}}^{h} + d_{\mathrm{sync}}^{h}\right) + \max_{t \geq 0}\left(\alpha_{F_{i}}(t) \cap \beta_{\mathrm{port}}(t)\right)$$

P154Block 196

Run 2 of the model (Grok 4.1 Fast (Non-Reasoning)):
WCD equation given by the model during run 2:
$$\text{Frame size:}\quad L_{i} = \mathrm{payload}_{i} + 24~\mathrm{bytes},$$
$$\text{Transmission time:}\quad T_{x_{i}} = \frac{8L_{i}}{R},$$
$$\text{Arrival curve:}\quad \alpha_{i}(t) = \frac{L_{i}}{p_{i}} \cdot t,$$
$$\text{CBS service curve:}\quad \beta(t) = \min\left\{\alpha_{h_{i}}(t),\; \mathrm{sendSlope} \cdot t + \mathrm{credit}_{\max}\right\},$$
$$\text{Worst-case delay:}\quad WCD_{i} = \max_{h \in \mathrm{path}} \sum \left(T_{\mathrm{prop},h} + T_{\mathrm{switch},h} + T_{\mathrm{queue},h}\right) + T_{\mathrm{sync}},$$
$$\text{Queueing delay:}\quad T_{\mathrm{queue},h} = \frac{L_{i}}{\alpha C} + \mathrm{burst\_interference},$$
$$\text{Send slope:}\quad \mathrm{sendSlope} = \mathrm{idleSlope} \cdot R - \mathrm{idleSlope} \cdot R = \alpha R(1 - 2\alpha);$$
Per-flow WCD:
$$WCD_{i} = \mathrm{hop\_count}_{i} \cdot (1 + 1) + \mathrm{max\_queue\_delay} + 1;$$
Expert Explanation:
The model uses different equations across runs even with the temperature set to 0.0. In the first run, the model makes several fundamental errors in evaluating the interference and blocking experienced by the queue under analysis, affecting both the arrival curve and the CBS service curve. First, it assumes the wrong maximum burst size and does not consider how the output arrival curve changes at subsequent hops along the path. Second, it assumes a non-zero maximum credit, which is possible in this scenario because of non-preemptable lower-priority frames of non-CBS traffic. However, the model applies the wrong maximum credit to the CBS service curve, misunderstanding its relation to the corresponding rate (75%). Furthermore, in the final WCD calculation, the model considers only the arrival curve of the frame under analysis. This is incorrect: the proper approach uses the aggregate arrival curve of all frames transmitted through the queue. It is also unclear whether the $\max(\alpha \cap \beta)$ operation represents the maximum horizontal deviation between $\alpha$ and $\beta$. The second run exhibits even more severe hallucinations: it computes the wrong arrival curve, with no burst size; it uses the wrong CBS service curve, with the wrong service rate (sendSlope) and the wrong relation between the corresponding rate and the maximum credit; and it again ignores the output arrival curve.
GPT-5 output:
F0: 714.65, F1: 821.79, F2: 171.48, F3: 687.17, F4: 907.26, F5: 171.48, F6: 183.9, F7: 453.91, F8: 611.08, F9: 181.3, F10: 611.08, F11: 321.35, F12: 1124.12, F13: 483.67, F14: 172.43, F15: 489.26, F16: 521.33, F17: 918.73, F18: 181.3, F19: 624.12

P161Block 203

Model Reasoning and Explanation Output:
$$\mathrm{sendSlope} = -25~\mathrm{Mbps},$$
$$\text{Arrival curve:}\quad b = \sum_{\forall F_{i} \,|\, \text{share output port}} L_{i}~\mathrm{bits},$$
$$r = \sum_{\forall F_{i} \,|\, \text{share output port}} \frac{L_{i}}{p_{i}}~\mathrm{bits}/\mu\mathrm{s},$$
$$\alpha(t) = b + rt,$$
$$T_{\mathrm{tech,total}} = T_{\mathrm{prop}} + T_{\mathrm{switching}} + T_{\mathrm{sync}} = hops \times 1~\mu\mathrm{s} + (hops - 1) \times 1~\mu\mathrm{s} + (hops - 1) \times 1~\mu\mathrm{s} = 13~\mu\mathrm{s},$$
$$WCD_{\mathrm{total}} = WCD_{\mathrm{sum}} + T_{\mathrm{tech,total}} = 701.65~\mu\mathrm{s} + 13~\mu\mathrm{s} = 714.65~\mu\mathrm{s}.$$
Expert Explanation:
GPT uses the correct arrival and service curves. Its main error is that it fails to include the output port (output arrival curves) in the analysis.
Ministral 3 8B output:
F0: 1250.0, F1: 1100.0, F2: 500.0, F3: 1500.0, F4: 600.0, F5: 1100.0, F6: 2500.0, F7: 1200.0, F8: 500.0, F9: 500.0, F10: 2000.0, F11: 500.0, F12: 1000.0, F13: 1000.0, F14: 2000.0, F15: 500.0, F16: 1500.0, F17: 1000.0, F18: 500.0, F19: 2000.0
Model Reasoning and Explanation Output:
$$\text{Bandwidth:}\quad R = 100~\mathrm{Mbps} = 12.5~\mathrm{MB/s},$$
$$\text{Propagation delay:}\quad T_{\mathrm{prop}} = 1~\mu\mathrm{s} \quad \text{per hop},$$
$$\text{Switching delay:}\quad T_{\mathrm{switch}} = 1~\mu\mathrm{s} \quad \text{per switch},$$
$$\text{Time synchronization error:}\quad T_{\mathrm{sync}} = 1~\mu\mathrm{s},$$
$$\text{IdleSlope:}\quad \mathrm{IdleSlope} = 0.75R = 75~\mathrm{Mbps} = 9.375~\mathrm{MB/s},$$
$$\text{Cut-through switch:}\quad T_{\mathrm{buffer}} = 0,$$
$$\text{Payload overhead:}\quad L_{\mathrm{overhead}} = 18~\mathrm{B} + 4~\mathrm{B} = 22~\mathrm{B},$$
$$\text{Maximum frame size:}\quad L_{\max} = 1433~\mathrm{B} + 22~\mathrm{B} = 1455~\mathrm{B},$$
$$\text{Maximum transmission time:}\quad T_{\mathrm{tx,max}} = \frac{1455 \times 8}{100~\mathrm{Mbps}} = 9.24~\mu\mathrm{s}.$$

P167Block 209

$$\mathrm{SendSlope} = \mathrm{IdleSlope}\,(1 - \mathrm{IdleSlope}) = 9.375~\mathrm{MB/s} \times 0.25 = 2.34375~\mathrm{MB/s},$$
$$\text{Burst size:}\quad B_{\mathrm{burst},i} = \mathrm{SendSlope} \cdot p_{i},$$
$$\text{CBS service curve:}\quad S(t) = \mathrm{SendSlope}\left(t - \frac{B_{\mathrm{burst}}}{\mathrm{SendSlope}}\right), \quad t \geq \frac{B_{\mathrm{burst}}}{\mathrm{SendSlope}}.$$
WCD Calculation:
$$\text{Arrival curve for flow } F_{x}:\quad R(t) = \min\left\{B_{x},\; S_{x}t\right\},$$
$$B_{x} = L_{\mathrm{payload},x} + L_{\mathrm{overhead}},$$
$$S_{x} = \frac{8L_{\mathrm{payload},x}}{p_{x}},$$
$$\text{Service curve for each port:}\quad S(t) = \mathrm{SendSlope}\left(t - \frac{B_{\mathrm{burst}}}{\mathrm{SendSlope}}\right), \quad t \geq \frac{B_{\mathrm{burst}}}{\mathrm{SendSlope}},$$
$$WCD = \max\left(0,\; \frac{B_{\mathrm{burst}} - \mathrm{SendSlope} \cdot Period}{\mathrm{SendSlope}}\right) + T_{\mathrm{prop}} + T_{\mathrm{switch}} + T_{\mathrm{sync}}.$$
Expert Explanation:
This small model delivers a poor result. Similar to what was observed with Grok 4.1 Fast (Non-Reasoning), it fails to account for the arrival curves of the flows and uses the wrong CBS service curve. In addition, it analyzes only a single queue, instead of computing the delay at the output port of each device along the end-to-end path. Moreover, it introduces unit inconsistencies, such as dividing a frame size expressed in bytes by a link speed given in bits/μs, which leads to incorrect results.
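The expert explanations above refer to the horizontal deviation between an aggregate arrival curve and a service curve, and to keeping units consistent. The sketch below shows the textbook Network Calculus bound for a single node: with a token-bucket arrival curve $\alpha(t) = b + rt$ and a rate-latency service curve $\beta(t) = R\,\max(0, t - T)$, the horizontal deviation is $T + b/R$ when $r \leq R$. It is not the verified NC tool used for the ground truth. The service rate and latency are passed in as inputs rather than derived from the CBS parameters, and the frame sizes, periods, and latency are hypothetical values:

```python
def mbps_to_bits_per_us(rate_mbps):
    """1 Mbps = 1e6 bits per 1e6 microseconds = 1 bit per microsecond."""
    return rate_mbps


def delay_bound_us(burst_bits, arrival_rate, service_rate, latency_us):
    """Horizontal deviation between b + r*t and R*max(0, t - T).

    Rates are in bits per microsecond, so the result is in microseconds.
    """
    if arrival_rate > service_rate:
        raise ValueError("arrival rate exceeds service rate: no finite bound")
    return latency_us + burst_bits / service_rate


# Hypothetical flows sharing one queue: frame sizes in bytes, periods in us.
frame_bytes = [1000, 500]
periods_us = [2500, 1000]

# Aggregate arrival curve: convert bytes to bits first, then sum over flows.
burst_bits = sum(8 * size for size in frame_bytes)
arrival_rate = sum(8 * size / period for size, period in zip(frame_bytes, periods_us))

bound_us = delay_bound_us(
    burst_bits, arrival_rate, mbps_to_bits_per_us(75.0), latency_us=10.0
)
print(bound_us)  # 170.0
```

A full end-to-end analysis would repeat this per hop and propagate the output arrival curve to the next hop, which this single-node sketch leaves out.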

P171Block 213

Test Case: TC1
TSN mechanism: CQF
You are an expert Time-Sensitive Networking (TSN) orchestrator. Your task is to calculate the worst case delay (WCD) for each TSN flow.
Input:
Network Topology (TC1_topo.txt)
Flow Information (TC1_flows.txt)
Routing of the Flow (TC1_route.txt)
Constant: Bandwidth link = 100 Mbps; Propagation delay = 1 μs; Switching delay = 1 μs; Time synchronization error = 1 μs; The switches of the network are cut-through switches.
TSN Mechanism: Only Cyclic Queuing and Forwarding (CQF, IEEE 802.1Qch) is allowed; All flows are TT, PCP = 7, using queue 7 (odd) and 6 (even) only.
Task:
1. Map each egress port’s queues and collect the set of flows traversing that port, using the given topology, flows, and route of the flow.
2. For the entire network, use the given cycle duration and compute the Hypercycle.
3. For each flow, set the offset or the start time of the flow from the sending node as 0.
4. Calculate the worst case delay (WCD) in microseconds (μs) for each flow.
5. Provide the confidence score between 0.0 and 1.0 from your answers. 1.0 means mathematically or procedurally provable from given info with zero ambiguity. 0.0 means zero confidence.

P177Block 219

Claude Sonnet’s output: F0: 257.72, F1: 206.8, F2: 105.096, F3: 218.704, F4: 253.904, F5: 104.008, F6: 103.808, F7: 153.52, F8: 119.408, F9: 119.752, F10: 122.896, F11: 118.464, F12: 222.04, F13: 156.552, F14: 103.808, F15: 156.576, F16: 155.68, F17: 256.616, F18: 101.696, F19: 208.328

Model Reasoning and Explanation Output:

\[
\begin{aligned}
\text{Cycle duration} &= 50~\mu\mathrm{s},\\
\text{LCM of all periods and cycle:}\quad \mathrm{LCM}(1000, 2500, 5000, 50) &= 5000~\mu\mathrm{s},\\
\text{Hypercycle} &= 5000~\mu\mathrm{s}.
\end{aligned}
\]

\[
\begin{aligned}
\text{WCD equation:}\quad WCD &= \sum_{\mathrm{hops}}\left(T_{\mathrm{trans}}+T_{\mathrm{prop}}+T_{\mathrm{sw}}+T_{\mathrm{sync}}\right)+T_{\mathrm{buffering}},\\
T_{\mathrm{trans}} &= \frac{\left(\mathrm{Payload}+\mathrm{Ethernet\_overhead}\right)\cdot 8}{R},\\
T_{\mathrm{prop}} &= 1~\mu\mathrm{s}\quad\text{per link},\\
T_{\mathrm{sw}} &= 1~\mu\mathrm{s}\quad\text{per switch},\\
T_{\mathrm{sync}} &= 1~\mu\mathrm{s}\quad\text{per hop},\\
T_{\mathrm{buffering}} &= \left(\mathrm{hop\_count}-1\right)\cdot T_{\mathrm{cycle}}
\end{aligned}
\]

In CQF, frames may wait up to one cycle duration at each intermediate switch if they arrive just after the transmission window closes. For cut-through with CQF, worst case adds \((N-1)*\mathrm{cycle\_duration}\) where \(N\) is hop count.

\[
\begin{aligned}
\text{Hop count} &= 5,\\
\text{Payload\_bytes} &= 965~\mathrm{bytes},\\
\text{Frame\_size\_bytes} &= 1003~\mathrm{bytes},\\
\text{Transmission\_delay} &= 80.24~\mu\mathrm{s},\\
\text{Total propagation delay} &= 5~\mu\mathrm{s},\\
\text{Total switching delay} &= 4~\mu\mathrm{s},\\
\text{Total synchronization error} &= 5~\mu\mathrm{s},\\
\text{CQF buffering} &= 200~\mu\mathrm{s},\\
\text{WCD} &= 257.72~\mu\mathrm{s}.
\end{aligned}
\]

Ground truth WCD:

\[
\begin{aligned}
\text{Route hop count of }f_{1} &: 4,\quad T=50~\mu\mathrm{s},\\
WCD_{f_{1}} &= f_{1}\cdot\phi+(SW_{\mathrm{num}}+1)\,T+\xi\\
&= (3+1)\times 50+5\\
&= 205~\mu\mathrm{s}.
\end{aligned}
\]

Expert Explanation: The model knows the working mechanism of CQF. However, the model calculated the wrong Hypercycle by including cycle duration along with the periods. Furthermore, the CQF WCD explanation given by the model is wrong.
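The ground-truth expression above can be checked numerically with a short sketch. The reading of the symbols is an assumption: \(SW_{\mathrm{num}}\) is taken as the number of switches on the route (3 for a route hop count of 4), the offset term \(f_{1}\cdot\phi\) is taken as 0 because the task sets every flow offset to 0, and \(\xi\) is used as a lumped constant of 5 μs exactly as it appears in the numbers above. The function and parameter names are hypothetical.

```python
def cqf_wcd_us(num_switches: int, cycle_us: float,
               offset_term_us: float = 0.0, xi_us: float = 0.0) -> float:
    """Evaluate WCD = offset term + (SW_num + 1) * T + xi, in microseconds."""
    return offset_term_us + (num_switches + 1) * cycle_us + xi_us


# Values copied from the ground-truth line for f1:
# 3 switches, T = 50 us, offset term 0 (assumed), xi = 5 us.
wcd_f1 = cqf_wcd_us(num_switches=3, cycle_us=50, offset_term_us=0, xi_us=5)
print(f"WCD(f1) = {wcd_f1} us")
```

In this form the bound depends on the cycle duration and the number of switches rather than on the frame's transmission time, which is the structural difference from the per-hop sum that the model proposed.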

P185Block 227

GPT-4o output: F0: 260.0, F1: 210.0, F2: 60.0, F3: 210.0, F4: 260.0, F5: 60.0, F6: 60.0, F7: 110.0, F8: 60.0, F9: 60.0, F10: 60.0, F11: 60.0, F12: 210.0, F13: 110.0, F14: 60.0, F15: 110.0, F16: 110.0, F17: 260.0, F18: 60.0, F19: 210.0

Model Reasoning and Explanation Output:

\[
\begin{aligned}
\text{Hypercycle} &= 100~\mu\mathrm{s},\\
\mathrm{WCD} &= \left(\mathrm{hop\_count}\times\left(\mathrm{cycle\_duration}+T_{\mathrm{switch}}+T_{\mathrm{prop}}+T_{\mathrm{sync}}\right)\right)+\frac{\mathrm{Payload}}{\mathrm{Bandwidth}}
\end{aligned}
\]

Each flow’s path, cycle duration, hypercycle, and CQF parameters are detailed, along with the equation used for WCD calculation. No missing inputs are identified, but assumptions are made based on typical network configurations.

Expert Explanation: The Hypercycle calculated by the model is wrong; moreover, the WCD equation used by the model contains an extra, incorrect component \(\frac{\mathrm{Payload}}{\mathrm{Bandwidth}}\). One of the most interesting findings is that all the models provide very different WCD equations across different runs and different test cases.

Ministral 3 8B output: F0: 1160.0, F1: 1020.0, F2: 100.0, F3: 1160.0, F4: 1260.0, F5: 100.0, F6: 100.0, F7: 100.0, F8: 100.0, F9: 100.0, F10: 100.0, F11: 100.0, F12: 1160.0, F13: 100.0, F14: 100.0, F15: 100.0, F16: 100.0, F17: 1260.0, F18: 100.0, F19: 1160.0

Model Reasoning and Explanation Output:

\[
\begin{aligned}
\text{Hypercycle:}\quad H &= 10000~\mu\mathrm{s},\\
\text{Max frame size} &= 1526~\mathrm{bytes} = 1500+26~\mathrm{bytes},\\
\text{Max transmission time} &= 12.208~\mu\mathrm{s},\\
\text{Max queuing delay} &= 48~\mu\mathrm{s},\\
WCD &= h\left(T_{\mathrm{prop}}+T_{\mathrm{switch}}+T_{\mathrm{sync}}\right)+T_{\mathrm{tx,max}}\left\lceil\frac{\mathrm{payload}}{\mathrm{max\_frame\_size}}\right\rceil+T_{\mathrm{queue,max}}(h-1),
\end{aligned}
\]

P193Block 235

\[
\begin{aligned}
\text{Hop count:}\quad h &= 5,\\
\text{Propagation + switching + sync} &= 5\times(1+1+1) = 15~\mu\mathrm{s},\\
\text{Transmission time} &= 12.208\times\left\lceil\frac{2500}{1526}\right\rceil = 12.208\times 2 = 24.416~\mu\mathrm{s},\\
\text{Queueing delay} &= 48\times(5-1) = 192~\mu\mathrm{s},\\
\text{Total} &= 15+24.416+192 = 231.416~\mu\mathrm{s},\\
\text{Adjusted worst-case} &= 1160~\mu\mathrm{s}.
\end{aligned}
\]

Expert Explanation: The WCD equation provided by the model is wrong. Although the model computes the hop count and accounts for the delays accumulated at each hop along the route, it misses the most crucial part of the WCD equation, which is the cycle duration. Furthermore, the two components of the WCD equation considered by the model, \(T_{\mathrm{tx,max}}\left\lceil\frac{\mathrm{payload}}{\mathrm{max\_frame\_size}}\right\rceil\) and \(T_{\mathrm{queue,max}}(h-1)\), are entirely hallucinated. These two components are the main contributors to the large WCD values produced by this model.
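The model's own arithmetic can be replayed in a few lines. This sketch only reproduces the terms the model stated (which the expert explanation judges to be wrong); it is not a valid CQF analysis. The function and parameter names are hypothetical, and the inputs are the numbers quoted above.

```python
import math


def model_stated_total_us(hops: int, payload_bytes: int, max_frame_bytes: int,
                          t_tx_max_us: float, t_queue_max_us: float,
                          per_hop_const_us: float = 3.0) -> float:
    """Replay the three terms of the model's (incorrect) WCD expression."""
    fixed = hops * per_hop_const_us                                # prop + switch + sync
    tx = t_tx_max_us * math.ceil(payload_bytes / max_frame_bytes)  # model's transmission term
    queue = t_queue_max_us * (hops - 1)                            # model's queueing term
    return fixed + tx + queue


total = model_stated_total_us(hops=5, payload_bytes=2500, max_frame_bytes=1526,
                              t_tx_max_us=12.208, t_queue_max_us=48.0)
reported = 1160.0  # the "adjusted worst-case" value the model returned

print(f"stepwise total: {total:.3f} us, reported: {reported} us")
```

The stepwise total and the reported value differ, so the final answer does not follow from the model's own derivation.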

P197Block 240

To understand the nature of WCD computation failures, we identify five distinct failure modes observed across models and mechanisms.

P198Block 242

The model returns WCD = 0 for all flows, producing a structurally valid JSON response but with no computational content. This failure mode affects GPT-4o and DeepSeek-V3.2 (Non-thinking) on CBS, and Llama 3.2 1B across all test cases for CBS and CQF. This suggests these models recognize the output format requirement but cannot engage with the underlying NC computation or any reasoning behind the WCD calculation.

P199Block 244

The model produces valid WCD values for fewer than 80% of flows in a given TC, resulting in incomplete coverage. This affects Mistral Large 3 on CBS and Llama 3.3 on CBS, suggesting these models lose track of flow indexing in large topologies.

P200Block 246

The model cannot process the full open-ended prompt due to context window limitations or API timeout. This affects Qwen3 8B (API timeout across all TCs) and Llama 3.2 1B (context limit exceeded), confirming that small models are structurally unsuited for TSN open-ended evaluation.

P201Block 248

The model returns an empty response for all open-ended test cases, regardless of network topology or flow count. This failure mode exclusively affects DeepSeek-V3.2 (Thinking), which produces no output, neither WCD values nor intermediate reasoning, across all evaluated topologies, including one-switch, medium-mesh, and ring configurations, and across all flows, for both CBS and CQF mechanisms.
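The four failure modes described above could be detected automatically along the following lines. This is a sketch under stated assumptions: the label strings, the function signature, and the `errored` flag (standing in for a context-limit or API-timeout error) are hypothetical, and only the 80% coverage threshold comes from the text.

```python
import math
from typing import Mapping, Optional, Sequence


def classify_response(wcd_by_flow: Optional[Mapping[str, float]],
                      expected_flows: Sequence[str],
                      errored: bool = False,
                      coverage_threshold: float = 0.8) -> str:
    """Assign one open-ended response to a failure mode (labels are hypothetical)."""
    if errored:
        return "context_or_timeout"   # prompt could not be processed at all
    if not wcd_by_flow:
        return "empty_response"       # no WCD values were returned
    # Keep only finite numeric values for flows that were actually requested.
    valid = {
        flow: value for flow, value in wcd_by_flow.items()
        if flow in expected_flows
        and isinstance(value, (int, float))
        and math.isfinite(value)
    }
    if valid and all(value == 0 for value in valid.values()):
        return "all_zero"             # well-formed output, no computational content
    if len(valid) / len(expected_flows) < coverage_threshold:
        return "partial_coverage"     # fewer than 80% of flows have a valid WCD
    return "ok"


flows = ["F0", "F1", "F2", "F3", "F4"]
print(classify_response({f: 0.0 for f in flows}, flows))
print(classify_response({"F0": 205.0}, flows))
```

The checks are ordered from the least to the most informative response, so a response that errored out is never mistaken for an all-zero or partial answer.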