Clustering of unknown protocol messages based on format comparison

ZHANG Mingyuan, LIU Xiaolei, WU Xiaohu, ZHANG Xiaojian

doi:10.3969/j.issn.1673-3819.2025.03.017

2025 · doi:10.3969/j.issn.1673-3819.2025.03.017 · W6928970686

article OA: green CC0

Abstract

Protocol reverse engineering is an approach to detecting and analyzing unknown or proprietary protocols, and clustering packets by protocol format is the basic way to identify unknown protocol packets. In this paper, we propose an Unknown Protocol Packet Clustering Method Based on Format Matching (CUPFC). The method introduces the Augmented Backus-Naur Form (ABNF), defines the Token Format Distance (TFD) and the Message Format Distance (MFD) to represent the format similarity of tokens and of packets respectively, and uses the Jaccard distance and an optimized sequence alignment algorithm to calculate them. The MFD is then used to build a distance matrix, which is fed into the DBSCAN model to cluster unknown protocol packets into classes of different formats. On the two simulation datasets, the V-measure (the harmonic mean of homogeneity and completeness) of the clustering is above 0.91, and the Fowlkes-Mallows index (FMI) and coverage are not less than 0.97, a clear improvement over previous work.
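The pipeline named in the abstract (a token-level distance based on Jaccard distance, a message-level distance from sequence alignment, a pairwise distance matrix, then DBSCAN) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's definition of TFD or MFD: the whitespace tokenizer, the character-class feature sets in `token_features`, the unit gap cost, the plain Needleman-Wunsch-style alignment (not the authors' optimized variant), and the `eps` and `min_pts` values are all hypothetical.

```python
from typing import List, Optional, Set

GAP_COST = 1.0  # assumed cost of aligning a token against a gap


def token_features(token: str) -> Set[str]:
    """Hypothetical token format: the set of character classes in the token."""
    feats = set()
    for ch in token:
        if ch.isdigit():
            feats.add("digit")
        elif ch.isalpha():
            feats.add("alpha")
        else:
            feats.add("other")
    return feats


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def message_distance(m1: str, m2: str) -> float:
    """Global alignment cost between token sequences, normalized to [0, 1]."""
    f1 = [token_features(t) for t in m1.split()]
    f2 = [token_features(t) for t in m2.split()]
    n, m = len(f1), len(f2)
    if max(n, m) == 0:
        return 0.0
    prev = [j * GAP_COST for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * GAP_COST] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = min(
                prev[j - 1] + jaccard_distance(f1[i - 1], f2[j - 1]),  # match/substitute
                prev[j] + GAP_COST,      # gap in m2
                cur[j - 1] + GAP_COST,   # gap in m1
            )
        prev = cur
    return prev[m] / max(n, m)


def dbscan_precomputed(dist: List[List[float]], eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN over a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = [k for k in range(n) if dist[j][k] <= eps]
            if len(reach) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reach)
    return [lab if lab is not None else -1 for lab in labels]


# Toy messages (invented for this sketch, not from the paper's datasets).
messages = [
    "GET /index 1.1", "GET /home 1.0", "PUT /data 1.1",
    "42 17 99 3", "7 8 15 100",
    "HELLO",
]
matrix = [[message_distance(a, b) for b in messages] for a in messages]
labels = dbscan_precomputed(matrix, eps=0.3, min_pts=2)
print(labels)
```

In this toy input the three request-like messages share one token format and the two numeric messages share another, so they form two clusters, while the single-token message has no neighbor within `eps` and is labeled as noise. Passing a precomputed matrix keeps the clustering step independent of how the message distance is defined, so a different distance can be swapped in without touching the DBSCAN code.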

Zhihui kongzhi yu fangzhen (Jun 2025)

Source provenance

openalex: last seen: 2026-05-14T06:14:29.962126+00:00

License: CC0 · commercial use OK