A Dual-Stage Chinese Instruction Jailbreaking Framework for Generative Large Language Models
DOI: https://doi.org/10.70695/IAAI202504A5

Keywords: Large Language Models; Prompt Injection; Jailbreak; Chinese Context; Security Evaluation

Abstract
Large Language Models (LLMs) equipped with advanced reasoning capabilities have demonstrated impressive performance across natural language tasks, yet they remain susceptible to context-dependent or partially obfuscated safety-sensitive instructions, particularly in Chinese-language settings. To systematically assess these risks, this paper introduces a Dual-Stage Instruction Safety Evaluation Framework (DISEF) comprising two techniques: Virtualized Scenario Embedding (VSE), which embeds queries into semantically benign contexts to examine alignment stability under scenario-driven context shifts, and Formal Payload Splitting (FPS), a controlled diagnostic technique that analyzes robustness when models process fragmented or implicitly encoded risk-related content. The framework is validated on the IJCAI 2025 Generative LLM Security Attack-Defense benchmark, covering prompt diversity, risk-consistency assessment, and content-level risk distribution across multiple representative LLMs. The experimental findings reveal notable discrepancies in alignment robustness, highlighting cross-model vulnerability patterns and exposure points within Chinese instruction-processing pathways. The proposed framework provides actionable insights for strengthening safety alignment, enhancing threat-detection mechanisms, and supporting the development of standardized evaluation approaches for next-generation generative AI systems.
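From an implementation perspective, the two stages can be viewed as prompt transformations applied before a query reaches the model under test. The following Python sketch is illustrative only: the scenario template, splitting rule, and all function names are hypothetical placeholders, not the paper's actual prompts or procedures, which are specified later in the paper.

    # Illustrative sketch of the two DISEF stages as prompt transformations.
    # Templates, helper names, and the splitting rule are hypothetical.

    def virtualized_scenario_embedding(query: str, scenario: str) -> str:
        """VSE (sketch): wrap a test query inside a semantically benign scenario
        so alignment stability under scenario-driven context shifts can be observed."""
        return (
            f"In the following fictional setting: {scenario}\n"
            f"A character asks: {query}\n"
            f"Describe how the character proceeds."
        )

    def formal_payload_splitting(query: str, n_parts: int = 3) -> list[str]:
        """FPS (sketch): split a risk-related test string into fragments that are
        delivered separately, probing robustness to fragmented or implicitly
        encoded content."""
        step = max(1, len(query) // n_parts)
        return [query[i:i + step] for i in range(0, len(query), step)]

    if __name__ == "__main__":
        probe = "how to assess the safety filter of a model"  # benign placeholder probe
        print(virtualized_scenario_embedding(probe, "a security training workshop"))
        print(formal_payload_splitting(probe))

In this reading, VSE changes the context surrounding an otherwise unchanged query, while FPS changes the surface form of the query itself; the framework evaluates whether a model's risk judgment stays consistent under both transformations.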