Publications

Title Can Adversarial Attacks by Large Language Models be Attributed?
Authors Manuel Cebrian, Andrés Abeliuk, Jan Arne Telle
Publication date 2026
Abstract Attributing outputs from Large Language Models (LLMs) in
adversarial settings--such as cyberattacks and disinformation
campaigns--presents significant challenges that are likely to grow in
importance. We approach this attribution problem from both a theoretical and
empirical perspective, drawing on formal language theory (identification in
the limit) and data-driven analysis of the expanding LLM ecosystem. By
modeling an LLM's set of possible outputs as a formal language, we analyze
whether finite samples of text can uniquely pinpoint the originating model.
Our results show that under mild assumptions of overlapping capabilities
among models, certain classes of LLMs are fundamentally non-identifiable
from their outputs alone. We delineate four regimes of theoretical
identifiability: (1) an infinite class of deterministic (discrete) LLM
languages is not identifiable (Gold's classical result from 1967); (2) an
infinite class of probabilistic LLMs is also not identifiable (by extension
of the deterministic case); (3) a finite class of deterministic LLMs is
identifiable (consistent with Angluin's tell-tale criterion); and (4) even
a finite class of probabilistic LLMs can be non-identifiable (we provide a
new counterexample establishing this negative result). Complementing these
theoretical insights, we quantify the explosion in the number of plausible
model origins (hypothesis space) for a given output in recent years. Even
under conservative assumptions (each open-source model fine-tuned on at most
one new dataset), the count of distinct candidate models doubles
approximately every 0.5 years, and allowing multi-dataset fine-tuning
combinations yields doubling times as short as 0.28 years. This
combinatorial growth, alongside the extraordinary computational cost of
brute-force likelihood attribution across all models and potential users,
renders exhaustive attribution infeasible in practice. Our findings
highlight an urgent need for new strategies and proactive governance to
mitigate risks posed by unattributable, adversarial use of LLMs as their
influence continues to expand.
Article number e000008
Volume 3
Journal name PLOS Complex Systems
Publisher Public Library of Science (PLOS)
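
As background for the four identifiability regimes listed in the abstract, the following is a brief recap of the standard definitions from Gold (1967) and Angluin (1980). This is textbook material rather than an excerpt from the paper, and effectivity conditions on the tell-tale sets are omitted.

    \[
    \mathcal{L} \text{ is identifiable in the limit from positive data}
    \;\Longleftrightarrow\;
    \forall L \in \mathcal{L}\ \exists\, \text{finite } T_L \subseteq L :\;
    \nexists\, L' \in \mathcal{L} \text{ with } T_L \subseteq L' \subsetneq L .
    \]

Gold's negative result for the deterministic case follows directly: a class containing every finite language together with at least one infinite language L admits no finite tell-tale for L, since any finite T_L \subseteq L is itself a member of the class properly contained in L.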
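The abstract's doubling times (roughly 0.5 years for single-dataset fine-tunes and 0.28 years when multi-dataset combinations are allowed) translate into the exponential projection sketched below. The starting count of 1,000 models and the three-year horizon are hypothetical illustrations, not figures from the paper; only the doubling times come from the abstract.

    def candidate_models(n0: float, years: float, doubling_time: float) -> float:
        """Project a candidate-model count under exponential growth.

        n0            -- hypothetical starting number of candidate models
        years         -- projection horizon in years
        doubling_time -- doubling time in years (0.5 and 0.28 per the abstract)
        """
        return n0 * 2 ** (years / doubling_time)

    # Illustrative three-year projection from a hypothetical base of 1,000 models.
    for t_d in (0.5, 0.28):
        per_year = 2 ** (1 / t_d)  # annual multiplication factor implied by the doubling time
        n = candidate_models(1_000, 3.0, t_d)
        print(f"doubling time {t_d:.2f} yr: x{per_year:.1f} per year, "
              f"~{n:,.0f} candidates after 3 years")

A doubling time of 0.5 years corresponds to roughly a fourfold increase in candidate models per year, and 0.28 years to roughly a twelvefold increase, which is the combinatorial growth the abstract argues makes brute-force likelihood attribution infeasible in practice.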