Publications

Title Can Adversarial Attacks by Large Language Models be Attributed?
Authors Manuel Cebrian, Andrés Abeliuk, Jan Arne Telle
Publication date 2026
Abstract Attributing outputs from Large Language Models (LLMs) in
adversarial settings--such as cyberattacks and disinformation
campaigns--presents significant challenges that are likely to grow in
importance. We approach this attribution problem from both a theoretical and
empirical perspective, drawing on formal language theory (identification in
the limit) and data-driven analysis of the expanding LLM ecosystem. By
modeling an LLM's set of possible outputs as a formal language, we analyze
whether finite samples of text can uniquely pinpoint the originating model.
Our results show that under mild assumptions of overlapping capabilities
among models, certain classes of LLMs are fundamentally non-identifiable
from their outputs alone. We delineate four regimes of theoretical
identifiability: (1) an infinite class of deterministic (discrete) LLM
languages is not identifiable (Gold's classical result from 1967); (2) an
infinite class of probabilistic LLMs is also not identifiable (by extension
of the deterministic case); (3) a finite class of deterministic LLMs is
identifiable (consistent with Angluin's tell-tale criterion); and (4) even
a finite class of probabilistic LLMs can be non-identifiable (we provide a
new counterexample establishing this negative result). Complementing these
theoretical insights, we quantify the explosion in the number of plausible
model origins (hypothesis space) for a given output in recent years. Even
under conservative assumptions (each open-source model fine-tuned on at most
one new dataset), the count of distinct candidate models doubles
approximately every 0.5 years, and allowing multi-dataset fine-tuning
combinations yields doubling times as short as 0.28 years. This
combinatorial growth, alongside the extraordinary computational cost of
brute-force likelihood attribution across all models and potential users,
renders exhaustive attribution infeasible in practice. Our findings
highlight an urgent need for new strategies and proactive governance to
mitigate risks posed by unattributable, adversarial use of LLMs as their
influence continues to expand.
Article number e000008
Volume 3
Journal name PLOS Complex Systems
Publisher Public Library of Science (PLOS)
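
As background for the four identifiability regimes listed in the abstract, the following is a brief recap of the standard definitions from Gold (1967) and Angluin (1980). This is textbook material rather than an excerpt from the paper, and effectivity conditions on the tell-tale sets are omitted.

    \[
    \mathcal{L} \text{ is identifiable in the limit from positive data}
    \;\Longleftrightarrow\;
    \forall L \in \mathcal{L}\ \exists\, \text{finite } T_L \subseteq L :\;
    \nexists\, L' \in \mathcal{L} \text{ with } T_L \subseteq L' \subsetneq L .
    \]

Gold's negative result for the deterministic case follows directly: a class containing every finite language together with at least one infinite language L admits no finite tell-tale for L, since any finite T_L \subseteq L is itself a member of the class properly contained in L.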
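The abstract's doubling times (roughly 0.5 years for single-dataset fine-tunes and 0.28 years when multi-dataset combinations are allowed) translate into the exponential projection sketched below. The starting count of 1,000 models and the three-year horizon are hypothetical illustrations, not figures from the paper; only the doubling times come from the abstract.

    def candidate_models(n0: float, years: float, doubling_time: float) -> float:
        """Project a candidate-model count under exponential growth.

        n0            -- hypothetical starting number of candidate models
        years         -- projection horizon in years
        doubling_time -- doubling time in years (0.5 and 0.28 per the abstract)
        """
        return n0 * 2 ** (years / doubling_time)

    # Illustrative three-year projection from a hypothetical base of 1,000 models.
    for t_d in (0.5, 0.28):
        per_year = 2 ** (1 / t_d)  # annual multiplication factor implied by the doubling time
        n = candidate_models(1_000, 3.0, t_d)
        print(f"doubling time {t_d:.2f} yr: x{per_year:.1f} per year, "
              f"~{n:,.0f} candidates after 3 years")

A doubling time of 0.5 years corresponds to roughly a fourfold increase in candidate models per year, and 0.28 years to roughly a twelvefold increase, which is the combinatorial growth the abstract argues makes brute-force likelihood attribution infeasible in practice.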