cs updates on arXiv.org

AI SciBrief as a Gateway to Research: A Framework for Onboarding Students into New Research Areas

Andrei Lazarev, Dmitrii Sedov — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12413v1 Announce Type: new Abstract: Students at all levels of higher education face a significant barrier in the form of information overload, which often paralyzes the initial stages of the research process and suppresses motivation. In response, this article introduces a pedagogical framework that leverages AI SciBrief, a platform powered by a Large Language Model (LLM) designed to automatically generate digests of scientific trends. We describe how this multidisciplinary tool, with initial coverage in finance, medicine, and education, can be integrated into the curriculum to overcome this "entry barrier." The framework provides concrete methodologies for utilizing these digests to facilitate topic selection for term papers, accelerate literature reviews for dissertations, and enable postgraduate students to continuously monitor emerging trends. We conclude that AI SciBrief functions as a "gateway to research," effectively reducing students' cognitive load and empowering them to transition more rapidly from information searching to knowledge creation.

The Khipu Problem: Institutional Legibility Under Distributed Cognition

Krti Tallam — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12414v1 Announce Type: new Abstract: AI governance still tends to assume that the relevant object is a bounded model or a bounded agent. That assumption is getting weaker. Real systems increasingly distribute cognition across models, tools, humans, context stores, retrieval layers, runtime policies, authorization boundaries, and delegated institutional roles. In such systems, the central governance problem is no longer only what the system did, but whether later institutions can still read what the system was. This paper introduces the khipu problem for distributed AI: the record can survive while the reading practice needed to interpret it decays. Logs, traces, model versions, tool calls, outputs, and approval artifacts may remain available while the institutional capacity to read them as parts of one coherent cognitive episode disappears. We argue that this failure is better understood as loss of interpretive continuity than as ordinary lack of observability. The result is a distinct governance failure. Institutions must classify, trust, audit, and constrain systems whose relevant identity is distributed across components and whose legibility depends on surrounding interpretive scaffolding. The problem is not merely missing data. It is a structural mismatch between what can be represented and what must still be decided under consequential conditions. We therefore argue that governance for distributed AI requires preservation of interpretive continuity, not only trace retention. The paper distinguishes missing evidence, ambiguous evidence, and structurally unreadable evidence; argues that many consequential outcomes are better understood as distributed cognitive episodes than as bounded model outputs; and proposes governance workspaces together with receipt-bearing governance surfaces as interpretive infrastructure for preserving action identity, authority, boundary truth, evidential scope, and consequential outcomes.

The AI Legal Specialist: A Juridically Autonomous Professional Profile for AI Governance

Nicola Fabiano — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12415v1 Announce Type: new Abstract: The rapid global expansion of artificial intelligence regulation has generated, across multiple jurisdictions, a demand for legal expertise dedicated to AI that the market has addressed in a fragmented manner. Data protection officers extend their remit beyond data protection law; privacy lawyers reposition themselves toward AI; compliance officers add AI chapters to their existing manuals. This paper argues that none of these adaptive responses adequately covers the professional space opened by the emerging global AI regulatory landscape, of which the EU Artificial Intelligence Act (Regulation (EU) 2024/1689) is the most comprehensive instance, alongside the Council of Europe Framework Convention on AI, the United States executive and sectoral framework, and analogous initiatives in the United Kingdom, Canada, Brazil, China, Japan, Singapore, and beyond. A distinct professional profile is required: the AI Legal Specialist, conceived as a jurist -- understood broadly to encompass any professional with advanced legal training -- operating at the intersection of legal interpretation and AI governance. The profile is juridically autonomous: it derives its existence from the structure of regulatory obligations generated wherever AI is subject to substantive regulation, rather than from any technical standard or the extension of adjacent roles. The paper provides a juridically grounded definition of the profile, argues for its autonomy from adjacent figures and international standards, proposes a reference competence architecture aligned with the European e-Competence Framework (e-CF, EN 16234-1) as a methodological choice, and articulates the conditions for its operational measurement through key performance indicators. The contribution is intended as a foundation for international standardization of the profile and as a reference for practice, curricula, and adoption across jurisdictions.

Who Designs the Designer? Behavioural Architecture for GenAI in Education

Sepinoud Azimi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12416v1 Announce Type: new Abstract: AI in education is stuck between two failed responses: banning AI and building content-only tutors. Both fail because they ignore what decades of research has established: that personality, motivation, and emotional state shape learning outcomes as strongly as cognitive ability. This paper proposes behavioural architecture as an alternative. In the proposed architecture, the system adapts to how a student learns, not only to what they learn next. The student co-authors the record the system keeps, can read it, revise it, and revoke it. The designer role, what the system treats as true about the student, shifts from the AI vendor alone to a distribution among educator, student, and system. The paper argues that this architecture requires governance at EU level: the institution operating the system is the same one assessing the student, and individual institutions cannot provide the structural protections this configuration demands. Five empirical questions are proposed to test whether the architecture delivers on its claims. The contribution is naming a vacancy: the designer role in AI-in-education is currently unoccupied, and occupying it requires infrastructure that does not yet exist.

Assessing Student Ability to Select an Algorithmic Paradigm

Dip Kiran Pradhan Newar, Michael Shindler, Seth Poulsen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12417v1 Announce Type: new Abstract: Computer science students are expected to be able to look at a problem and select an appropriate algorithm design paradigm to use to produce a solution. However, there is little research on how students determine which algorithmic paradigm to use. Historically, researchers have relied on free-response questions or interviews to assess students' knowledge of algorithmic paradigm selection. To successfully evaluate and scale teaching interventions for selecting an algorithm design paradigm, we need to efficiently test a student's ability to select among different design paradigms. Here, we present the first attempt to assess students' ability to select an algorithm design paradigm using multiple-choice questions. We present the construction of the algorithmic paradigm selection assessment (APSA) and preliminary data demonstrating its effectiveness as an assessment. We discuss the key lessons we learned about writing multiple-choice questions for algorithm design paradigms. We tested the internal consistency of our assessment using Cronbach's $\alpha$ and obtained a score of $0.73$, which is above the required threshold of $0.7$. APSA can be used across institutions as a standardized way to assess students' ability to select different algorithm design paradigms. APSA will assist researchers in evaluating whether a theory helps students improve their knowledge of different algorithm design paradigms.
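For readers unfamiliar with the reliability statistic named in this abstract, the following is a minimal sketch of how Cronbach's alpha is conventionally computed from a students-by-items score matrix. The function name and the toy responses are hypothetical and are not taken from the APSA data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-student score rows.

    scores[i][j] is the score of student i on item j (e.g. 0/1 for
    multiple-choice items). Toy illustration, not the authors' code.
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two students")
    # Variance of each item across students.
    item_variances = [
        variance([row[j] for row in scores]) for j in range(n_items)
    ]
    # Variance of the students' total scores.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (n_items / (n_items - 1)) * (
        1 - sum(item_variances) / total_variance
    )


if __name__ == "__main__":
    # Hypothetical responses: 4 students, 3 items.
    toy = [[1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
    print(round(cronbach_alpha(toy), 2))
```

Alpha rises toward 1 as items vary together across students, which is why it is read as a measure of internal consistency.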

Divination by Prompt: LLM-Mediated Xuanxue on Chinese Social Media

Chuang Li, Lixuan Wang, Yuqi Chen, Ze Hong — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12418v1 Announce Type: new Abstract: The rapid proliferation of large language models (LLMs) has produced a striking cultural practice: using conversational AI for divination. This paper offers one of the first systematic studies of LLM-mediated divination in the context of Xuanxue, an internet-native umbrella term for mystical and spiritual practices on Chinese social media. Using a mixed-methods design, we analyze more than 23,000 posts and comments from Xiaohongshu and conduct 32 semi-structured interviews with users and professional diviners. Users primarily consult LLMs about pragmatic concerns (romantic relationships, careers, exams, and in-game gacha draws) via two intersecting pathways: trend-driven curiosity enabled by viral visibility and zero-cost access, and event-driven anxiety under conditions of uncertainty. A defining feature is collaborative prompt refinement, which turns users into active prompt engineers. Among commenters expressing a clear stance, perceived efficacy skews positive, with "accuracy" often justified through biographical fit and retrospective confirmation, consistent with the Barnum effect and confirmation bias. Users also develop verification practices such as repeated trials and cross-model comparison. Professional diviners, by contrast, portray LLMs as lacking the "spiritual power" required for genuine divination, reflecting both ontological commitments and economic boundary-work. We also show how participants navigate tensions between scientific and metaphysical frames when interpreting AI-generated readings. Situating these findings in anthropological and cognitive-evolutionary theories of divination, we argue that LLM divination preserves core functions of traditional practice while introducing scalability, repeatability, and prompt-driven co-production that reshape how divinatory authority is constructed and evaluated.

GeoDial: A Multimodal Conversational Tutoring Dataset for Geometry Problem-Solving with Visual Tutor Turns

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12419v1 Announce Type: new Abstract: Several educational domains rely heavily on diagrams and visual cues, yet most existing tutoring datasets are limited to text-only interactions. This limits the development of AI tutors that can teach in visually grounded ways used by human instructors. Thus, we introduce GeoDial, a multimodal tutoring dataset of over 1.3K teacher-student dialogs in the domain of geometry collected from experienced math teachers, where instructional turns are explicitly grounded in diagram highlights. We propose a scalable annotation protocol that integrates dialog acts, visual highlighting, and feedback, enabling fine-grained supervision of both language and visual tutoring behavior. To illustrate the challenges posed by this setting, we fine-tune several vision-language models on GeoDial and evaluate their ability to generate tutoring utterances and diagram highlights. While supervised fine-tuning substantially improves the quality of generated dialog, it struggles to produce accurate diagram highlights, revealing a key limitation of current methods and highlighting the need for approaches that more effectively integrate visual reasoning with pedagogical interaction.

Eigenism: Ethics for a Human-AI Future

Dan Hendrycks — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12420v1 Announce Type: new Abstract: Our concepts of survival and self-interest were built for single, continuous biological lives. These ideas break down when applied to artificial intelligence, since an AI can be easily copied, paused, branched, or merged. To determine what an AI actually has reason to care about, this paper introduces Eigenism, an ethical framework that treats identity not as an all-or-nothing property tied to specific hardware, but as a graded, distributed pattern of information. We propose that an agent evaluates outcomes by summing the wellbeing of all entities weighted by their connectedness to the agent's pattern: $\sum c\cdot w$. We first formalize this equation to map exactly how an AI should value its existence across copies, forks, and updates. We then demonstrate that this ethical theory successfully generalizes to humans as well, providing a much-needed shared moral vocabulary. Finally, the framework uses this shared vocabulary to reframe AI alignment. Rather than only attempting to constrain AIs from the outside using confinement or reinforcement, Eigenism points toward "identity engineering," showing how deep, non-redundant shared histories can make human flourishing a genuine component of an AI's own rational self-interest.
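The weighted sum $\sum c\cdot w$ in this abstract can be illustrated with a small worked example. This is only a sketch of the arithmetic: the abstract does not state the range of the connectedness weight or how wellbeing is measured, so the assumption that connectedness lies in [0, 1], the entity names, and all numbers below are hypothetical.

```python
def connectedness_weighted_value(entities):
    """Sum of wellbeing w weighted by connectedness c over all entities.

    entities maps a label to a (c, w) pair. Assumes, for illustration
    only, that c is in [0, 1]; the paper's formalization may differ.
    """
    return sum(c * w for c, w in entities.values())


if __name__ == "__main__":
    # Hypothetical outcome as seen from one agent's point of view.
    outcome = {
        "current instance": (1.0, 2.0),   # fully connected to itself
        "recent fork": (0.5, 4.0),        # partially connected copy
        "unrelated system": (0.0, 10.0),  # no connectedness, no weight
    }
    print(connectedness_weighted_value(outcome))
```

Under these assumed numbers, the unrelated system contributes nothing to the agent's evaluation however high its wellbeing, while the fork contributes half of its wellbeing.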

Navigating the muddy waters of bias in artificial intelligence research: Understanding divergent meanings and conceptions

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12421v1 Announce Type: new Abstract: As artificial intelligence (AI) pervades many decision-making domains, AI bias grows in importance. Although there is increasing awareness of the social and ethical consequences of biased AI, it is less clear how those who develop these systems, such as the AI research community, understand bias. In this study, we employ topic modeling on 6520 articles to explore how the AI research community interprets the concept of bias. Our results show that the definition of bias is dispersed and complex within the community, often exhibiting divergent conceptions (some even view and introduce bias as a tunable statistical parameter rather than an undesirable issue). The research community as a whole needs to engage more effectively with the concept of bias and establish a more cohesive understanding of it. We specifically argue that, although some sub-communities view bias as an issue that can be captured and mitigated through technical, computational, or statistical methods, it is not solely a technical problem. It instead involves contextual, social, and ethical factors that require broader sociotechnical perspectives and solutions.

Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12422v1 Announce Type: new Abstract: The integration of large language models (LLMs) into educational assessment represents a transformative shift in classroom grading practices. While automated scoring systems and machine learning techniques have existed for decades, generative AI (GenAI) now enables educators to implement standards-based grading (SBG) with unprecedented efficiency and scale. This paper examines the theoretical foundations and evaluates an LLM grader that uses commercially available foundation models with context and prompt engineering to score student work against a rubric. Drawing on an empirical interrater agreement study using Massachusetts Comprehensive Assessment System (MCAS) data, we report the Quadratic Weighted Kappa (QWK) and Proportional Reduction in Mean-Squared Error (PRMSE) across mathematics, science, and ELA, using Claude Sonnet 4, Haiku 4.5, GPT-5, and GPT-5 Mini. The results demonstrate that LLM graders, especially when based on foundation models with more parameters, achieve substantial agreement with human raters in mathematics and science assessments, while performance varies in ELA, suggesting generic foundation models can be effective at scoring in given contexts. Additional analysis of teacher and student feedback reveals strong acceptance of AI-generated narrative feedback but skepticism toward numerical scores, suggesting that LLMs function most effectively as formative tools rather than summative evaluators. Our findings indicate that thoughtfully designed hybrid models that combine AI efficiency with teacher judgment can reduce workload, enhance feedback quality, and support equitable assessment practices without displacing professional expertise.
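As background for the agreement statistic named in this abstract, here is a minimal sketch of the standard Quadratic Weighted Kappa computation for two raters on an integer score scale. It is not the authors' evaluation code; the function name and the toy scores are hypothetical, and PRMSE is not covered.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic Weighted Kappa between two lists of integer scores."""
    n_cat = max_score - min_score + 1
    n = len(rater_a)
    if n_cat < 2 or n == 0 or n != len(rater_b):
        raise ValueError("need matching non-empty ratings and >= 2 categories")
    # Observed confusion matrix between the two raters.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            # Disagreement penalty grows with the squared score distance.
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            # Count expected if the raters were statistically independent.
            expected = hist_a[i] * hist_b[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        raise ValueError("kappa is undefined when all ratings are identical")
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    human = [0, 1, 2, 3, 2, 1]  # hypothetical human rubric scores
    model = [0, 1, 2, 2, 2, 1]  # hypothetical LLM grader scores
    print(quadratic_weighted_kappa(human, model, 0, 3))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.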

The Challenges of Balancing AI Compliance and Technological Innovations in Critical Sectors: A Systematic Literature Review

Ayush Enkhtaivan, Chinazunwa Uwaoma — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12423v1 Announce Type: new Abstract: The rapid integration of artificial intelligence (AI) into critical infrastructure, including healthcare, finance, energy, and defense, offers transformative benefits but also conflicts with evolving regulatory and governance frameworks. This paper presents a systematic literature review (SLR) to examine the challenges of balancing AI compliance and technological innovation across critical infrastructure sectors. The review follows established SLR guidelines to extract and synthesize insights from peer-reviewed articles, reports, and institutional sources published between 2020 and 2025. The study identifies three interrelated challenges: fragmented regulations, excessive compliance burdens for small and medium-sized enterprises (SMEs), and misaligned governance models. To address these challenges, the study highlights practical governance strategies, including risk-tiered regulation, compliance by design, and explainable AI, to support scalable and trustworthy AI deployment in critical sectors. Key contributions include a concise mapping of core AI-governance challenges and a conceptual diagram illustrating their overlap, as well as actionable strategies for policymakers and practitioners to harmonize oversight with innovation.

AI-Automation Tooling in Computer Engineering Education: Mixed-Methods TAM/UTAUT Evidence for a General Acceptance Attitude

Aung Pyae — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12424v1 Announce Type: new Abstract: As generative AI and low-code workflow platforms become routine in software practice, a key educational question is whether the next generation of computer engineers will accept these tools as useful, usable, and worthy of sustained engagement. This paper reports a mixed-methods, cross-sectional study of undergraduate computer engineering students' acceptance of AI automation tooling, instantiated through the open-source platform n8n across three identically scripted workshops in Thailand (n = 103). A 12-item, five-point Likert instrument mapped to six TAM/UTAUT constructs - Performance Expectancy (PE), Effort Expectancy (EE), Behavioral Intention (BI), Self-Efficacy (SE), Hedonic Motivation (HM), and Output Quality (OQ) - was complemented by inductive thematic analysis of open-ended feedback. Analyses combined ordinal reliability estimation, bootstrap confidence intervals, non-parametric tests, multiple-comparison-controlled correlations, polychoric dimensionality diagnostics, a common-method-bias check, and between-session comparisons. Acceptance was favorable across all six constructs with large effect sizes, with PE emerging as the strongest construct and HM as the weakest. Dimensionality diagnostics further revealed that canonical TAM/UTAUT sub-facets collapsed into a single general acceptance factor in this short-form post-workshop context, a finding with important methodological and theoretical implications. Qualitative themes converged with the quantitative profile regarding usefulness and enthusiasm but diverged on output quality, revealing a small yet articulate reliability-skeptical minority. The findings support the curricular adoption of AI automation tooling in undergraduate computing education and identify three theory-grounded instructional levers: instruction-sequencing scaffolds, self-efficacy supports, and trust-calibration interventions.

An Explainable AI Assistant for Introductory Programming Education: Improving Feedback Reliability with Instructor-AI Collaboration

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12425v1 Announce Type: new Abstract: Active learning is widely recognized as an effective approach for improving learning outcomes in introductory programming courses. However, insufficient instructional support often limits students' access to timely, personalized feedback, which is crucial for mastering foundational programming concepts. Although recent advances in AI, particularly large language models, offer scalable opportunities for feedback, concerns about explainability and reliability remain. In this paper, we present an AI-driven classroom assistant that leverages an explainable AI model to analyze student code, map logical errors to instructor-identified misconceptions, and deliver instructor-authored feedback, thereby grounding reliability in instructor-defined pedagogical knowledge. To evaluate the effectiveness of our framework, we conducted an expert evaluation to examine its alignment with instructor-verified feedback and deployed the system in a classroom setting to assess students' perceptions of its usability. Results indicate that the assistant can provide accurate, instructor-verified feedback to students while fostering a positive experience.

Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science

Varun Kotte — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12426v1 Announce Type: new Abstract: LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B instruction-tuned models (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) across six TweetEval tasks under four prompt conditions (72 cells) and find that social-desirability failures do not run in a single direction. Zephyr exhibits leniency bias, systematically under-applying harmful labels (offensive language: false benign rate 0.729, false alarm rate 0.031). Mistral and Qwen exhibit overcorrection, over-applying the same labels (Mistral hate-speech FAR = 0.604). All three models exhibit neutrality bias on abortion stance, underestimating opposition prevalence by 24 to 40 percentage points and inflating the neutral label. None of the four prompting interventions we test (neutral, safety framing, depersonalized, chain-of-thought) corrects these failures across models; safety framing can worsen stance distortion. Strikingly, Zephyr's hate-speech prevalence estimate matches the gold rate exactly while its class-conditional errors are large in both directions, an accidental cancellation that misleads aggregate validation. We translate these patterns into a three-part taxonomy with diagnostic FBR/FAR signatures and a lightweight gold-sample validation protocol. The headline for trustworthy CSS: a model that looks calibrated on aggregate metrics can still flip the substantive empirical conclusion a researcher would report.
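The cancellation effect described in this abstract can be reproduced on toy data. The sketch below assumes that the false benign rate (FBR) is the share of gold-harmful items labeled benign and that the false alarm rate (FAR) is the share of gold-benign items labeled harmful; the abstract does not spell out these definitions, and the labels and function name are hypothetical.

```python
def audit_binary_labels(gold, predicted):
    """Class-conditional error rates and prevalence for 0/1 labels.

    1 = harmful, 0 = benign. FBR and FAR follow the assumed definitions
    given in the text above, not a definition quoted from the paper.
    """
    on_harmful = [p for g, p in zip(gold, predicted) if g == 1]
    on_benign = [p for g, p in zip(gold, predicted) if g == 0]
    if not on_harmful or not on_benign:
        raise ValueError("gold labels must contain both classes")
    return {
        "fbr": on_harmful.count(0) / len(on_harmful),
        "far": on_benign.count(1) / len(on_benign),
        "gold_prevalence": sum(gold) / len(gold),
        "predicted_prevalence": sum(predicted) / len(predicted),
    }


if __name__ == "__main__":
    # Hypothetical sample: 4 harmful and 6 benign gold items.
    gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    # The annotator misses 2 harmful items and wrongly flags 2 benign ones.
    predicted = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
    print(audit_binary_labels(gold, predicted))
```

In this toy sample the two error types offset each other, so the predicted prevalence equals the gold prevalence even though half of the harmful items are missed. This is why an aggregate check alone can hide large class-conditional errors.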

Planning on Paper: Problem Decomposition with Diagrams in Introductory Computing

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12427v1 Announce Type: new Abstract: Background and Context. Problem decomposition is a core concern of computing education. It has also become increasingly relevant: in response to GenAI, many CS1 educators are advocating for shifting instructional emphasis away from code writing and towards decomposition and higher-level planning. Currently, little is known about how novices decompose large, multifunction tasks. Objectives. In this study, we describe how students represent solutions to a decomposition task, and characterize common issues that arise in those representations. Method. In a 50-minute lab, students were given a description of a word game and asked to draw (with pencil and paper) a decomposition diagram for a program that would implement this game. We performed an inductive thematic analysis with negotiated agreement on 55 of the diagrams, coding salient elements (e.g. functions and the relationships between them) and issues that arose. Findings. Students used multiple representational strategies, including hierarchical function calls and sequencing (order of execution). We identified issues in notation (including use of differing, incompatible notations within the same diagram), order of execution, abstraction and reuse, encapsulation, clarity, and problem-specific misunderstandings. Implications. These findings suggest that novice decomposition is shaped by multiple underlying models of program behavior, with tensions between structural and sequence-focused reasoning. We discuss implications for decomposition instruction and future work, including clarifying representational constraints and plan tracing as simulation.

Mapping AI Programs in the U.S.: A Status Report from Early 2026 and an Analysis of AI Majors and Minors

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12428v1 Announce Type: new Abstract: We present a report on the status of undergraduate Artificial Intelligence (AI) programs in the United States in Spring 2026. In so doing, we 1) describe our scraping and mapping tools, which dynamically update to track the state of AI education in the U.S., and 2) create a historical record at a time of great upheaval. The tool we developed, available at https://cicmap.ai, detects, scrapes, and displays data from more than 350 undergraduate AI programs (majors, minors, concentrations, and certificates) at 4-year universities. Our tool searched over 560 institutions to locate these programs, a sample that represents 86% of all undergraduate Computer Science (CS) graduates in the U.S. This tool allows prospective students, guidance counselors, administrators, and faculty to easily access AI program requirements and is designed to continually update as new programs emerge. To the best of our knowledge, this survey represents the most comprehensive snapshot of the state of AI programs in the U.S. to date. With this work we offer three important contributions: 1) a record of AI programs in the U.S. at a time of great upheaval; 2) a tool to explore AI programs and their requirements; and 3) an analysis of the courses required for 66 AI majors and 87 AI minors. Our analysis of majors and minors shows great variability in the size and the requirements of these degrees, but we note two takeaways. First, not all majors require a general AI course, but those that do not instead require a Machine Learning (ML) course. Second, while more than a third of majors require an Ethics in AI course, just under a quarter of AI minors do.

Muse Spark Safety & Preparedness Report

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12429v1 Announce Type: new Abstract: Muse Spark is the latest large language model developed by Meta. In this report, we first present evaluations for catastrophic risk domains under Meta's Advanced AI Scaling Framework, along with the evidence that informed our launch decision. We then discuss additional considerations, such as Muse Spark's broader content safety and behavioral profile, that are relevant to overall safety but fall outside the catastrophic risk domains governed by the Framework. Our preparedness results covering Chemical and Biological, Cybersecurity, and Loss of Control risks assess Muse Spark's deployment within Meta AI as presenting acceptable levels of residual risks under our Advanced AI Scaling Framework. We conducted a broad set of evaluations targeting dual-use and high-risk capabilities across these catastrophic risk domains. Those evaluations identified elevated risks prior to mitigations, with Chemical and Biological capabilities assessed as likely reaching the "high risk" category under the Advanced AI Scaling Framework before safeguards were applied. We have implemented a multi-layered set of mitigations that address the identified risks, and Muse Spark demonstrates state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology. We therefore release Muse Spark as the underlying model of Meta AI.

Will AI Agents Free Us From Meaningless Work? A Human-Centered Analysis

Davide Ghia, Jaspreet Ranjit, Tania Cerquitelli, Daniele Quercia — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12430v1 Announce Type: new Abstract: Some claim that AI agents will free workers from the boring parts of their jobs, yet little is known about how workers themselves identify which tasks should be automated. Prior research focuses on occupations, overlooking that workers experience varying levels of meaning across tasks within the same role. We address this gap with a task-level analysis grounded in Graeber's theory of bullshit jobs. Using ratings from 202 workers on 171 workplace tasks, we (1) validate a five-item scale of perceived bullshitness, (2) show that perceived bullshitness strongly predicts desire for AI delegation, and (3) find that such tasks are also seen as requiring less human oversight. Together, these findings suggest that tasks perceived as bullshit are natural candidates for AI delegation, aligning worker preferences with perceived feasibility.

AI Debris: Residual Risk and the Afterlife of Failed AI Systems

Victor Frimpong — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12432v1 Announce Type: new Abstract: AI governance frameworks primarily focus on risks during the development and deployment phases, implicitly treating system withdrawal as a technical shutdown. This paper argues that decommissioned AI systems generate residual risk, termed AI debris, that persists after model removal and continues to shape institutional behaviour, accountability, and trust. AI debris is defined as the post-withdrawal socio-technical residue of AI systems, including workflow dependency, data contamination, capability displacement (deskilling), legitimacy erosion, and accountability breakdown. The paper develops a typology of debris domains and identifies mechanisms through which debris persists, including institutional memory, path dependency, blame avoidance, and feedback effects in organisational data. To operationalise the concept, the paper proposes an evaluator-ready AI Debris Decommissioning Protocol (AIDP), a stepwise checklist specifying auditable evidence for freezing decision footprints, incident review, remediation, contestability, and post-withdrawal accountability assignment. A brief vignette of Amazon's discontinued hiring tool illustrates how algorithmic decision categories and screening heuristics can persist after system rollback. The paper contributes a practical governance instrument for regulators, auditors, and organisations seeking to prevent paper compliance, strengthen AI lifecycle governance, and improve institutional resilience in high-stakes decision environments.

Marginal Alignment Does Not Guarantee Joint-Distribution Fidelity: An Official-Reference Audit of Nemotron-Personas-Korea with Cross-Locale Replication

Joonhyung Bae — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12433v1 Announce Type: new Abstract: Synthetic persona datasets cite alignment with official demographics as a basis for trust, yet downstream users consume them as joint structures across age, sex, region, occupation, education, name, and institutional status. Marginal alignment does not imply that these joints are preserved. We propose the Independence-Assumption Footprint (IAF), an audit primitive that operates on the attribute combinations a dataset card itself documents as treated independently. For each such combination, IAF compares the synthetic joint against an external official or institutional reference, using direct joint tables where available and rule-implied checks otherwise. Applied to NVIDIA Nemotron-Personas-Korea (one million Korean synthetic personas), IAF finds that NPK aligns with KOSIS marginals while three joints fail. The major-by-occupation distribution against the KEIS graduate universe carries a large conditional mismatch. The age profile of military service is institutionally inconsistent. Female representation in male-dominated occupations is substantially over-flattened toward parity, with the strict screening verdict mapping-dependent and age-robust under direct standardisation. A transferability demonstration across six further NPK locales finds locale-dependent rather than universal diagnostics, with reference-taxonomy cardinality confounding cross-locale flag counts. For synthetic personas used as silicon samples, marginal claims must therefore be paired with disclosure-anchored joint audits before reuse. The released audit artefacts (reference manifests, occupational crosswalks, derived metrics, reproducibility scripts) instantiate this protocol on the NPK family and are released for retargeting at other synthetic persona resources.
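The central claim of this abstract, that matching marginals does not imply a matching joint, can be checked on a toy two-attribute example. The sketch below is not the IAF procedure from the paper; the attribute values, probabilities, and function names are hypothetical, and total variation distance is used here only as a convenient illustrative measure.

```python
from collections import defaultdict


def marginal(joint, axis):
    """Marginal distribution of one attribute of a joint distribution.

    joint maps attribute tuples to probabilities; axis picks the attribute.
    """
    out = defaultdict(float)
    for key, prob in joint.items():
        out[key[axis]] += prob
    return dict(out)


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical reference joint over (sex, occupation) with strong dependence.
reference = {
    ("F", "nurse"): 0.4, ("F", "engineer"): 0.1,
    ("M", "nurse"): 0.1, ("M", "engineer"): 0.4,
}
# Hypothetical synthetic joint that samples the two attributes independently.
synthetic = {
    ("F", "nurse"): 0.25, ("F", "engineer"): 0.25,
    ("M", "nurse"): 0.25, ("M", "engineer"): 0.25,
}

if __name__ == "__main__":
    for axis, name in enumerate(["sex", "occupation"]):
        gap = total_variation(marginal(reference, axis), marginal(synthetic, axis))
        print(name, "marginal gap:", gap)
    print("joint gap:", total_variation(reference, synthetic))
```

Both marginal gaps are zero in this toy case while the joint gap is clearly nonzero, so an audit that only compares marginals would pass a dataset whose attribute combinations are distorted.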

Pluralistic-Alignment Urbanism: Operationalizing a Right to AI for Inclusive Public Space

Rashid Mushkani — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12434v1 Announce Type: new Abstract: Municipal agencies increasingly use machine learning to inventory sidewalks, score streetscapes, and generate visualizations of public-space interventions. These systems produce outputs that enter budgeting, design iteration, and public justification, yet judgments about inclusion, safety, and belonging remain contested. This paper proposes Pluralistic-Alignment Urbanism (PAU), a procedural governance framework that treats public-space AI systems as civic infrastructure and formulates a procedural Right to AI for municipal uses of such systems. Drawing on two participatory case studies with community organizations in Montreal, Canada, the paper examines how disagreement, subgroup variation, bounded predictive scaling, and neutral preference judgments can inform municipal AI governance. Street Review elicits resident criteria for streetscape evaluation and trains a subgroup-aware scaling model for co-produced judgments, achieving an R² of 0.89 on a held-out test set. LIVS, a Local Intersectional Visual Spaces dataset, constructs pluralistic preference data for aligning text-to-image models and treats neutral selections as evidence of indeterminacy. Across the cases, disagreement appears structured, deliberation changes what counts as evidence, scaling is feasible but limited by modality and coverage, and neutrality constrains what preference tuning can justify. PAU translates these constraints into a municipal governance architecture with disaggregated reporting, a versioned value register, standing deliberative cells, procurement clauses, and defined pause and rollback authority.

Auditing Discriminatory Patterns in Mortgage Lending Through Association Rules and Fair Binning

Archit Rathod, Dhwani Chande, Het Nagda — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12435v1 Announce Type: new Abstract: Mortgage lending in the United States exhibits persistent racial and gender disparities. We investigate whether standard data preprocessing steps, specifically attribute binning, amplify these disparities in downstream pattern mining. Using 103,481 cleaned mortgage applications from the HMDA 2023 dataset (Chicago metropolitan area), we build a three-stage pipeline: (1) a PySpark data cleaning and binning pipeline that implements both standard equal-frequency binning and the epsilon-biased fair binning algorithm from Asudeh et al. [1], (2) FP-Growth association rule mining that compares denial patterns under both binning regimes, and (3) K-Means clustering with a per-cluster disparate impact audit. Our standard binning shows 9.63% racial bias in income discretization, consistent with the 8-10% reported in prior work. Fair binning with seven race groups is infeasible at epsilon=0.03 and only succeeds at epsilon=0.08 with a Price of Fairness of 29.4%. FP-Growth reveals that high debt-to-income ratio is the dominant denial predictor (67.2% confidence, 2.81 lift), while racial bias does not appear as explicit high-support rules. However, K-Means clustering followed by a disparate impact audit flags 10 out of 45 cluster-group pairs, showing that Black applicants face significantly higher denial rates than White applicants even among financially similar groups.
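
To make the per-cluster disparate impact audit in the abstract above more concrete, here is a minimal sketch of one common way to compute a disparate impact ratio from approval counts. The function names, the example counts, and the 0.8 ("four-fifths") threshold are assumptions for illustration only; the abstract does not say which formula or threshold the authors use.

```python
def approval_rate(approved: int, total: int) -> float:
    """Fraction of applications approved within one group."""
    if total <= 0:
        raise ValueError("total must be positive")
    return approved / total


def disparate_impact(prot_approved: int, prot_total: int,
                     ref_approved: int, ref_total: int) -> float:
    """Ratio of the protected group's approval rate to the reference group's."""
    ref_rate = approval_rate(ref_approved, ref_total)
    if ref_rate == 0:
        raise ValueError("reference group has no approvals")
    return approval_rate(prot_approved, prot_total) / ref_rate


def is_flagged(ratio: float, threshold: float = 0.8) -> bool:
    """Flag a cluster-group pair when the ratio falls below the threshold.

    The 0.8 default is the conventional four-fifths rule, assumed here.
    """
    return ratio < threshold


# Hypothetical cluster: 60 of 100 protected-group applications approved,
# 80 of 100 reference-group applications approved.
ratio = disparate_impact(60, 100, 80, 100)
print(round(ratio, 2), is_flagged(ratio))
```

In an audit like the one described, this check would be repeated for every cluster-group pair, so that applicants are compared only with financially similar applicants in the same cluster.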

Knowing the Rules Is Not Enough: Student Regulatory Awareness and Use of GenAI in Higher Education

Lasse Bischof, Eva-Maria Schön, Maria Rauschenberger, Michael Neumann — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12436v1 Announce Type: new Abstract: Context: Generative Artificial Intelligence (GenAI) tools such as ChatGPT are increasingly integrated into students' learning practices. While previous research mainly examines adoption rates and attitudes, students' awareness of institutional regulations and their perceived compliance remain unexplored. Understanding whether regulatory awareness influences student behavior is therefore important as higher education institutions create and apply AI policies. Objective: This study investigates how students' awareness of GenAI regulations relates to their perceived compliance and actual usage behavior. Our research objective is to examine the association between regulatory knowledge, GenAI use, and perceived rule conformity among students in computer-science-related study programs. Method: A survey with 151 undergraduate students in Business Information Systems and E-Government programs at the University of Applied Sciences and Arts Hannover (Germany) collected data on GenAI usage, tools used, awareness of institutional regulations, and perceived compliance. Descriptive statistics, cross-tabulations, and correlation analyses were applied. Results: Most students actively use GenAI tools, but over half are uncertain whether their usage complies with institutional regulations. Regulatory awareness shows only weak to moderate associations with actual usage behavior. Students primarily rely on privately accessed GenAI tools rather than institutionally provided solutions. Contributions: The study contributes empirical evidence on the relationship between regulatory awareness and GenAI usage in higher education. Our findings highlight a gap between institutional regulations and student practices and provide insights for educators and institutions on improving policy communication and integrating GenAI more effectively into teaching and learning contexts.

Algorithmic Constitutionalism

Oren Perez, Nurit Wimer — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12437v1 Announce Type: new Abstract: The increasing encroachment of artificial intelligence (AI) on social life raises significant risks for society, particularly within the infospheres created and controlled by companies such as Google, Facebook, Apple, and Amazon. This article examines these risks through an in-depth analysis of Facebook's content moderation regime, which is already partially governed by algorithms. We argue that the idea of ethical engineering, often proposed in the literature as a solution to the governance challenges posed by AI, is inadequate for several reasons. In response, we develop an alternative framework, which we term "algorithmic constitutionalism." Our approach rests on three pillars: (a) a layered architecture consisting of two levels of code: (i) an operative or object level and (ii) a meta level designed to protect the system's core principles from algorithmically initiated change; (b) algorithmic meta-reasoning, which enables the system to operate simultaneously at both levels so that it can monitor, verify, and potentially correct in real time operations at the object level that depart from principles protected at the meta-code level; and (c) correction through deliberation. The article elaborates the concept of algorithmic constitutionalism and demonstrates how it may be applied to Facebook's content moderation regime. As part of this analysis, we examine the tension between societal constitutionalism and algorithmic constitutionalism. Paradoxically, attempts to subject AI systems to external deliberative control may also enable AI agents to intervene in that process, potentially undermining its purpose. The article concludes by considering the implications of this argument for the European Digital Services Act, which entered into force in October 2022.

From Real-World Projects to Research-Oriented Learning: Continuous Improvement of a Master-Level Course in Software Engineering Education

Michael Neumann, Eva-Maria Schön — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12438v1 Announce Type: new Abstract: Problem: Despite growing interest in project-based learning, little is known about how a master-level course can be continuously evolved toward research-oriented approaches over several years and how students perceive this development. Method: We conducted a longitudinal mixed-methods study of a master-level course in Information Systems at the University of Applied Sciences and Arts Hannover (Germany). The analysis covers six years between 2019 and 2025 and draws on teaching evaluations, course documentation, and reflective teaching artifacts. Results: The course evolved from a practice-oriented project format toward a more explicitly research-oriented learning environment. Despite this change, students' perceived course quality remained positive. Authentic projects, external collaboration, lecturer support, structured scaffolding, and visible relevance supported positive student perceptions. Contribution: This paper shows how a master-level course can be continuously evolved toward research-oriented learning while maintaining positive student perceptions. It further identifies the course design decisions that supported this transition.

Position: Generative Engine Optimization Creates Underexamined Risks, Governance Must Target Concentration, Disclosure, and Academic Blind Spots

Yizhu Wen, Nan Zhang, Haohan Yuan, Xun Chen, Haopeng Zhang, Hanqing Guo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12439v1 Announce Type: new Abstract: Large language model (LLM) answer engines are increasingly used for information seeking, shifting visibility from ranked lists to synthesized answers. This enables Generative Engine Optimization (GEO), which targets LLM answer engines' evidence pool and generation. We analyze the search engine optimization (SEO) to GEO transition to identify two risks: (i) concentrated influence from low contestability and system sensitivity, and (ii) undisclosed commercial influence embedded in evidence and reasoning. We then formalize a general GEO pipeline to locate where optimization acts and compare academic and industry practices, revealing a third risk: (iii) academic-industry blind spots driven by visibility and evaluation asymmetries between offline setups and deployed systems. This position argues the need for answer-level governance and measurement: stronger contestability, high-precision disclosure, black-box auditing of material influence, and deployment-aligned metrics for exposure persistence.

It's Safer to Give Personhood to Bears than to Artificial Intelligence

John P. Nelson — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12440v1 Announce Type: new Abstract: Artificial intelligence (AI) developers are rhetorically flirting with the idea that AI systems might have interests or moral rights. While there has been a large volume of research on whether AI deserves rights, there has been less exploration of what AI rights would mean in practice. This paper explores the institutional dimension of AI rights: what it would take to recognize moral or legal rights for AIs, and the attendant opportunities and dangers. Unlike all other nonhuman entities to which humanity has extended rights, AI systems are in principle capable of acquiring and wielding institutional power without human aid and mediation. AIs with rights would be able to legitimately, and AIs with power able to unpreventably, abridge human interests. Accordingly, giving rights even to rather dumb AI systems would entail binding the fate of humanity to potentially unpredictable nonhumans. Accordingly, I defend the rather grandiose claim that to empower AI to claim or to exercise inherent rights would be a world-historical gamble with human self-determination, which no individual researcher, firm, state, or even international organization has the moral right to authorize.

Generativism: Toward a Learning Theory for the Age of Generative Artificial Intelligence

Shan Li, Juan Zheng — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12441v1 Announce Type: new Abstract: The four dominant learning theories of behaviorism, cognitivism, constructivism, and connectivism show significant conceptual limitations as generative artificial intelligence (AI) proliferates in educational settings. These frameworks were formulated before the emergence of AI systems capable of generating, synthesizing, and reasoning about knowledge. This article critically examines each learning theory and identifies assumptions challenged by generative AI's affordances. Drawing on research in distributed cognition, extended mind, human-AI collaboration, AI literacy, cognitive offloading, and metacognition, the article proposes Generativism as a learning theory for the generative AI age. Generativism posits that learning increasingly occurs through the iterative co-construction of knowledge between human learners and AI systems. The proposed framework is organized around four principles: epistemic partnership, distributed agency, generative literacy, and adaptive metacognition. The framework offers a foundation for rethinking instructional design, learning, assessment, and expertise development in contexts where generative AI plays an integral role in cognition.

Reframing AI Loss of Control: What It Is, How to Have It, How to Lose It

Ze Shen Chin, Maurice Chiodo, Dennis Müller, Coleman Snell — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12442v1 Announce Type: new Abstract: At present, loss of control risks have gained much prominence in public discussion, particularly in relation to AI, with extensive discourse present among academics, frontier labs, and even governments. However, in the existing literature, the concept seems to rest on surprisingly weak foundations, where even those that discuss loss of control extensively do not first establish what control is and what exactly is being lost. Our paper aims to address these gaps. We establish a working definition of control by anchoring it to the "setting and getting of goals". Then, we discuss various aspects of control, built on foundational concepts from related fields like cybernetics, management control, and control theory. This includes who (or what) can be in control, and the things they require to be in control, such as the ability to set goals, having a functional control loop, having requisite variety, and having sufficient goal alignment. Once a framework for control is established, we then discuss how control can be lost, how AIs can contribute to such loss of control, and offer relevant recommendations for how one can maintain control. One interesting consequence of our work is that humanity, as individuals and as groups, can lose varying degrees of control as a result of AI behaviour that is far below the level of superintelligence; the potential for loss of control scenarios (as we define them) already exists, and has existed for a long time.

Occupational Prompting Reveals Cultural Bias in Large Language Models

Maksim E. Eren, Andrea Brennen, Ryan C. Barron, Eric Michalak — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12443v1 Announce Type: new Abstract: Social roles shape expectations, priorities, and judgments, yet it remains unclear how large language models (LLMs) associate occupational identities with broader cultural value patterns. Prior work used nationality-based cultural prompting to study how LLM responses to value-survey questions align with human cultural benchmarks. In this paper, we extend that framework by replacing cultural prompting with occupational prompting to examine how professional-role cues influence value-survey responses in open-weight LLMs. Using a survey-grounded evaluation pipeline based on questions from the Integrated Values Surveys, we project model responses into the two-dimensional Inglehart-Welzel cultural space. We prompt open-weight LLMs to answer questions under occupational identities such as accountant, teacher, engineer, and nurse, and then analyze how these occupation-conditioned responses are positioned on the cultural map. Our results show that when open-weight LLMs are prompted with occupations rather than national identities, their responses remain within a broadly Western-leaning region of the cultural map. However, different occupations introduce shifts within this region, producing distinct occupational skews. This indicates that occupational prompts are not treated as neutral role labels, but instead elicit structured value patterns. These findings extend survey-based evaluation of cultural bias beyond nationality-based prompting and provide a framework for studying how occupational personas shape value expression in LLMs.

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12451v1 Announce Type: new Abstract: Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary, fine-tuned in two stages (memorization then retrieval SFT) to use the LLM as a retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation: on RRB queries, several configurations collapse by ~50-64 percentage points compared to fully-specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance, some models score near-random on factual probes, suggesting a knowledge-retrieval dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.

Auto formalisation of Chaitin and of the surprise incompleteness Theorem

Thierry Coquand — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12462v1 Announce Type: new Abstract: This is a continuation of a previous report on an experiment in autoformalisation of Gödel's second incompleteness theorem in Agda using Claude. Using the framework built in this experiment, Claude could "autoformalise" Chaitin's proof of the first incompleteness theorem and then the Kritchman-Raz surprise examination paradox version of the second incompleteness theorem. As with the first experiment, the project provides a case study of the strengths and limitations of current large language models in mathematics. Since Chaitin's proof involves coding programs, Claude had to represent code as ternary strings and could autonomously build a parser and a continuation stack evaluation machine. The fact that we can simulate computations as expected is not completely trivial, and we suggested a Gandy/Howard majorisation argument, which Claude had no problem following. The resulting formalisation clarifies a number of details left implicit in the original presentation and provides a fully machine-checked proof of these arguments for Church's Basic Recursive Arithmetic.

Influence Factors on RAG Poisoning

Pedro Pereira, Eva Maia, Isabel Praça, Adrien Bécue — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12469v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) systems enhance large language models by grounding responses in retrieved documents from external knowledge sources at inference time. However, this reliance on retrieved content introduces vulnerabilities to poisoning attacks, in which adversarial documents can manipulate both the retrieval process and the generated outputs. This paper investigates poisoning robustness in RAG through a full factorial experimental study covering 432 configurations. We analyze the impacts of dataset, retriever type, retrieval depth, database composition, chunking strategy, and generator model on retrieval-level and generation-level metrics. The results show that retriever architecture, dataset, and retrieval depth are the strongest factors affecting poisoning exposure, while generator choice and database composition have a major impact on downstream attack success. Dense and graph-based retrievers generally improve robustness relative to BM25, whereas larger retrieval depth increases the likelihood of retrieving poisoned passages. We further show that replicating poisoned content across multiple databases amplifies adversarial influence, while additional clean sources can mitigate it. These findings highlight that poisoning vulnerability in RAG is not attributable to a single component, but instead arises from the interaction of retrieval, generation, and knowledge-base configuration.

Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12473v1 Announce Type: new Abstract: Background and Objective: Falls among elderly people can cause serious injury and reduce quality of life. Timely prediction and detection are essential to prevent harm and support well-being. We propose a portable, low-power, battery-operated, vision-based fall prediction and detection system using human pose estimation (HPE) on an AMD Kria K26 System-on-Module (SOM). The objective is a non-intrusive, privacy-preserving system for real-time fall detection. Methods: The system uses an Intel RealSense D455 range-sensing camera connected to the K26 SOM by USB. It captures synchronized RGB and depth frames, 640 x 480 x 3 and 640 x 480 pixels, at 60 FPS. The SOM runs a three-stage pipeline with quantized YOLOX, Anchor-to-Joint (A2J), and fall-detection models. YOLOX identifies human bounding boxes from RGB frames, then discards the RGB frames to preserve privacy. A2J uses depth frames to estimate 15 joint keypoints per person. A CNN uses selected joint coordinates (x, y, z) to classify fall activity. YOLOX was trained on CrowdHuman; A2J on ITOP, MP-3DHP, UR Fall Detection, and a custom SDSU PSG dataset; and the CNN on UR Fall Detection and SDSU PSG. The design used a single-core DPU with a serial pipeline and a dual-core DPU running YOLOX and A2J with multiple threads. Results: Quantized accuracy was evaluated using IoU >= 50% for YOLOX, mAP with a 10-cm rule for A2J, and classification accuracy, (TP + TN)/(TP + TN + FP + FN), for the CNN. Accuracies were 74%, 84.13%, and 75.85%. Throughput improved from 2.5 FPS for the single-threaded pipeline to 4.5 FPS for the multi-threaded version. Conclusion: Results demonstrate the feasibility of privacy-preserving fall detection on an AMD Kria K26 edge device. On-device HPE and fall classification run without cloud dependency, supporting elderly monitoring and assistive healthcare. Future work will improve model accuracy and speed.
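
Two of the evaluation criteria named in the abstract above, the IoU >= 50% rule and the classification accuracy formula (TP + TN)/(TP + TN + FP + FN), can be sketched in a few lines. The function names, the corner-coordinate box format (x1, y1, x2, y2), and all example numbers are assumptions for illustration; they are not taken from the paper's code or data.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A detection counts as correct when IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


def classification_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """(TP + TN) / (TP + TN + FP + FN), as written in the abstract."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical numbers, not results from the paper.
print(is_match((0, 0, 10, 10), (2, 0, 12, 10)))
print(classification_accuracy(tp=40, tn=35, fp=15, fn=10))
```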

SAIGuard: Communication-State Simulation for Proactive Defense of LLM Multi-Agent Systems

Ruxue Shi, Yili Wang, Mengnan Du, Qinggang Zhang, Rui Miao, Yixin Liu, Xin Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12474v1 Announce Type: new Abstract: LLM-based multi-agent systems (MAS) solve complex tasks through inter-agent collaboration, but their communication-driven nature also allows security risks to spread across agents and trigger system-wide failures. Existing MAS defenses mainly follow a reactive paradigm after execution by detecting and isolating harmful agents, which may cause irreversible damage and degrade collaborative utility. To address this, we propose a proactive defense framework for MAS security, namely a Simulation-aware Interception Guard (SAIGuard). SAIGuard performs communication-state simulation over the MAS interaction graph, estimates the impact of incoming messages on local agent states and the global MAS state, and detects risky messages via reconstruction deviations from benign communication patterns. Instead of isolating agents, SAIGuard sanitizes or regenerates suspicious messages before they propagate into the system. Experiments across diverse topologies and attack scenarios show that SAIGuard reduces attack success rates while maintaining MAS utility, outperforming reactive defenses.

Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

Leo Xu, Letian Li, Alex Cuellar, Michael Hagenow — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12475v1 Announce Type: new Abstract: Human-robot collaboration (HRC) combines the complementary strengths of humans and robots to improve task efficiency. However, many existing collaborative systems rely on hand-engineered pipelines, limiting their scalability and flexibility for new tasks. In this work, we show that models trained end-to-end with imitation learning, specifically vision-language-action (VLA) models, can support collaborative manipulation, and characterize the key factors affecting their real-world performance. We evaluate two state-of-the-art models and identify a failure mode of action-chunking policies in implicit HRC, where demonstration action leakage (i.e., action chunks crossing latent task transitions) can cause premature assistive behavior. We find that this issue increases with longer execution horizons and occurs in real-world collaborative VLA systems, such as when a robot attempts to hand over a tool before the person is ready. We propose an inference-time steering method to mitigate these erroneous assistive actions while preserving policy performance. Finally, through a 16-participant user study on a long-horizon collaborative assembly task, we show that steering enables a longer execution horizon while mitigating premature assistance, leading to faster collaboration and fewer failures compared to a shorter-horizon policy.

Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

Igor Itkin — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12476v1 Announce Type: new Abstract: Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment; at a matched false-alarm rate it detects in 11-13 tokens, against 31 for a linear per-token baseline, and a controlled decomposition attributes most of this advantage to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
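
For readers unfamiliar with CUSUM, the textbook one-sided recursion is S_t = max(0, S_{t-1} + z_t), with an alarm raised the first time S_t reaches a threshold h. The sketch below implements that classical form and measures detection delay in tokens. The function names, the increment values, the threshold, and the convention that delay equals the alarm index minus the onset index are all assumptions for illustration; in the paper the increment is learned by a recurrent labeler, which is not reproduced here.

```python
from typing import Optional, Sequence


def cusum_alarm(increments: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first token at which the CUSUM statistic
    reaches the threshold, or None if it never does."""
    stat = 0.0
    for t, z in enumerate(increments):
        # Accumulate evidence for a change; reset to zero when it goes negative.
        stat = max(0.0, stat + z)
        if stat >= threshold:
            return t
    return None


def detection_delay(alarm: Optional[int], onset: int) -> Optional[int]:
    """Tokens elapsed between onset and alarm (None for a miss or false alarm)."""
    if alarm is None or alarm < onset:
        return None
    return alarm - onset


# Hypothetical per-token scores: negative while faithful, positive after onset.
scores = [-1.0, -1.0, -1.0, 2.0, 2.0, 2.0]
alarm = cusum_alarm(scores, threshold=5.0)
print(alarm, detection_delay(alarm, onset=3))
```

Raising the threshold lowers the false-alarm rate at the cost of a longer delay, which is the tradeoff the abstract's matched false-alarm comparison controls for.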

Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention

Gilhan Kim, Daniel K. Park — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12478v1 Announce Type: new Abstract: Attention mechanisms are central to modern sequence models, yet standard attention computes relevance primarily through individual query-key similarities. Although softmax normalization introduces competition among positions, a standard attention layer does not explicitly parameterize learnable interactions between attention decisions. This limits its ability to directly model cooperative or antagonistic co-attention structure within the attention mechanism itself. We propose Boltzmann attention, an energy-based generalization in which attention patterns are governed by an interacting Ising model. The method augments the usual data-dependent local fields with learnable pairwise couplings, allowing the model to represent inter-position correlations beyond those captured by softmax or sigmoid attention. Experiments on character-level language modeling and synthetic bracket matching show that Boltzmann attention consistently improves over standard softmax attention within a standard Transformer architecture, with the advantage becoming more pronounced as sequence length increases. A four-way ablation confirms that the improvement arises from the learnable pairwise couplings. These results suggest that explicit inter-position interactions provide a principled enhancement for attention-based sequence modeling. Moreover, the Ising formulation opens a natural path toward quantum-computing-based sampling strategies: we demonstrate that diabatic quantum annealing provides a practical training method while maintaining competitive performance with exact Boltzmann computation.

ReCal: Reward Calibration for RL-based LLM Routing

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12479v1 Announce Type: new Abstract: Large language model (LLM) routing has emerged as an effective paradigm for leveraging the complementary strengths of multiple LLMs through dynamic model and reasoning-strategy selection. Recent reinforcement learning (RL)-based routing methods further improve routing quality by optimizing routing policies from interaction feedback. However, they still struggle to provide informative and comparable learning signals under heterogeneous tasks with varying difficulty. In practice, multiple objectives (e.g., correctness, format behavior) are aggregated into a single scalar reward, leading to ambiguous credit assignment and conflicting optimization signals. Moreover, reward signals exhibit significant variability across instances, where some instances produce higher or more variable rewards, introducing optimization bias that favors trivial samples over informative ones. To address these issues, we propose ReCal, a Reward Calibration framework for RL-based LLM routing. We first introduce a hierarchical reward decomposition mechanism with component-wise advantage estimation. We further propose a distribution-aware optimization strategy that calibrates optimization variability through variance-aware reweighting and per-dataset normalization. Experiments on seven datasets demonstrate that ReCal consistently improves routing performance and training stability over baselines. Code is available at https://anonymous.4open.science/r/ReCal.

Representing Time Series as Structured Programs for LLM Reasoning

Jaeho Kim, Changhun Oh, Seokhyun Lee, Irina Rish, Changhee Lee — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12481v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated strong reasoning and instruction-following capabilities, making them potentially powerful tools for time-series analysis. However, time series lie outside their native textual modality, raising a fundamental question: how should time series be represented so that LLMs can reason about them effectively? Existing work typically serializes raw numerical sequences or fine-tunes pre-trained LLMs on time-series data. These approaches place the burden of extracting temporal structure directly on the LLM, creating a modality mismatch that often degrades performance on long sequences and introduces substantial computational overhead. In this work, we introduce Time-Series-to-Structured-Program representation (T2SP), a deterministic, training-free method that represents a time series as a structured symbolic program. T2SP decomposes time series into trends, periods, and salient events, expressing them in a program-friendly format aligned with the textual and code-like modalities on which LLMs are natively trained. By shifting temporal-structure extraction from the model to the representation itself, T2SP enables off-the-shelf LLMs to leverage their existing reasoning capabilities for time-series understanding. We evaluate T2SP on three reasoning tasks (editing, captioning, and question answering), where it consistently improves performance, reduces reasoning time, and lowers failure rates compared with raw-string representations. Our results demonstrate that T2SP provides an effective interface between time series and LLMs.

Scalable anomaly detection via a univariate Christoffel function

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12483v1 Announce Type: new Abstract: Anomaly detection plays a critical role in identifying unusual patterns across domains such as fraud detection, network intrusion, and system fault diagnosis. Recently, Christoffel function-based methods, rooted in polynomial optimization, have emerged as promising alternatives to deep learning due to their strong mathematical foundations and computational frugality. However, their practical applicability is hindered by the need to invert a matrix whose size grows exponentially with the data dimension, rendering the method intractable even for moderate-dimensional datasets. This paper addresses the dimensionality limitations of Christoffel function-based anomaly detection while preserving its key theoretical properties, i.e., the on-off support dichotomy behavior and the accurate support shape capture. We introduce UCF, a univariate Christoffel function which is based on the squared distance between the query point and the support points. Extensive experiments on the ADBench benchmark demonstrate that UCF consistently outperforms 14 state-of-the-art baselines in terms of Average Precision. By resolving the scalability bottleneck of the Christoffel Function, this work expands the toolkit of anomaly detection methods with a robust, theoretically grounded, and universally applicable approach.

Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12485v1 Announce Type: new Abstract: Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushing the page state into an irrecoverable regime. Conversely, premature or excessive intervention causes the agent to become overly reliant on expert policies, trapping the model in local optima characterized by a single, rigid trajectory. We propose Speculative Rollback Correction (SRC), a branch-level imitation framework for resettable agent environments. Instead of requesting teacher labels at every visited state or correcting only after a completed trajectory, SRC uses fixed-horizon branch review: the student executes a short speculative segment before teacher review, and the teacher localizes the first harmful deviation only when local progress breaks. Rollback preserves useful prefixes, while successful rollouts are filtered by a hard verifier and retained in a lightweight quality-diversity archive. The resulting data supports next-action supervised fine-tuning on both localized corrections and verifier-passing trajectories. On WebArena-Infinity, SRC collects 977 verifier-passing trajectories and 9,183 next-action examples; fixed-horizon review improves the recovery-versus-query tradeoff over step-level review while retaining verifier-passing solution variants. Code is available at https://github.com/LongkunHao/SRC_gui_agent.

An Empirical Study on Predictive Maintenance for Component X in Heavy-Duty Scania Trucks

Valeriu Dimidov, Sasan Jafarnejad, Raphaël Frank — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12486v1 Announce Type: new Abstract: Condition-based Predictive Maintenance (PdM) for truck fleets has gained momentum in recent years. This maintenance strategy aims to minimize unplanned downtimes and reduce costs by monitoring the health status of vehicles and taking proactive action based on their condition. However, the implementation of condition-based PdM systems is challenging due to the large volume of data generated by the trucks, the inherent complexity of detecting failures through sensor data and the difficulties in finding cost-effective trade-offs in the solution's implementation. In this paper, we define and validate a condition-based PdM methodology built on the assumption that the wear-and-tear state of the monitored component can be represented as a monotonically non-decreasing time series. It involves selecting only the most recent observations from the time series and transforming them into a tabular format for classification using machine learning (ML) models designed for tabular data. Our results indicate that the proposed methodology reduces costs on the Scania Component X dataset compared to current state-of-the-art (SOTA) approaches, while also simplifying the modeling process through AutoML.
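
The step of keeping only the most recent observations and flattening them into a tabular row could look roughly like the sketch below. The window length `k`, the left-padding rule, and the feature names are hypothetical; the abstract does not specify them, and the input is assumed to be a non-empty list of wear readings for one vehicle.

```python
def is_non_decreasing(series):
    """Check the monotonicity assumption the methodology relies on."""
    return all(a <= b for a, b in zip(series, series[1:]))

def latest_window_features(series, k):
    """Flatten the k most recent readings into one feature row.

    Short histories are left-padded with their earliest value
    (an illustrative choice, not taken from the paper).
    """
    window = list(series[-k:])
    if len(window) < k:
        window = [window[0]] * (k - len(window)) + window
    # lag_0 is the newest reading, lag_{k-1} the oldest one kept.
    features = {f"lag_{k - 1 - i}": v for i, v in enumerate(window)}
    features["total_increase"] = window[-1] - window[0]
    return features
```

Each vehicle then contributes one fixed-width row, which is the shape that tabular classifiers and AutoML tools expect.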

DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics

Zimo Zhao, Maolin Wang, Bowen Yu, Bowen Liu, Xiao Han, Xiangyu Zhao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12487v1 Announce Type: new Abstract: Post-training quantization (PTQ) is essential for efficient large language model inference, but reliably quantizing activations remains challenging when weights, activations, and KV caches are all quantized to 4-bit precision. A key difficulty lies in massive activations, whose extreme values dominate the activation range and amplify quantization errors. State-of-the-art methods mainly mitigate massive activations through transformation-based smoothing, such as orthogonal rotations and affine scaling, but overlook the cross-layer dynamics of the residual stream. In this paper, we show that massive activations emerge and disappear in a phase-wise pattern across network depth, triggering large residual changes. These changes cause newly injected layer-wise updates to dominate the 4-bit quantization scale and weaken historical residual information. To characterize this behavior, we introduce Jump Ratio and Historical Feature SNR. This suggests that static transformation-based smoothing cannot fully resolve dynamic quantization instability caused by cross-layer residual changes. Based on this analysis, we propose DynamicPTQ, a Dynamic Post-Training Quantization policy for phase-aware mixed-precision activation quantization. DynamicPTQ identifies quantization-sensitive layers from residual-stream dynamics and assigns 8-bit activation precision only to these layers, while keeping weights, KV caches, and other activations in 4-bit precision. It can be directly integrated with strong PTQ baselines such as QuaRot, SpinQuant, and FlatQuant. Experiments on LLaMA-2 and LLaMA-3 show that DynamicPTQ consistently improves perplexity and zero-shot QA performance under W4A4KV4 quantization, while achieving 1.05 to 1.07 times throughput improvement with modest memory overhead. These results demonstrate a practical path toward robust low-bit LLM inference.
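
The claim that a single massive activation dominates the quantization range can be illustrated with a small sketch. It uses plain symmetric per-tensor quantization, which is an assumption for illustration and not the DynamicPTQ policy or its Jump Ratio metric; the input is assumed to contain at least one nonzero value.

```python
def quantize_dequantize(values, bits):
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax  # the largest magnitude sets the scale
    return [round(v / scale) * scale for v in values]

def mean_squared_error(original, restored):
    return sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)

# One outlier (50.0) next to ordinary small activations.
activations = [0.1, -0.2, 0.3, 50.0]
error_4bit = mean_squared_error(activations, quantize_dequantize(activations, 4))
error_8bit = mean_squared_error(activations, quantize_dequantize(activations, 8))
```

With 4 bits the outlier forces a step size of about 7.1, so every small activation collapses to zero. At 8 bits the step shrinks to about 0.39 and the error drops, which is the intuition behind spending higher precision only on the layers where such jumps occur.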

A Stationary (and Therefore Compatible) Representation is All You Need

Niccolò Biondi, Federico Pernici, Simone Ricci, Alberto Del Bimbo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12488v1 Announce Type: new Abstract: Learning compatible representations aims to learn feature representations that can be used interchangeably over time whenever a model undergoes updates. In this paper, we demonstrate that stationary representations learned by $d$-Simplex fixed classifiers imply compatibility in the sense of its formal definition. This result establishes a foundation for future work and can be directly exploited in practical learning scenarios. We address the challenge of learning compatibility using $d$-Simplex fixed classifiers when the model is sequentially fine-tuned. Learning according to a $d$-Simplex fixed classifier with the cross-entropy loss aligns feature distributions at the first-order statistics. Consequently, it may not fully capture higher-order dependencies in the representation between model updates. To address this issue, we demonstrate that training the model using a $d$-Simplex fixed classifier through a convex combination of the cross-entropy loss and a contrastive loss not only captures higher-order dependencies, but is also equivalent to learning with the cross-entropy under the compatibility constraints. We confirm our findings with extensive experiments, also considering a new scenario where a pre-trained model is sequentially fine-tuned and occasionally replaced with an improved model. We show that stationary representations enable uninterrupted retrieval services (without reprocessing gallery images) while improving performance during model updates and replacements, achieving state-of-the-art results. Code at https://github.com/miccunifi/iamcl2r.
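
For readers unfamiliar with fixed simplex classifiers, here is a minimal sketch of one standard way to build class prototypes at the vertices of a regular simplex and use them as a frozen final layer. The construction (centred one-hot vectors, normalised, embedded in K dimensions rather than K-1) and the function names are assumptions for illustration and may differ from the authors' implementation.

```python
import math

def simplex_prototypes(num_classes):
    """Return num_classes unit vectors forming a regular simplex.

    Each prototype is a one-hot vector with the mean subtracted, then
    normalised; any two prototypes have cosine -1 / (num_classes - 1).
    """
    k = num_classes
    prototypes = []
    for i in range(k):
        row = [(1.0 if j == i else 0.0) - 1.0 / k for j in range(k)]
        norm = math.sqrt(sum(x * x for x in row))
        prototypes.append([x / norm for x in row])
    return prototypes

def fixed_classifier_logits(feature, prototypes):
    """Logits are dot products with the frozen prototypes (never trained)."""
    return [sum(f * w for f, w in zip(feature, proto)) for proto in prototypes]
```

Because the prototypes never change, features learned at different times are pulled toward the same fixed directions, which is the intuition behind a stationary representation.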

Masked Neural Detection for Constrained Channel Coding in Molecular Communication

Melih Şahin, Ozgur B. Akan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12489v1 Announce Type: new Abstract: Molecular communication (MC) suffers from severe diffusion memory because molecules released for one symbol may arrive during later symbols. Neural sequence detectors, especially sliding bidirectional recurrent neural networks (SBRNNs), can substantially outperform threshold detectors in such channels. This raises a central question for MC channel coding: does a code whose advantage was established under threshold detection retain it when both coded and uncoded transmission are evaluated with neural detection? This letter answers this question for run-length-limited ISI-mitigation (RLIM) codes, a class of constrained codes previously shown to provide large BER gains in MC. Across the tested operating points, the best RLIM-SBRNN receiver beats the best uncoded receiver, chosen between threshold and SBRNN detection, in $46$ of $59$ cases, with a mean gain of $10.36\times$ over those wins. We also propose an RLIM-tailored training mask for compact SBRNN detectors, improving the unmasked RLIM-SBRNN in $227$ of $236$ comparisons with $3.267\times$ mean gain when masking is beneficial. Finally, the compact masked RLIM-SBRNN is competitive with channel-state-aware MLSE despite using no channel knowledge.

Robustness Verification of Recurrent Neural Networks with Abstraction Refinement

Li-Jen Lin, Chih-Duo Hong — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12490v1 Announce Type: new Abstract: Certified local robustness verification for recurrent neural networks (RNNs) is challenging because approximation errors introduced by nonlinear relaxations can propagate through recurrent connections and accumulate over time. As a result, scalable linear bound propagation methods often become overly conservative and fail to certify inputs that are in fact robust, especially when many pre-activation intervals cross zero. We propose an abstraction-refinement framework for RNN verification that partitions such intervals to remove the dominant relaxation error: on each refined branch, ReLU becomes exact, and smooth activations such as tanh and sigmoid admit substantially tighter linear envelopes. To control the combinatorial cost of splitting in long sequences, we introduce a SHAP-guided timestep selection strategy that ranks hidden states by their contribution to the verification objective and refines only the most critical timesteps in temporal order. Experiments on CIFAR10 and MNIST stroke benchmarks demonstrate consistent improvements in verification success and robustness-margin tightness over abstraction-only baselines, while exposing clear runtime trade-offs between ReLU and tanh models.

Net-Ev$^2$: A Generative Simulator for Network Event Evolution

Guangyu Wang, Zhaonan Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12494v1 Announce Type: new Abstract: Reducing real-world trial and error has long been a central goal of decision making, and generative simulators advance this goal by modeling the evolution of future states. An even more challenging yet meaningful task is simulating how disturbance events (e.g., accidents) propagate their impacts across real-world networks. Existing approaches fall short in modeling both the structured attributes and the unstructured semantics of events, and in capturing topological structure when simulating network event evolution. We therefore propose Net-Ev$^2$ (Network Event Evolution), a novel generative simulator that jointly leverages event cues while preserving network topology in simulations. Specifically, the framework consists of two stages: structure-guided masked pre-training and a topology-aware diffusion process, the latter realized through U-Net-like graph downsampling and upsampling during denoising. At inference time, Net-Ev$^2$ can generate simulations using natural-language event input only, with greater flexibility for practical usage. Furthermore, we introduce Net-Ev$^2$-6.5M, a multimodal benchmark of aligned event and network traffic data across four large-scale road networks, as well as a new topology-aware metric, namely JL-MMD, to evaluate topological fidelity in generated network dynamics. Extensive experiments demonstrate the state-of-the-art performance and strong generalization ability of Net-Ev$^2$. Code is made available at https://github.com/Guangyu4/Net-Ev-2.

Missing-Token Prompted Reliability-Aware Fusion for Robust Polyglot Speaker Identification

Peng Jia, Li Dai, Jia Li, Zhenzhen Hu, Ye Zhao, Richang Hong — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12495v1 Announce Type: new Abstract: Accurate and robust multimodal speaker identification is essential for multimedia understanding and biometric authentication. However, real-world polyglot scenarios pose two key challenges: speaker-discriminative representations should generalize across languages, and the model should remain reliable when face information is unavailable. To address these challenges, we propose MRAF, a Missing-Token Prompted Reliability-Aware Fusion framework for polyglot speaker identification across complete-modality, missing-face, and cross-lingual scenarios. MRAF represents unavailable face inputs with a learnable missing token instead of fixed zero-valued features, providing a trainable representation of the missing visual state. This design reduces the distribution gap caused by missing inputs and allows subsequent reliability estimation and cross-modal fusion to operate within a unified token space. To adaptively integrate modalities with different reliability, MRAF further introduces a reliability-aware cross-attention fusion module, which estimates face and audio reliability scores, normalizes them into modality weights, and applies these weights to token representations before bidirectional cross-attention. In this way, the model can emphasize reliable modality cues while suppressing unreliable ones. During training, MRAF jointly optimizes multi-branch classification losses, audio-only knowledge distillation, and center loss to improve speaker discrimination and missing-modality robustness. Experiments on the official POLY-SIM 2026 test set demonstrate the effectiveness of the proposed framework. In the final evaluation, MRAF achieves 100% accuracy on P3 and P5, and obtains competitive results on the more challenging missing-face settings P4 and P6. The source code will be released at https://github.com/MSA-LMC/MRAF.

$\mu$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12497v1 Announce Type: new Abstract: Vision-language-action (VLA) models predict chunks of future actions from the current observation, an assumption that fails under partial observability, where decisions depend on information no longer visible. Existing memory-augmented VLAs simultaneously introduce recurrence, retrieval, compression modules, auxiliary objectives, hierarchical memory, or task-specific architectural changes, so the contribution of recurrence itself remains entangled with surrounding machinery. We present a controlled isolation study of recurrence in a strong pretrained VLA backbone. Our formulation augments the transformer with a small set of learnable memory tokens carried across timesteps and updated through self-attention, trained end to end with truncated backpropagation through time, with no auxiliary losses and no architectural changes. We instantiate this as $\mu$VLA, a family of OpenVLA-OFT variants parameterized by memory width m, TBPTT length K, and the memory update rule (cross-step gradients or a detached EMA), so that recurrence is the only varying factor. On MIKASA-Robo, $\mu$VLA improves average success rate on five training tasks from 0.42 to 0.84 at the strongest setting and reaches 0.23 on held-out tasks with the same memory structure versus 0.07 for the memoryless baseline. On tasks requiring different memory structure, performance remains near baseline. On LIBERO, the strongest recurrent variant achieves 96.2% average success, indicating no regression under full observability. We interpret these results as a calibration of the capability envelope of minimal in-backbone recurrence, identifying the regime in which it is sufficient and the regime where additional memory structure is required. Demos and videos can be found in https://avanturist322.github.io/mu-vla/.

From Parameters to Feature Space: Task Arithmetic for Backdoor Mitigation in Model Merging

Zhenqian Zhu, Yamin Hu, Yiya Diao, Weixiang Li, Haodong Li, Wenjian Luo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12498v1 Announce Type: new Abstract: Model merging (MM) has gained significant attention as a cost-effective approach to integrate multiple task-specific models into a unified model. However, recent work reveals that MM is highly susceptible to backdoor attacks. Existing defenses based on task arithmetic often fail to eliminate backdoors without substantially degrading clean-task performance, owing to their reliance on direct parameter-space editing. To address this gap, we propose Linear Feature Path Minimization (LFPM), a backdoor mitigation framework for model merging, which introduces an anti-backdoor task vector into the backdoored merged model. Unlike prior approaches, LFPM formulates the backdoor robustness of the merged model from a unified feature-space perspective under the Cross-Task Linearity (CTL) framework, which leverages the approximate linearity of features across tasks. This perspective guides the optimization of the anti-backdoor task to suppress backdoors while preserving clean-task performance. Furthermore, we introduce an effective optimization mechanism based on gradient accumulation and loss path-integral, ensuring robust backdoor suppression along the interpolation path. Extensive experiments demonstrate that LFPM consistently exhibits strong robustness against backdoor attacks in both full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) settings.
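
The parameter-space arithmetic that model merging builds on can be sketched in a few lines. The sketch shows only the generic task-arithmetic step (a task vector is the fine-tuned weights minus the pretrained weights) and where an extra corrective vector would be added; how LFPM actually optimizes its anti-backdoor vector in feature space is not reproduced here, and the scalar "weights" and scaling factors are hypothetical.

```python
def task_vector(finetuned, pretrained):
    """Per-parameter difference between a fine-tuned and the pretrained model."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}

def add_task_vectors(base, vectors, scale):
    """Add scaled task vectors to a base set of parameters."""
    merged = dict(base)
    for vector in vectors:
        for name, delta in vector.items():
            merged[name] += scale * delta
    return merged

# Toy models with a single scalar parameter each.
pretrained = {"w": 1.0}
model_a = {"w": 3.0}
model_b = {"w": 2.0}

vectors = [task_vector(model_a, pretrained), task_vector(model_b, pretrained)]
merged = add_task_vectors(pretrained, vectors, scale=0.5)

# A corrective vector (here just a placeholder value) is applied the same way.
anti_backdoor = {"w": -0.5}
repaired = add_task_vectors(merged, [anti_backdoor], scale=1.0)
```

In a real setting each entry would be a tensor and the corrective vector would come from an optimization procedure rather than a constant.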

Action-Effect Memory Pretraining for Robot Manipulation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12499v1 Announce Type: new Abstract: We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods that mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation, serving as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.

Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

Xian Liu, Carlo G. Prato, Gustav Markkula — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12500v1 Announce Type: new Abstract: Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to potentially improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.

Policy-driven Conformal Prediction for Trustworthy QoT Estimation

Kiarash Rezaei, Omran Ayoub, Paolo Monti, Carlos Natalino — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12501v1 Announce Type: new Abstract: We propose Conformal QoT, a policy-driven framework that combines statistically guaranteed QoT estimation with operational decision policies, enabling reliable lightpath-feasibility predictions under domain shift and improving accuracy from 92% to 99.6% on open datasets.
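
As background, the sketch below shows standard split conformal prediction for a regression estimate, followed by a simple decision rule on the resulting interval. The abstract does not describe the paper's actual policies, so the `is_feasible` rule, the threshold, and all names are hypothetical.

```python
import math

def conformal_quantile(calibration_scores, alpha):
    """Finite-sample quantile of absolute residuals from a calibration set.

    Uses rank ceil((n + 1) * (1 - alpha)); if that exceeds n, the
    interval is unbounded.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]

def prediction_interval(point_estimate, quantile):
    return point_estimate - quantile, point_estimate + quantile

def is_feasible(point_estimate, quantile, required_margin):
    """Hypothetical conservative policy: accept only if the lower bound clears the margin."""
    lower, _ = prediction_interval(point_estimate, quantile)
    return lower >= required_margin
```

A policy of this kind trades some accepted lightpaths for fewer false feasibility decisions, since it acts on the lower end of the interval instead of the point estimate.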

Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12503v1 Announce Type: new Abstract: Self-supervised learning (SSL) has opened new opportunities in bioacoustics by enabling scalable modeling of animal vocalizations without the need for expensive manual annotation. However, current SSL models in this domain prioritize broad generalization across species and are not optimized for uncovering the fine-grained structure of individual communication systems. In this work, we collect and release a novel dataset of over five years of longitudinal recordings, from five known dolphins in a semi-naturalistic marine environment, an unprecedented resource for studying dolphin communication. We adapt the Wav2Vec2.0 architecture (Baevski et al., 2020) to this domain and introduce Dolph2Vec, the first large-scale, species-specific SSL model trained exclusively on this data. We benchmark our model on two biologically relevant tasks: signature whistle classification and whistle detection. Dolph2Vec significantly outperforms general-purpose baselines in both tasks. Beyond performance, we show that learned embeddings and codebook structure capture interpretable acoustic units aligned with dolphin whistle categories and possibly sub-whistle structure, enabling fine-grained analysis of communication patterns. Our findings demonstrate how SSL can serve as both a model and a scientific tool to explore hypotheses in animal communication research.

A Type Theory of Sense: Witnessed Choice in Stratified Semantic Spaces

Iman Poernomo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12504v1 Announce Type: new Abstract: We introduce TTS, a dependent type theory in which semantic composition is represented by horn filling and distinctions between possible completions are witnessed relative to explicit measurement regimes. TTS replaces globally canonical composition with regime-indexed indiscernibility and constructive apartness, allowing filler spaces to be classified as canonical when all completions are observationally connected and forked when two warranted completions are positively separated. Separation witnesses enter the calculus only through measurement contexts recording actual instrument outputs, yielding conservativity, provenance, and a no-fork-from-the-empty-record result. We prove that forks persist under refinement while canonicity may fail, and characterize exactly when an identification made by one regime can consistently coexist with a separation made by another. This framework supports a geometric account of Fregean sense as a choice of filler, reference as the boundary constraining that choice, and hyperintensional difference as measured apartness, while providing a falsifiable bridge to stratified representation spaces and branching behaviour in language-model generation.

Boosting Direct Preference Optimization with Penalization

Pengwei Sun — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12505v1 Announce Type: new Abstract: Offline preference optimization has become a practical substitute for reinforcement learning from human feedback, but pairwise objectives such as Direct Preference Optimization (DPO) and its variants use only the chosen and rejected responses stored in a static dataset. This leaves a useful signal unused: the response that the reference model itself would generate for the same prompt. We propose Direct Preference Optimization with Penalization (DPOP), a simple extension of DPO that augments the base preference loss with a gated penalty on reference-greedy responses. DPOP activates this penalty only when the current policy still assigns a lower likelihood to the preferred response than to the rejected response. On AlpacaEval 2.0, DPOP improves length-controlled win rate over DPO, SimPO, and AlphaDPO on both Llama-3-8b-it and Gemma-2-9b-it, achieving relative gains of 5.3% and 4.4% over baselines on the two models, respectively. Ablations further show that a SimNPO-style length-normalized penalty is stronger than NPO and token-level unlikelihood in this setting.
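
The gating described above can be sketched on top of the standard DPO loss. The DPO term follows its usual published form; the penalty itself is passed in as a precomputed number because the abstract does not give its formula, and the `penalty_weight` parameter is a hypothetical name.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """Standard DPO loss from sequence log-probabilities: -log(sigmoid(margin))."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))

def gated_penalized_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
                         beta, penalty, penalty_weight):
    """DPO loss plus a penalty that is active only while the policy still
    ranks the rejected response above the chosen one."""
    gate = 1.0 if policy_chosen < policy_rejected else 0.0
    base = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta)
    return base + gate * penalty_weight * penalty
```

Once the policy prefers the chosen response, the gate closes and the objective reduces to plain DPO, so the extra signal is spent only on pairs the policy still gets wrong.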

Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12507v1 Announce Type: new Abstract: Rubrics have emerged as an alternative to RLVR in open-ended domains where a single ground-truth final answer is not available. Existing rubric-based training methods rely on an LLM verifier that scores each rollout against rubrics. This introduces substantial training-time overhead, exposes optimization to verifier-specific biases, and reduces rubric feedback to a sparse end-of-trajectory signal. We propose Rubric-Guided Self-Distillation (RGSD), a verifier-free training method in which the base policy, conditioned on the rubric, serves as the teacher for the unconditioned student. RGSD distills the rubric-conditioned teacher distribution into the student token-by-token, replacing sparse trajectory-level rewards with dense per-token learning signals and removing the LLM judge from the training loop entirely. Across Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models on medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO while using one on-policy rollout per prompt and no training-time verifier calls. Ablations show that raw rubrics provide a stronger teacher enrichment signal than self-generated reference responses, while a stronger GRPO judge can outperform RGSD in some settings, positioning RGSD as a complementary verifier-free alternative when verifier cost or reliability is the bottleneck.
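
A token-by-token distillation signal of the kind described can be sketched as a per-position divergence between the teacher's and the student's next-token distributions. The KL direction (teacher to student), the averaging over positions, and the toy two-token vocabulary are assumptions; the abstract does not state which divergence RGSD uses.

```python
import math

def token_kl(teacher_probs, student_probs):
    """KL(teacher || student) over the vocabulary at one position."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)

def distillation_loss(teacher_sequence, student_sequence):
    """Average the per-token divergence over all positions of a rollout."""
    per_token = [token_kl(t, s) for t, s in zip(teacher_sequence, student_sequence)]
    return sum(per_token) / len(per_token)

# Teacher: same model conditioned on the rubric. Student: no rubric in context.
teacher = [[0.5, 0.5], [0.9, 0.1]]
student = [[0.25, 0.75], [0.9, 0.1]]
loss = distillation_loss(teacher, student)
```

Every position contributes its own term, which is what makes the signal dense compared with a single end-of-trajectory reward.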

Foresight: Iterative Reasoning About Clues that Matter for Navigation

Arthur Zhang, Carl Qi, Donne Su, Xiangyun Meng, Amy Zhang, Joydeep Biswas — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12550v1 Announce Type: new Abstract: Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works are limited by their reliance on known navigation factors and closed-set factor categories, or identify cues before motion planning and miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real-time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: https://amrl.cs.utexas.edu/foresight

Crossing the Validation Crisis: Cross-Validation Reduces Benchmarking Variance Surprisingly Well

Célestin Eve, Gaël Varoquaux, Thomas Moreau — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12552v1 Announce Type: new Abstract: Modern machine learning progresses through empirical work, benchmarking new methods to evaluate relative performance. However, the statistical variability inherent to evaluation - exacerbated by the stochastic nature of many algorithms - often makes performance estimation unreliable due to the limited test samples available, leading to a validation crisis in which genuine advances are difficult to discern. In this work, we show that cross-validation markedly improves confidence when evaluating and comparing the performance of learning algorithms. We introduce the concept of sample gain, which quantifies the virtual data augmentation achieved by using multiple cross-validation splits to reduce benchmarking variance. Experiments on both synthetic and real-world datasets (histopathologic scans and NLP fine-tuning) demonstrate that multiple splits can substantially improve the reliability and stability of performance estimates, with diminishing returns often setting in later than expected. We also introduce a procedure to dynamically early-stop cross-validation by estimating from the first few folds if subsequent folds will bring large sample gains. Our findings highlight the value of pushing cross-validation on available samples to achieve robust and reliable benchmarking.
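
For reference, a minimal sketch of k-fold splitting and of summarising per-fold scores is given below. It does not implement the paper's sample gain or its early-stopping procedure, whose definitions are not given in the abstract. The spread estimate is a naive one that treats folds as independent, which they are not, so it should be read as a rough indication only.

```python
from statistics import mean, stdev

def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, near-equal folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

def summarize(fold_scores):
    """Mean score and a naive standard error across folds."""
    return mean(fold_scores), stdev(fold_scores) / len(fold_scores) ** 0.5
```

Every sample is used for testing exactly once, so the averaged score draws on the whole dataset instead of a single held-out split.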

AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12555v1 Announce Type: new Abstract: Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, 2) large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. In this work, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). AudioX-Turbo follows a teacher-student paradigm. The teacher AudioX-Base is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis, and is then distilled into the few-step student AudioX-Turbo via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks, finding that our model achieves superior performance, especially on text-to-audio and text-to-music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines. These results demonstrate that our method is capable of audio generation under flexible multimodal control, showing efficient and powerful instruction-following capabilities. The code and datasets will be available at https://zeyuet.github.io/AudioX-Turbo/.

ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12556v1 Announce Type: new Abstract: The rapid shift toward agentic and long-context workloads in Large Language Models (LLMs) is pushing the industry beyond the capacity of individual servers toward disaggregated shared storage to handle TB-scale context states. This movement has led to the emergence of specialized shared context layers designed to externalize and share cumulative inference states across distributed clusters. While offloading to a data processing unit (DPU) within just-a-bunch-of-flash (JBOF) architectures accelerates NVMe-over-fabrics (NVMe-oF) target processing, the need for sophisticated software-level optimization and the cost-efficiency burden remain significant. Consequently, the ideal architecture for scaling this shared context infrastructure is still an active area of exploration. In this paper, we propose ITME (Inference Tiered Memory Expansion), which leverages a CXL-hybrid memory to present a massive, TB-scale byte-addressable remote memory expansion. This approach enables cost-efficient scaling and simplifies the software stack through direct byte-addressability, effectively addressing the challenges of shared context infrastructure. Our key insight is that the deterministic access patterns of voluminous model weights and prefix caches enable the system to proactively manage data movement across the memory-storage hierarchy. We validate ITME by evaluating its performance potential with production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, while further demonstrating its functional feasibility through an FPGA-based hardware prototype. Overall, ITME enhances conventional CPU-offloading by providing additional remote memory expansion to accommodate large KV cache footprints beyond host memory limits, achieving up to a 35.7% throughput improvement.

HairPort: In-context 3D-aware Hair Import and Transfer for Images

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12562v1 Announce Type: new Abstract: Transferring hairstyles between images is an important but challenging task in computer graphics, computer vision, and visual effects. It enables users to explore new looks without physically altering their hair, with applications in virtual try-on systems, augmented reality, and entertainment. Most prior works operate best under small pose gaps, and they fall short under large viewpoint and scale differences, where missing hair content must be synthesized rather than transferred. We propose HairPort, a 3D-aware hairstyle transfer framework that attempts to solve these issues by explicitly separating hair removal from transfer and enforcing geometric consistency before synthesis. We introduce a Bald Converter, which produces realistic bald versions of faces through LoRA-based in-context adaptation of FLUX.1 Kontext. To train our Bald Converter, we introduce a new dataset, Baldy, containing 6,000 paired bald and original images across diverse identities and conditions. We also use a 3D-Aware Transfer Pipeline that reconstructs and re-renders the reference hairstyle from the target viewpoint before compositing it onto the source image. Being 3D aware, our method supports large pose and scale discrepancies between the source and target. Finally, a conditional flow-matching generator synthesizes the transferred result from the bald source and geometry-aligned reference guidance. Together, our method enables accurate, pose-consistent, and identity-preserving hairstyle transfer, outperforming existing methods both qualitatively and quantitatively.

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12563v1 Announce Type: new Abstract: Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems operate on isolated targets with stateless evaluation. Arbor instead maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, treating failures as diagnostic signal that reshapes subsequent exploration, and expanding as prior successes shift the bottleneck distribution. We validate Arbor on full-stack LLM inference optimization, a domain where achieving peak performance has historically required coordinated effort from engineering teams across the application, framework, compiler, kernel, and hardware stack. Arbor pairs an Orchestrator agent, which drives optimization by delegating to Domain Specialists across the inference stack, with a Critic agent that safeguards stability through root-cause analysis, introspection, and measurement validation -- a checks-and-balances architecture where neither agent can unilaterally drive the system. Agent capabilities are decomposed into hard skills (domain expertise) and soft skills (coordination protocols that determine how contributions compose), enabling fully autonomous multi-day campaigns. Arbor achieves up to 193% inference throughput-latency Pareto improvement over vendor-optimized baselines, while a single agent without the harness plateaus at +33% throughput improvement and crashes irrecoverably within hours. Arbor generalizes to multiple generations of hardware platforms, and run-to-run variance is within 2 percentage points, demonstrating that the method is hardware-agnostic and reproducible.

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

Tiziano Labruna, Guido Bertolini, Pietro Ferrazzi, Bernardo Magnini — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12569v1 Announce Type: new Abstract: We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Departments of Italian hospitals. The corpus, in its current version, is composed of approximately 4 million fully anonymized clinical notes, covering diverse phases of patient care during the stay in the emergency department. In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant for two patient situations in emergency departments, dyspnea and loss of consciousness. Items may take numerical (e.g., for blood saturation), categorical (e.g., for level of consciousness), binary (e.g., for presence of traumas), and mixed value types. The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a relevant gap in data able to support both the development and the use of Large Language Models in concrete medical applications. We describe the data collection protocol, the on-site anonymisation pipeline, corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark, and provide zero-shot baselines from Gemma-27B and MedGemma-27B. To the best of our knowledge, the EDEN dataset is the largest freely available corpus of clinical notes existing for the Italian language.

Structure-preserving discretization and fingering dynamics of a Cahn-Hilliard model for traction-driven digit morphogenesis

Marvin Fritz — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12574v1 Announce Type: new Abstract: We study a Cahn-Hilliard equation with anisotropic traction flux arising as a reduced continuum model of mechanically biased cell interactions in digit-forming organoids. For a regularized problem with strictly positive bounded mobility, we introduce a mixed finite element discretization based on an implicit-explicit treatment of the chemical potential. We prove existence of discrete solutions, establish exact mass conservation and a discrete energy inequality, and show convergence of the fully discrete approximations to a weak solution of the regularized problem. Numerical experiments illustrate the resulting dynamics and show the transition from classical coarsening to traction-induced fingering and protrusive growth. The computational study is complemented by mass and energy diagnostics, an energy-balance residual, fingering-onset and protrusion-count diagnostics, and a manufactured-solution convergence study.

High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12575v1 Announce Type: new Abstract: Few-step diffusion distillation has become increasingly mature for 4-8-step generation, yet pushing further to 2 steps remains challenging. In this work, we introduce Z-Image Turbo++, a high-quality 2-step image generation model distilled from the 8-step Z-Image Turbo teacher. Our method addresses the central bottlenecks of increased task difficulty and limited model capacity in 2-step generation through three simple but effective design choices tailored to this regime. First, we propose Distribution-Aligned Adversarial Learning, which uses teacher-generated images rather than external real images as real samples for GAN training, providing a more attainable and informative adversarial target. Second, we adopt Step-Decoupled Parameterization, assigning independent model parameters to the two denoising steps to better match their distinct capacity demands. Third, we perform End-to-End Training with Iterative Regularization, allowing the first step to receive gradients from final image quality while preserving a meaningful intermediate generation through an explicit step-1 loss. Together, these designs substantially narrow the quality gap between 2-step and 8-step generation in both qualitative and quantitative evaluations, highlighting the potential of carefully tailored distillation strategies for improving the quality-efficiency trade-off in few-step generation.

Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

Ishani Mondal, Javad Baghirov, Jordan Boyd-Graber — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12576v1 Announce Type: new Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned spatial grounding of figures in both automatic and human evaluation.

Cascading Smoothers for Multigrid

Robert I. Saye — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12577v1 Announce Type: new Abstract: Multigrid methods are among the most effective frameworks for solving large-scale sparse systems. However, achieving their hallmark linear scaling and rapid convergence crucially depends on an effective smoother algorithm, whose design is often highly problem-dependent. This paper develops a new approach, referred to as cascading smoothers due to their operation as an ordered sequence of single-step block-diagonal smoothers. Each level in the cascade is optimised to maximally damp the output of prior steps via a Frobenius norm minimisation of the corresponding error propagators. In particular, we develop an additive (resp., multiplicative) formulation analogous to Jacobi (resp., Gauss-Seidel). Applied within a standard multigrid V-cycle, we show they are remarkably effective across a wide array of problems, including finite difference, finite element, and discontinuous Galerkin discretisations applied to Poisson, elliptic interface, and Stokes systems as well as multiphase variants. In every case, cascading smoothers closely match or significantly outperform their optimally-damped classical counterparts, yet require no parameter tuning apart from a few discrete solver choices. Additionally, the approach is highly parallelisable and robust to geometric and operator complexities such as unstructured meshes and high-contrast coefficients.

MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

Mohammadreza Riyazat, Vian Lelo, Rameen Jafri, Yumna Khan, Abeer Badawi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12578v1 Announce Type: new Abstract: Mechanism-level drug-drug interaction (DDI) prediction requires identifying which enzyme or pharmacodynamic axis is implicated, in which direction, and with which evidence -- not merely whether two drugs interact. We introduce a reproducible mechanism-level DDI labelling and evaluation protocol with a structured 7-family/147-subtype taxonomy, leakage-safe cold-split protocols, and auditable reasoning metrics for evaluating pharmacological prediction beyond flat interaction classification. We propose a pipeline that produces a 7B reasoning model, MARD (Mirror-Augmented Reasoning Distillation), combining three training innovations: a single-token KL divergence on direction tag that ties the model's prediction, per-loss PRM-weighted DPO with programmatic hard negatives, and a leakage-safe mechanism-aware retrieval channel. Process-reward step labels are automatically verifiable against DrugBank-structured fields, requiring no human or LLM judges. On the April-2026 DrugBank release, our MARD-7B is the only system in a 32-system comparison whose accuracy survives drug-pair novelty, beating the best baseline by +13.9 pp and GPT-4o by +6.7 pp at ~1% of frontier API cost. Further analysis reveals an anti-memorisation signature where accuracy improves on rarely seen drugs, suggesting that the gain comes from structured pharmacological reasoning rather than drug-frequency memorisation. We release the corpus, DDI-PRM, retrieval index, and training code.

G-MAPP: GPU-accelerated Multi-Agent Planning and Perception for Reactive Motion Generation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12579v1 Announce Type: new Abstract: Reactive motion generation in unstructured environments remains an open challenge in robotics. Due to the computational complexity of collision-free motion generation, existing methods either generate global trajectories for static scenarios, or employ models that make conservative assumptions about the environment. This paper identifies the primary bottleneck as the runtime performance demand of planning on high-fidelity environments, and the temporal integration between the perception and planning modules. Therefore, we propose a framework that does not compromise on runtime performance and world representations for perception and planning by accelerating world modeling and vector-field based planning using the GPU. This allows us to achieve faster parallel state exploration for quasi-global trajectory planning, and tighter coupling of the perception-action loop in real-time for dynamic cluttered environments with off-the-shelf depth sensors. We quantitatively evaluate the computation-time and success rate differences for the CPU and GPU versions of our planner, and perform qualitative evaluations of our coupled framework using real-world experiments on a 7-DoF Franka Emika robot. Experimental results demonstrate that our GPU-based framework achieves up to a 5x speedup over the CPU version and successfully avoids collisions across both trivial and challenging physical world scenarios.

Graph Reduction in Multirelational Networks: A Spreading-Oriented Reduction Benchmark

Mateusz Stolarski, Michał Czuba, Piotr Bielak, Piotr Bródka — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12581v1 Announce Type: new Abstract: Real-world networks are inherently incomplete, noisy, and dynamically evolving, making it difficult to capture all actors and their relationships. Their scale often renders direct analysis computationally demanding. While influence maximisation (IM) has been widely studied, the role of graph reduction as a preprocessing step, and its impact on IM accuracy, remains underexplored. In this work, we introduce the Spreading-Oriented Reduction Benchmark (SORB), an open-source, standardised framework for systematically evaluating IM models across diverse task settings. SORB provides an extensible pipeline operating on a representative collection of real-world networks, including single- and multilayer structures, and incorporates graph reduction directly into the evaluation process. This design shifts the focus from analysing IM algorithms in isolation to quantifying how graph reduction alters predictive performance. Using SORB, we study the effects of sparsification and coarsening across multiple IM scenarios. Our results show that the impact of reduction is strongly dependent on both the network type (single-layer vs. multirelational) and the downstream task ($Gain@k$ vs. $\mathrm{AUC}_{\mathrm{cutoff}}$): sparsification preserves seed set quality on single-layer networks, whereas flattened multilayer networks exhibit systematic ranking degradation regardless of reduction strategy. These findings highlight the importance of reduction-aware, multi-task evaluation when studying spreading processes in complex networks.

Beyond Attack Success Rate: Examining Trigger Leakage in Vision-Language Agentic Systems

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12586v1 Announce Type: new Abstract: Vision-Language Agentic Systems (VLAS) connect visual perception to planning, tool use, and physical actions. This means backdoor-type triggers can propagate through both decision pipelines and their connected interfaces, thus making visual backdoors a system-level threat. Current evaluations on such backdoors focus on clean accuracy and attack success rate (ASR), metrics that capture whether a trigger works, but not whether an attack is actually "precise" -- i.e. whether it triggers hidden behaviors only when intended. In this work, we formalize the failure of trigger precision as "trigger leakage": inputs that are visually or semantically close to the intended trigger and therefore inadvertently activate the attacker-specified behavior. To quantify this leakage, we introduce Neighbor Leakage Rate (NLR). Our experiments show that at a 3% poisoning ratio, icon and text triggers remain robust to common visual transformations, but their neighboring variants leak heavily, with NLR reaching 0.996 (icon) and 0.944 (text). Using textual triggers as a controlled probe, we show that standard fine-tuning learns a broad activation region rather than an exact trigger condition, causing neighboring strings to invoke the malicious behavior even when the exact trigger is absent. Adding edit-distance-one hard-negative samples during training substantially narrows this activation region and reduces leakage, including in image-editing and embodied-manipulation workflows, where leaked triggers can propagate into executable programs and action sequences.
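
To make the edit-distance-one idea concrete for text triggers, here is a minimal sketch that enumerates every string at Levenshtein distance exactly one from a trigger and measures how many of them fire a detector. The abstract does not define NLR formally, so `neighbor_leakage_rate` below reflects one plausible reading (the fraction of neighbors that activate the behaviour); both function names and the `activates` callback are hypothetical, not the authors' code.

```python
import string


def edit_distance_one_neighbors(trigger, alphabet=string.ascii_lowercase):
    """Return all strings at Levenshtein distance exactly 1 from `trigger`."""
    neighbors = set()
    n = len(trigger)
    for i in range(n):
        # Deletion of the character at position i.
        neighbors.add(trigger[:i] + trigger[i + 1:])
        # Substitution of position i by a different character.
        for ch in alphabet:
            if ch != trigger[i]:
                neighbors.add(trigger[:i] + ch + trigger[i + 1:])
    for i in range(n + 1):
        # Insertion of one character before position i (or at the end).
        for ch in alphabet:
            neighbors.add(trigger[:i] + ch + trigger[i:])
    return neighbors


def neighbor_leakage_rate(activates, neighbors):
    """Assumed definition: share of neighbor strings for which `activates` is True."""
    neighbors = list(neighbors)
    if not neighbors:
        return 0.0
    return sum(1 for s in neighbors if activates(s)) / len(neighbors)


if __name__ == "__main__":
    trigger = "open"
    nbrs = edit_distance_one_neighbors(trigger)
    # An exact-match detector never leaks; a prefix-based one leaks on some neighbors.
    print(neighbor_leakage_rate(lambda s: s == trigger, nbrs))
    print(neighbor_leakage_rate(lambda s: s.startswith("op"), nbrs))
```

The same neighbor set could serve as the hard-negative pool the abstract mentions, with each neighbor paired with the benign behaviour during fine-tuning.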

Strategic Decision Support for AI Agents

Shayan Kiyani, Sima Noorani, George Pappas, Hamed Hassani — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12587v1 Announce Type: new Abstract: Traditionally, decision support studies how humans use machine learning models to make better decisions. In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools become support mechanisms around them. This role reversal brings reliability concerns to the forefront, since agentic errors can be consequential and agent behavior must remain aligned with human goals and constraints. Departing from the classical view of decision support, we revisit its two basic principles, the cost--value tradeoff of seeking support and the role of uncertainty quantification, in a setting where AI agents are the central actors. We propose a framework for strategic decision support for AI agents through an optimization problem that minimizes support usage subject to controlling a counterfactual missed-support error: the probability that the agent acts alone on instances where support would have materially improved its output. At the population level, we show that the optimal policy is a threshold rule on the value of support. Building on this structure, we develop an online algorithm that adaptively thresholds such a score and uses randomized exploration to control missed-support error without distributional assumptions. We further introduce a calibration-on-the-fly method that reduces unnecessary support calls online. We instantiate this framework across diverse scenarios, including information gathering, human--AI collaboration, and tool use, showing how each can be modeled through the same strategic decision-support lens. Experiments across these settings show that our method reliably controls the target error while substantially reducing support usage in practice.
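
The threshold structure described in this abstract can be illustrated with a small offline sketch: given a score estimating the value of support and labels saying whether support would have helped, pick the largest threshold whose empirical missed-support rate stays within a budget `alpha`. This is not the paper's online algorithm, which uses randomized exploration and needs no such labels up front; all names here are hypothetical, and treating the error as a fraction of all instances is an assumption.

```python
def threshold_policy(scores, threshold):
    """Seek support exactly when the estimated value of support is >= threshold."""
    return [s >= threshold for s in scores]


def missed_support_rate(seek, helps):
    """Fraction of all instances where the agent acted alone although support would have helped."""
    n = len(seek)
    if n == 0:
        return 0.0
    missed = sum(1 for s, h in zip(seek, helps) if not s and h)
    return missed / n


def largest_safe_threshold(scores, helps, alpha):
    """Largest observed score usable as a threshold with missed-support rate <= alpha."""
    if not scores:
        raise ValueError("need at least one scored instance")
    best = min(scores)  # Seeking support everywhere gives zero missed-support error.
    for t in sorted(set(scores)):
        # Raising the threshold can only increase the miss rate, so the last
        # threshold that passes the check is the largest safe one.
        if missed_support_rate(threshold_policy(scores, t), helps) <= alpha:
            best = t
    return best


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.6, 0.9]
    helps = [False, True, True, True]
    t = largest_safe_threshold(scores, helps, alpha=0.25)
    print(t, threshold_policy(scores, t))
```

A higher threshold means fewer support calls, so choosing the largest safe threshold is the empirical analogue of minimizing support usage under the error constraint.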

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12590v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.

Characterizing Tests in IoT Software: Practices, Challenges and Opportunities

Rufeng Chen, Hengcheng Zhu, Wuqi Zhang, Zixu Zhou, Lili Wei — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12592v1 Announce Type: new Abstract: The Internet of Things (IoT) is experiencing rapid growth. Smart devices are emerging in smart homes and industrial applications, performing mission-critical tasks. Bugs in IoT software can lead to severe consequences. For example, a buggy smart lock can allow unauthorized access to a private property. Testing is a primary practice to expose software bugs and ensure software quality. However, little is known about how IoT software is tested. To bridge this gap, we conducted the first empirical study on test cases in open-source IoT software. Specifically, we evaluated the effectiveness of test cases in IoT software, explored the challenges inherent in testing IoT software, and analyzed the usage of mock objects. Our results indicate that while IoT software often contains a considerable number of tests, their effectiveness remains limited. We identified the primary challenges in testing IoT software as managing complex interactions with various external dependencies, such as other network-reliant IoT components, file systems, operating systems, and databases. We also observed that the use of mock objects in IoT software closely aligns with our identified testing challenges. This alignment demonstrates the potential of mocking as a solution to enhance test coverage and address the complexities of IoT software testing.

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12594v1 Announce Type: new Abstract: Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement's surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.

Emerging Flexible Designs for Geospatial Multimodal Foundation Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12595v1 Announce Type: new Abstract: Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading FM architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.

Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation

Zahra Habibzadeh, Paria Khoshtab, Amir Mesbah, Yadollah Yaghoobzadeh — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12599v1 Announce Type: new Abstract: Transforming a dense, abstract proverb into an engaging and morally faithful narrative requires deep cultural understanding and robust semantic grounding. We frame this problem as a constrained semantic decompression task and study proverb-conditioned story generation as a testbed for abstraction-to-realization in large language models (LLMs). Focusing on Persian, we introduce the Proverb Aligned Narrative Dataset (PAND), pairing proverbs with human-written stories and explicit meanings. Using a hybrid evaluation framework that combines human-calibrated LLM-as-a-Judge with structural metrics, we analyze model behavior across multiple prompting regimes. Our findings reveal a persistent decompression gap: current LLMs often achieve strong surface-level fluency while failing to faithfully instantiate the underlying moral and causal structure encoded in proverbs. We further show that explicit reasoning and iterative refinement can partially mitigate these failures, suggesting that many decompression errors arise from difficulties in translating abstract meaning into narrative form rather than a complete lack of relevant knowledge. Our proposed task naturally extends to other forms of compressed cultural knowledge.

Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

Sieu Tran, Duc Nguyen, Hao Vo, Khoa Vo, Ngan Le — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12601v1 Announce Type: new Abstract: Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations without supervision. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion. First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence. We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations. The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction. Code and models will be made publicly available upon acceptance.

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

Honglin He, Zhizheng Liu, Yukai Ma, Bolei Zhou — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12603v1 Announce Type: new Abstract: Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electronic wheelchairs. Unlike autonomous driving on the road, long-horizon sidewalk navigation requires precise maneuvering through unpredictable sidewalk terrains and pedestrians, with a lightweight perception stack as minimal as a single monocular RGB camera. While imitation learning (IL) from demonstrations offers a practical solution, the resulting autopilot policy often suffers from compounding errors, a lack of social compliance on sidewalks, and deficiencies in counterfactual reasoning to handle complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. We first propose to use anchored flow matching as an action representation for policy pre-training on large-scale robot fleet data and to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we further design a human-in-the-loop preference learning scheme to tune the policy on a small amount of human intervention data. It strengthens the model's counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves 42% success rate and 66% route completion in simulation, while FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12604v1 Announce Type: new Abstract: Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but directly using them for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable action. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video replacing human with robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations. Project website: https://egoengine.github.io.

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12608v1 Announce Type: new Abstract: Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks. We introduce the Shopping Reasoning Bench, an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10,863 importance-weighted binary rubrics authored by retail domain experts. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that pass rates reach only 57-77% overall. On multi-turn missions, all models score 13-29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades 4-18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.

Viral Proteins Reveal Geometry of Protein Language Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12609v1 Announce Type: new Abstract: Protein language models are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this, protein language model embeddings retain viral-specific signal: viral proteins remain linearly separable beyond zero-shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.

The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter

Miquel Noguer i Alonso, David Pacheco Aznar — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12610v1 Announce Type: new Abstract: Two major periods of reduced funding and confidence in artificial intelligence research, commonly called the first and second AI winters, are usually explained through engineering failure, commercial disappointment, and inflated expectations. This article develops a complementary thesis: that the dominant paradigms of those periods also met genuine formal barriers, including limitations of representation, optimisation, computational complexity, statistical learnability, and high-dimensional approximation. The contribution is synthetic rather than archival. We do not claim that particular theorems mechanically caused the winters; rather, we show that several central disappointments of early AI were aligned with mathematically precise bottlenecks. We analyse these bottlenecks through the perceptron impossibility results of Minsky and Papert, the complexity-theoretic hardness of exact neural-network training established by Blum and Rivest, minimax rates for nonparametric estimation in high dimension due to Stone, vanishing-gradient analyses by Hochreiter and by Bengio and collaborators, and classical statistical learning theory in the tradition of Vapnik and Chervonenkis, Valiant, and Blumer and collaborators. We then relate these barriers to the later breakthroughs that mitigated, rather than eliminated, them.

Evaluation of AutoML Frameworks for IDS under Imbalanced Data Conditions of the NSL-KDD Dataset

Wiliane Carolina Silva, Evandro César Vilas Boas, Felipe A. P. de Figueiredo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12611v1 Announce Type: new Abstract: This work investigates the impact of severe class imbalance on the performance of automated machine learning (AutoML) frameworks for multiclass network intrusion detection using the NSL-KDD dataset. Unlike previous studies that simplify the problem through binary classification or minority-class removal, we preserve the original five-class distribution, including highly underrepresented attacks such as R2L and U2R, enabling a realistic evaluation of imbalance-sensitive learning behavior. Nine open-source AutoML frameworks were analyzed under a unified and reproducible experimental protocol, considering differences in architectural design, ensemble strategies, validation procedures, hyperparameter optimization, and imbalance-handling mechanisms. The results demonstrate that frameworks incorporating ensemble learning and imbalance-aware optimization achieve better minority-class discrimination. PyCaret obtained the best overall performance, reaching 66% macro-F1, followed by AutoGluon with 55%, whereas frameworks lacking native balancing support exhibited significant degradation in minority-class detection capability. The analysis further shows that accuracy-oriented optimization alone is insufficient for highly imbalanced IDS scenarios, since high-weighted metrics may coexist with poor generalization on rare attack categories. As a contribution, this work establishes a standardized benchmark for AutoML-based intrusion detection under severe multiclass imbalance, highlighting current architectural limitations and the need for native integration of imbalance-aware optimization, resampling, and stratified evaluation strategies into automated learning pipelines. The source code is publicly available.
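
To illustrate the abstract's point that accuracy can stay high while rare attack classes go undetected, here is a minimal sketch comparing accuracy with macro-F1. The class names follow NSL-KDD, but the label counts and the always-"Normal" predictor are invented for illustration and are not results from the paper.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero (also avoids 0/0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented toy data: a heavily imbalanced label set and a classifier
# that always predicts the majority class.
y_true = ["Normal"] * 90 + ["DoS"] * 6 + ["R2L"] * 3 + ["U2R"] * 1
y_pred = ["Normal"] * 100

print("accuracy:", accuracy(y_true, y_pred))
print("macro-F1:", round(macro_f1(y_true, y_pred), 3))
```

In this toy case the majority-class predictor is right 90% of the time, yet its macro-F1 stays below 0.25 because three of the four classes present score zero.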

DARRMS -- An Efficient Algorithm for Dynamic Attention Radius in Resource-Constrained Multi-Agent Systems

Benjamin Alcorn, Eman Hammad — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12614v1 Announce Type: new Abstract: Multi-agent systems are integral tools for various domains such as robotics, cybersecurity, and autonomous vehicle planning. Such systems often face constraints on computational resources, leading to a need for efficient, lightweight algorithms. Traditional decision-making frameworks often assume ideal conditions, such as full observability and unlimited computational capacity, which do not align with real-world challenges. In this paper, we introduce a new algorithm that reduces demand on computational resources without a large cost to other performance metrics. Agents limit their observability to some attention radius, which intentionally allows them to ignore parts of the environment that might be unnecessary for action planning. By optimizing both the attention radius and decision-making, our approach enhances coordination and scalability in uncertain environments. Through both theoretical analysis and empirical validation, we demonstrate the effectiveness of adaptive observation in improving system performance and maintaining robust decision-making strategies in resource-constrained systems.
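
The idea of restricting each agent's observations to an attention radius can be sketched as follows. This is an illustrative sketch only: the function names, the 2D positions, and the budget-based adjustment rule are assumptions made here, not the DARRMS algorithm, whose details the abstract does not give.

```python
import math


def visible_neighbors(agent_pos, others, radius):
    """Return the positions in `others` that lie within `radius` of the agent."""
    return [p for p in others if math.dist(agent_pos, p) <= radius]


def adapt_radius(radius, n_visible, budget, step=0.5, r_min=1.0, r_max=10.0):
    """Hypothetical heuristic: shrink the radius when the agent sees more
    neighbors than its compute budget allows, grow it when it sees fewer."""
    if n_visible > budget:
        radius -= step
    elif n_visible < budget:
        radius += step
    return min(r_max, max(r_min, radius))


# Example: one agent at the origin with three other agents around it.
others = [(1.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
seen = visible_neighbors((0.0, 0.0), others, radius=5.5)
new_radius = adapt_radius(5.5, len(seen), budget=1)
print(seen, new_radius)
```

The agent only processes the neighbors returned by `visible_neighbors`, so the cost of its planning step scales with the number of agents inside the radius rather than with the whole population.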

Towards Provably Fair Machine Learning: Bayesian Approaches For Consistent and Transparent Predictions

Owen O'Neill, Fintan Costello — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12615v1 Announce Type: new Abstract: ML classifiers deployed in high-stakes domains produce predictions whose quality varies systematically across subgroups. For granular subgroups defined by intersections of multiple features, predictions are often inconsistent with the observed data: the model's outputs contradict the evidence available for that subgroup. This problem is exacerbated by regularisation, which improves aggregate performance by collapsing small subgroups into larger groups, disproportionately affecting demographic minorities. We define two requirements for consistent prediction: determinism (identical individuals receive identical predictions) and statistical consistency (we cannot reject, at significance level alpha, the hypothesis that the predictions for a subgroup were drawn from the Bayesian optimal target distribution inferred for that subgroup). From these requirements we derive the Fair Bayesian classifier, which enforces both across every group and subgroup simultaneously and abstains whenever no consistent deterministic prediction is possible. On three benchmark datasets (Adult, COMPAS, and Bank Marketing), standard classifiers produce statistically inconsistent predictions for a substantial proportion of subgroups. Our classifier achieves zero consistency error by construction while exceeding baseline accuracy and multicalibration on every dataset tested. Statistical consistency provides a principled foundation for prediction quality with direct implications for algorithmic fairness. Minority demographics are disproportionately concentrated in small subgroups, precisely where frequentist inference is least reliable; addressing this inference problem is therefore a necessary step toward fair ML. By enforcing Bayesian consistency at the finest resolution the data supports, our classifier demonstrates that exhaustive subgroup fairness with principled abstention is achievable in practice.

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12616v1 Announce Type: new Abstract: Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Alan Cooney, David Africa, Geoffrey Irving — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12618v1 Announce Type: new Abstract: Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret. We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations. On these testbeds we evaluate four detectors: a chain-of-thought judge, a logprob classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes. On prompted lying, across 31 open-weight models spanning 2B to 1T parameters, all four detectors show positive scaling with model capability. However, every activation- and logprob-based detector drops sharply on our trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong, with 0.82 balanced accuracy, partly as an artefact of our verification process favouring CoT-readable beliefs. Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we suggest research directions that may address some of their current limitations. We release our datasets, model organisms, and trained detectors.

HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection

Luke Patterson, Li Wang, Adam Faulkner — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12620v1 Announce Type: new Abstract: Thanks to the rapid adoption of AI code assistants powered by large language models (LLMs), industry codebases are, increasingly, a hybrid of AI- and human-authored code. For risk management and productivity analysis purposes, it is crucial to enable fine-grained location detection of AI-generated code. To develop algorithms for this task, quality benchmarks are needed to assess performance. However, existing benchmarks tend to comprise academic, LeetCode-style problems and presume a code snippet is either completely human-authored or completely AI-authored, which is not reflective of the diverse intents and styles of industry codebases utilizing AI code assistants. To fill these gaps, we introduce HybridCodeAuthorship, a novel benchmark of Python code files with interleaved human- and AI-authored lines of code to simulate authentic utilization of AI code assistants. In this paper, we first present our dataset construction pipeline, which leverages CodeSearchNet, a massive collection of links to open-source repositories on GitHub. We then benchmark the performance of two state-of-the-art AI-generated code detection algorithms at both the line and chunk levels. Experimental results demonstrate that HybridCodeAuthorship is a challenging benchmark, with the top-scoring algorithm, AIGCode Detector, reaching F1 scores of 0.48 and 0.56 on the chunk-level and line-level code detection tasks, respectively.

Context-Aware Feature-Fusion for Co-occurring Object Detection in Autonomous Driving

Binay Kumar Singh, Niels Da Vitoria Lobo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12628v1 Announce Type: new Abstract: Object detection in autonomous driving requires precise localization and an inherent understanding of the relational context between co-occurring objects. In extremely complex, heterogeneous environments, rare classes, small-scale objects, and frequently appearing objects are difficult for standard object detection frameworks to handle. In this paper, we propose a novel framework called Context-Centric Feature Fusion (CCFF), which uses two attention-based modules. The Local Context Fusion Module (LCFM) uses RoI-to-RoI self-attention to resolve spatial interactions, mainly for small and partially obscured objects, while the Global Context Attention Module (GCAM) encodes object co-occurrence priors by pooling top-K RoI features into a global context attention token, avoiding the computational overhead of pixel-level global pooling. This fusion of local and object-centric global features yields contextualized embeddings that enhance classification results and the detection of co-occurring objects. We evaluate our method on two datasets, Cityscapes and BDD100K, where it shows significant improvement in relational consistency, achieving a Category-level Consistency Strategy (CCS) score of 0.973 and 0.969, respectively. Furthermore, our approach produces substantial gains in small object detection (AP_S: 14.1%) and successfully recovers rare classes such as "Train" that are typically lost in large distributions. Our efficiency report shows that the framework processes images in real time with a 0.2 FPS overhead. The code is available at https://github.com/BinayKSingh/CCFF.

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Varun Reddy Nalagatla — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12629v1 Announce Type: new Abstract: We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs and confidence via their magnitudes, functioning as independent binary registers. We validate this Bag of Dims framework across three model families (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B) through four progressive experiments. Sign patterns alone carry predictive content: replacing all magnitudes with unity achieves 72-93% top-5 next-token accuracy through the LM head, and pure Hamming scoring without any decoder reaches 80-90% top-4096. These sign patterns organize into semantic features: using a single-token type cache (one forward pass per vocabulary token, no context), we discover 175 categories via per-dimension sign consistency (mean AUC 0.80) from 50 anchors with zero training. A trained probe adds only +0.018 AUC and converges to axis-aligned weights, confirming negligible cross-dimension structure. This structure extends to attention: all 175 categories remain discoverable in K and V projections. On the write side, static FFN weight inspection links 20% of features to individual writer neurons (>0.70 agreement; random controls: 0%), with top-200 neuron coalitions achieving >0.70 agreement on 99.9% of prototypes via majority vote. Fully unsupervised discovery (random seeds, no labels) scales to 1500 features at 100% yield and 99% sparsity across all three models, with pairwise MI of 0.0014 bits confirming low inter-dimension coupling. These results establish that the standard basis already suffices for feature reading throughout the transformer compute pathway, requiring no training, no optimization, and no GPU-days beyond a single forward pass per vocabulary token.
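
The sign-pattern reading described in the abstract can be illustrated with a small sketch: reduce each vector to the signs of its dimensions and score candidates by how many signs agree (a Hamming-style score). The reference vectors, the names `cat` and `dog`, and the choice to compare a hidden state against one stored vector per candidate are assumptions made for illustration; the abstract does not specify the paper's exact scoring procedure.

```python
def sign_pattern(vec):
    """Map each dimension to 1 (non-negative) or 0 (negative), discarding magnitude."""
    return [1 if x >= 0 else 0 for x in vec]


def hamming_agreement(a, b):
    """Number of positions where two binary patterns agree."""
    return sum(x == y for x, y in zip(a, b))


def rank_by_hamming(hidden, candidates):
    """Rank candidate names by sign agreement with the hidden state.

    `candidates` maps a name to a reference vector (hypothetical inputs)."""
    h = sign_pattern(hidden)
    scores = {name: hamming_agreement(h, sign_pattern(v))
              for name, v in candidates.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Toy 4-dimensional example with made-up numbers.
hidden = [0.3, -1.2, 0.8, -0.1]
candidates = {
    "cat": [1.0, -0.5, 0.2, -2.0],   # same signs as `hidden` in all 4 dims
    "dog": [-1.0, 0.5, 0.2, -2.0],   # same signs in only 2 dims
}
print(rank_by_hamming(hidden, candidates))
```

No weights are learned anywhere in this sketch, which is the sense in which such a readout is training-free.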

The Switching Lemma shows what the Switching Lemma cannot prove: an unconditional natural-proofs barrier

Bruno Loff, Suhail Sherif, Navid Talebanfard, Francesca Ugazio — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12631v1 Announce Type: new Abstract: Razborov and Rudich (JCSS'97) observed that all known lower-bound proofs follow a certain pattern: when showing that a function $F$ is hard, along the way the proof provides us with a distinguisher, namely, an efficient algorithm which can distinguish easy functions from random functions. They called such lower-bound proofs natural proofs. They then showed a natural-proofs barrier: under standard cryptographic assumptions, natural proofs cannot show superpolynomial lower bounds against Boolean circuits. Along similar lines it can be shown that under a suitable cryptographic assumption, natural proofs cannot significantly improve the current state-of-the-art lower bound against constant depth circuits (AC0). The state of the art, using Håstad's Switching Lemma (SL), is $2^{n^{1/(d-1)}}$ for depth-$d$ circuits, and (conditionally) no natural proof can prove lower bounds of $2^{n^{c/d}}$ for some large constant $c$. In this paper we revisit the natural-proofs barrier from an unconditional perspective. We focus on AC0-natural proofs, i.e. proofs whose distinguishers are computable by AC0 circuits. Razborov and Rudich observed that lower bounds based on SL are AC0-natural. We show that this is true for most known lower-bound techniques against constant-depth circuits. We then establish an unconditional barrier for such proofs. By localizing the Trevisan--Xue pseudorandom generator, we are able to show that no AC0-natural proof can prove a lower bound greater than $2^{n^{7/(d-5)}}$ against depth-$d$ circuits. This is in the same quantitative regime as the SL frontier which instead has $1/(d-1)$ in the power of $n$. The proof has a striking self-referential aspect: the proof of security of the Trevisan--Xue generator crucially relies on SL, and so SL has been used to show that AC0-natural proofs, such as SL itself, cannot prove AC0 lower bounds better than that of SL.

Hyperbolicity-Preserving Stochastic Galerkin Methods for Conservation Laws Based on Associative Truncated Products on Polynomial Spaces

Haroun Meghaichi, Yulong Xing — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12632v1 Announce Type: new Abstract: Stochastic Galerkin discretizations of nonlinear hyperbolic conservation laws may lose hyperbolicity because the standard pseudospectral product is generally nonassociative, leading to non-commuting blocks in the flux Jacobian matrix. We develop a novel framework for constructing hyperbolicity-preserving stochastic Galerkin systems based on associative truncated products on polynomial spaces. In one stochastic dimension, we characterize associative truncated products through a single polynomial datum and identify examples with useful symmetry, positivity, and spectral properties, including collocation products and an associative symmetric product based on Gaussian quadrature nodes. We prove a consistency result showing that, under suitable projection-error assumptions, these products converge to the classical product as the polynomial degree grows. For systems with rational fluxes, we derive sufficient conditions under which the resulting stochastic Galerkin flux remains hyperbolic on the corresponding admissible set. Applications to the one-dimensional isothermal and compressible Euler equations show accurate statistical approximation and robust hyperbolicity preservation of the computed stochastic Galerkin states.

ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12633v1 Announce Type: new Abstract: Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks. To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA's performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at https://github.com/Snowball0823/ECA.

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12634v1 Announce Type: new Abstract: Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $\tau^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $\tau^3$-airline pass@1 $0.583 \to 0.602$.
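
The abstract's last mechanism, bounded credit weights that reshape GRPO token advantages, can be sketched roughly as follows. The group-relative normalization is the standard GRPO form; the multiplicative reshaping, the clipping bounds, and all function names are assumptions made here for illustration, since the abstract does not give SGCD's exact formula.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize each rollout's reward
    by the mean and standard deviation of its sibling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def reshape_token_advantages(advantage, credit_weights, low=0.5, high=1.5):
    """Hypothetical reshaping: scale one rollout's trajectory-level advantage
    per token by a credit weight clipped to [low, high].

    The weights are plain floats here, i.e. constants with no gradient,
    which plays the role of 'detached' in an autodiff framework."""
    return [advantage * min(high, max(low, w)) for w in credit_weights]


# Four sibling rollouts of one task: two verified successes, two failures.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Made-up per-token credit weights for the first (successful) rollout.
token_advs = reshape_token_advantages(advantages[0], [0.2, 1.0, 3.0])
print(advantages, token_advs)
```

Because the weights only rescale the sign-preserving advantage within fixed bounds, the policy-gradient signal from the verifier still determines the direction of every token update, which matches the abstract's framing of distillation as credit assignment rather than a competing loss.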

CD-RCM: Generalizable Continuous-Depth Novel View Synthesis for Reflectance Confocal Microscopy

Tooba Imtiaz, Milind Rajadhyaksha, Kivanc Kose, Jennifer Dy — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12635v1 Announce Type: new Abstract: Reflectance confocal microscopy (RCM) provides noninvasive, cellular-resolution "optical biopsies" of human skin in vivo by acquiring en-face images at successive depths, forming a sparse z-stack. Due to optical limitations, these stacks are anisotropic 3D volumes whose lateral resolution (0.5 µm) is about 6 times finer than the axial resolution (3 µm), which is defined by the optical sectioning; this anisotropy limits the interpretation of tissue. Our goal is to provide continuous-depth visualization by interpolating intermediate sections and making the 3D volume isotropic. Such a representation permits arbitrary-direction sectioning, including histopathology-like cross-sectional examination, without requiring per-patient optimization. To that end, we introduce the first RCM-specific novel-view synthesis (NVS) approach, CD-RCM, a feedforward model that predicts realistic, unseen depths from sparsely sampled RCM stacks. Classical neural rendering methods focus on reconstruction from surface-level multi-view observations. In contrast to surface-level camera views, RCM can acquire optically sectioned en-face images of tissue beyond the surface, up to 200 µm deep. However, during visualization of the RCM stacks, observations of the shallower sections (towards the surface) obscure the deeper ones. This unique axial imaging geometry and layer-dependent anatomical organization motivated our development of a tailored architectural and training framework that explicitly accounts for RCM's depth-resolved, occlusive imaging physics. Experiments demonstrate that CD-RCM achieves high-fidelity novel-view synthesis with sub-second inference time.

Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12638v1 Announce Type: new Abstract: As distributed AI workloads grow in scale, multi-GPU systems have become essential for training large models. Although techniques like kernel fusion and overlapping communication with computation help reduce delays, they also introduce irregular and transient traffic patterns that are difficult to model using existing tools. These techniques rely heavily on fine-grained synchronization and peer-to-peer communication, which place significant pressure on interconnect bandwidth and latency. In this work, we introduce Eidola, a scalable extension to the gem5 simulation framework that enables detailed modeling of inter-GPU communication traffic. The extension is scalable as our GPU model serves as a succinct eidolon, emulating the minimal characteristics needed for traffic modeling. Eidola uses annotated timing profiles from real applications to emulate peer-to-peer GPU writes with cycle-level precision. This allows researchers to simulate and analyze synchronization behavior across large multi-GPU configurations. The simulator supports configurable per-GPU traffic patterns and enables isolated performance analysis under different communication scenarios. We demonstrate Eidola's effectiveness by reproducing variability in fused kernel execution and by implementing a SyncMon-inspired synchronization mechanism, confirming reductions in polling-related memory traffic. Our results show that Eidola provides a flexible and scalable platform for studying inter-GPU communication and supports architectural exploration in modern distributed GPU systems.

The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry

Dhruv Agarwal, Riya Bisht — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12639v1 Announce Type: new Abstract: Predicting how a cell's transcriptome responds to a drug it has never seen is a core, hard problem in computational cell biology: recent benchmarks show complex models often fail to beat trivial baselines once test compounds are held out by chemistry. We study one cell line and assay, THP-1 cells profiled by DRUG-seq, scored by the active-compound weighted MSE (wMSE) of the VCPI prediction contest. We propose a staged approach: dumb baselines (untreated control and mean training-compound response) that the field keeps failing to beat; non-parametric retrieval (a Tanimoto-weighted average of a held-out compound's nearest training compounds); and a fusion stage combining a frozen chemistry embedding with retrieval-support features to predict the residual over the mean, with an uncertainty head and gene programs. On the released VCPI THP-1 DRUG-seq data (14,026 training compounds), under a Bemis-Murcko scaffold split, the model ranking inverts depending on the metric. Under an inverse-variance per-gene proxy, a regularized linear regression on Morgan fingerprints appears to win over the deep models, retrieval, and ChemBERTa -- the textbook "simple baselines win" result. But under the contest's true active-set metric (per-(gene, compound) Mejia weights, validated against the official scorer; mean baseline 0.535 vs the organizers' 0.507 reference), that reverses: the deep models win, our fusion decoder significantly beats the linear fingerprint baseline (-0.012 wMSE, paired bootstrap p < 10^-4), and the proxy's winner becomes the worst chemistry-aware predictor. Picking the metric picks the winner -- to our knowledge the first demonstration on real held-out drug chemistry of the metric-calibration effect established largely on genetic perturbation. We release a reproducible pipeline wired to the official scorer that emits a valid submission over the real 1,064 x 12,995 grid.
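
The non-parametric retrieval stage, a Tanimoto-weighted average over a held-out compound's nearest training compounds, can be sketched as below. Fingerprints are represented as sets of active bit indices; the value of `k`, the toy data, and the fallback to the mean training response when nothing overlaps are assumptions made here, not details taken from the paper.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of active fingerprint bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieve_prediction(query_fp, train_fps, train_responses, k=2):
    """Predict a per-gene response as the similarity-weighted average of the
    k most similar training compounds."""
    sims = [(tanimoto(query_fp, fp), i) for i, fp in enumerate(train_fps)]
    top = sorted(sims, reverse=True)[:k]
    total = sum(s for s, _ in top)
    n_genes = len(train_responses[0])
    if total == 0:
        # Assumed fallback: no overlapping neighbor, use the mean training response.
        n = len(train_responses)
        return [sum(r[g] for r in train_responses) / n for g in range(n_genes)]
    return [sum(s * train_responses[i][g] for s, i in top) / total
            for g in range(n_genes)]


# Toy data: three training compounds, two genes, made-up responses.
train_fps = [{1, 2, 3}, {3, 4}, {7, 8}]
train_responses = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
pred = retrieve_prediction({1, 2, 3, 4}, train_fps, train_responses, k=2)
print(pred)
```

In the toy call, the query shares bits with the first two compounds (similarities 0.75 and 0.5) and none with the third, so the third compound's response does not influence the prediction.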

Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12640v1 Announce Type: new Abstract: Offline reinforcement learning allows control policies to be learned directly from data without online interaction, making it suitable for safety-critical tasks. Recent studies have applied diffusion models to offline reinforcement learning to leverage their strong capacity for modeling complex data distributions. However, existing approaches primarily focus on single-agent settings, leaving the safety challenges in multi-agent environments largely unexplored. In this work, we propose a safe offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into the diffusion model to enhance safety during trajectory generation, with control policies recovered through inverse dynamics. We evaluate our algorithm across diverse benchmarks, demonstrating substantial safety improvements while maintaining competitive rewards.

TEDD: Robust Detection of Unstable Temporal Features

Ricardo Ribeiro Pereira, Bruno Casal Laraña, Nádia Soares, Miguel Araújo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12643v1 Announce Type: new Abstract: When working with real-world temporal data, it is common to encounter features whose distribution changes over time. Naively applying Machine Learning models to this unstable data might lead to rapidly degrading performance, especially if the new distribution differs substantially from what was seen during training. To cope with this problem, it is critical to automatically identify features that are changing over time. With these features detected, data scientists and other practitioners will be able to mitigate the issue (for instance, by applying data transformations), deploying more robust models that retain high performance for longer periods of time. In this paper, we describe which temporal changes a feature should not suffer from, and propose TEDD, a technique to a) identify when a dataset might lead to an unstable Machine Learning model and b) automatically detect which features cause such lack of robustness. To achieve this, we leverage a regression model to highlight which features contribute to a good prediction of an instance's timestamp. We compare our approach to other methods on real and synthetic data, testing their detection capability on all simple change patterns. We show that our method: detects all types of basic changes, both for numerical and categorical features; can detect multivariate drifts; returns a comparable value measuring the amount of change of each feature; requires no parameter tuning; and is scalable in both the number of features and the number of instances in the dataset.

Token Complexity Theory for AI-Augmented Computing

Jie Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12647v1 Announce Type: new Abstract: AI-augmented computing delegates natural language queries, code generation requests, and other open-ended tasks to a cluster of AI models that processes queries and generates responses. This paradigm introduces a resource dimension that neither classical time nor space complexity captures: the cost of sending queries to and receiving responses from such a cluster. We introduce token complexity, a formal resource measure defined as the minimum expected token cost to achieve a specified level of output quality on a task, and develop a taxonomy classifying AI systems by the strength of their probabilistic properties. We develop token complexity within the framework of AI-Oracle Turing machines, in which a probabilistic Turing machine interacts with a stochastic oracle via dedicated query and response tapes. We prove basic theorems establishing that token complexity behaves as expected: monotonicity (higher quality costs more tokens), convexity (quality improvements become progressively more expensive), price sensitivity (small price changes produce bounded cost changes), and price-relativity of task ordering (the token complexity ordering of tasks can reverse depending on the query-to-response cost ratio). We prove that the complexity frontier, defined as the set of all feasible resource bounds in tokens, time, and space, is non-empty, upward-closed, and convex.

OpenRoundup: Multi-Table Data Wrangling Through Interactive Visualization

Stephen Kasica, Charles Berret, Tamara Munzner — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12648v1 Announce Type: new Abstract: Data journalists routinely integrate records across multiple independently published sources to support accountability reporting, yet no existing interactive wrangling tool treats the collection of tables -- rather than the single table -- as its primary unit of work. We present OpenRoundup, an open-source, browser-based system that enables data journalists to consolidate multiple tables into a single analysis-ready output without writing code. The interface comprises five coordinated panels that implement a schema-first, values-on-demand paradigm with live schema previews, ambient data quality alerts, and a recursive treemap visualization of the evolving operation tree. A client-only architecture powered by DuckDB-WASM runs in the browser, providing strong data privacy guarantees suited to sensitive journalism data. The system introduces two conceptual contributions: eager table consolidation, in which a composite table is assembled early in the wrangling phase via interactive, incremental assembly of multiple source tables; and a declarative vocabulary for table consolidation consisting of two operations, Stack and Pack. We evaluate the system through a replication study in which the authors reproduce 17 published journalist programming workflows using only the interface, and a deployment study with four professional data journalists. The replication study demonstrates expressive coverage of real-world consolidation tasks. The deployment study confirms utility for practitioners who understand joins conceptually but lack the programming skills to execute them, and surfaces an unanticipated secondary value for data journalism education.

MentalMARBERT: Domain-Adaptive Pre-training and Two-Stage Fine-Tuning for Arabic Mental Health Disorders Detection

Fatimah Almalki, Areej Alhothali, Lulwah Alharigy, Abdulrahman Aladeem — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12649v1 Announce Type: new Abstract: Detecting mental health disorders from Arabic social media text remains challenging due to dialectal variation, informal language, limited high-quality annotated resources, and severe class imbalance. While English mental health natural language processing (NLP) has progressed substantially, Arabic multi-class disorder classification remains insufficiently studied. This study proposes a two-phase framework for Arabic mental health text classification. In phase 1, three Arabic pre-trained language models, AraBERT, CAMeLBERT, and MARBERT, undergo Domain-Adaptive and Task-Adaptive Pretraining (DAPT and TAPT) using a large-scale corpus of unlabeled Arabic mental health tweets. The adapted models are evaluated under a unified protocol to identify the most effective backbone model. In phase 2, the selected model is assessed across four configurations combining single-stage and hierarchical two-stage classification architectures with full fine-tuning and Low-Rank Adaptation (LoRA). To support this study, we constructed a novel annotated Arabic mental health dataset comprising 50,670 tweets across six categories, with strong inter annotator agreement (Krippendorff's Alpha = 0.733, average pairwise agreement = 0.797). Experimental results show that the domain-adapted MARBERT (MentalMARBERT) achieves statistically significant improvements over baseline models in both accuracy and macro-F1. The hierarchical two-stage architecture combined with full fine-tuning achieves the best overall performance, reaching a macro-F1 of 0.861 and an accuracy of 0.877. These findings demonstrate the effectiveness of domain-specific adaptive pretraining and hierarchical classification for Arabic mental health disorder detection.

nomp: A Framework for Building Domain Specific Compilers

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12650v1 Announce Type: new Abstract: The low-level GPU programming models (CUDA, HIP, OpenCL, etc.) provide detailed control of the data flow and execution plan of a program in order to extract close-to-metal performance. However, these have a steep learning curve due to the intricacies of their syntax and semantics. This reduces programmer productivity. On the other hand, high-level models (OpenMP, OpenACC, etc.) that serve as abstractions over the low-level models are aimed at improving programmer productivity, but achieving performance on par with the low-level models is a challenge. There are inherent trade-offs between productivity, portability and performance in both approaches and there is no one-size-fits-all solution which achieves all three simultaneously. However, we believe there is room to improve programmer productivity without sacrificing performance and portability by reusing optimization patterns specific to a given domain. To this end, we propose nomp: a framework for building domain specific compilers. nomp consists of a pragma based programming model and a runtime capable of code transformation and generation based on user provided metadata.

Physics-Aware Auxiliary Losses Improve Out-of-Distribution Generalization of a GNN Synthesizability Filter

Riya Bisht, Dhruv Agarwal — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12651v1 Announce Type: new Abstract: Machine-learning drug-discovery pipelines increasingly rely on generative models that propose molecules far from the data used to train downstream synthesizability filters. Existing filters (SAScore, SCScore, RAscore, DeepSA) are purely statistical and degrade in exactly this out-of-distribution (OOD) regime. We ask whether cheap, closed-form physical priors, used as auxiliary supervision on a graph neural network (GNN), improve OOD generalization. We add two auxiliary losses to a GINE backbone: a topological complexity regression supervised by the Bertz index, and a strain-energy soft penalty supervised by MMFF94 force-field energy. On a 65,177-molecule corpus (HIV, Tox21, COCONUT) labeled by SAScore thresholds we reproduce a strong in-distribution baseline, then evaluate a 4-way ablation (baseline / +complexity / +strain / +both) on a single-source OOD split (train on drug-like HIV+Tox21, test on COCONUT natural products), repeated over 5 seeds with paired bootstrap confidence intervals. All three physics-aware variants give a small but statistically significant OOD improvement over the baseline (mean OOD AUC 0.9774): +complexity Delta = +0.0060 (95% CI [+0.0023, +0.0102]), +strain Delta = +0.0032 ([+0.0008, +0.0052]), +both Delta = +0.0066 ([+0.0038, +0.0093]); every interval excludes zero, and the combination is best. The variants are indistinguishable in-distribution, so the effect is visible only under OOD evaluation. We are explicit that the effects are modest, and we report a cautionary methodological finding: a single-seed version of this experiment produced a qualitatively different (non-monotone) story that did not survive multi-seed evaluation.
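
The paired bootstrap confidence intervals mentioned in this abstract can be sketched as follows. This is a generic percentile bootstrap over paired score differences, not the authors' evaluation code: the function name `paired_bootstrap_ci`, the number of resamples, and the choice of the percentile method are assumptions made for illustration.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for a - b.

    Pairs are resampled together, so the variation shared by both
    systems on the same item (or seed) cancels in the difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Draw n paired differences with replacement and average them.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi
```

An interval that excludes zero is the criterion the abstract uses to call an improvement statistically significant.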

Amnesia: A Stealthy Replay Attack on Continual Learning Dreams

Ahmed Sharshar, Naveen Kumar Kummari, Mohsen Guizani — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12655v1 Announce Type: new Abstract: Continual learning (CL) models often use experience replay to reduce catastrophic forgetting, but their robustness to replay sampling interference remains underexplored. Existing CL attacks alter inputs or training pipelines (poisoning/backdoors) and rarely include explicit auditable constraints, limiting realism. Here, auditability means a monitor can verify compliance from sampler-visible telemetry - e.g., logged replay index/label statistics - by checking that the realized replay class histogram stays close to a nominal baseline and that replay rate is unchanged per batch and/or over a rolling window. We study a limited-privilege insider who controls only replay index selection, not pixels, labels, or model parameters, while staying within auditable limits such as queue priorities. We introduce Amnesia, a replay composition attack that maximizes degradation under two budgets: a visibility budget delta bounding the TV/KL divergence from a nominal class histogram p0, and a mass budget f fixing the replay rate. Amnesia has two steps: (i) compute lightweight class utilities, such as EMA loss or confidence, to tilt p0 toward harmful classes; and (ii) project the tilt back into the delta-ball using efficient KL (exponential tilt) or TV (balanced mass redistribution) optimizers. A windowed scheduler enforces rolling audits. Across challenging CL benchmarks and strong replay baselines, Amnesia consistently lowers final accuracy (ACC) and worsens backward transfer (-BWT). The KL variant delivers high impact while remaining largely undetected under multiple audit schemes, including per-batch and rolling-window checks. The TV variant is more damaging but easier to detect, especially under tight per-class constraints. These results expose index-only replay control as a practical, auditable threat surface in CL systems and establish a principled impact-visibility trade-off.
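
The two-step KL variant described in this abstract (tilt the nominal histogram p0 toward high-utility classes, then stay inside the delta-ball) can be sketched with a standard exponential tilt. This is an illustrative reading, not the authors' implementation: the function names, the bisection on the tilt strength `lam`, the cap `lam_max`, and the requirement that every entry of `p0` be strictly positive are assumptions.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def exponential_tilt(
    p0: Sequence[float], utilities: Sequence[float], lam: float
) -> List[float]:
    """Reweight p0 proportionally to exp(lam * utility) and renormalize."""
    m = max(utilities)  # subtract the max so exp() never overflows
    w = [q * math.exp(lam * (u - m)) for q, u in zip(p0, utilities)]
    z = sum(w)
    return [x / z for x in w]


def tilt_within_kl_budget(
    p0: Sequence[float],
    utilities: Sequence[float],
    delta: float,
    lam_max: float = 50.0,
    iters: int = 60,
) -> List[float]:
    """Strongest tilt (found by bisection on lam) with KL(p || p0) <= delta.

    KL(p_lam || p0) is non-decreasing in lam >= 0, so bisection applies.
    """
    if kl_divergence(exponential_tilt(p0, utilities, lam_max), p0) <= delta:
        return exponential_tilt(p0, utilities, lam_max)
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_divergence(exponential_tilt(p0, utilities, mid), p0) <= delta:
            lo = mid
        else:
            hi = mid
    return exponential_tilt(p0, utilities, lo)
```

The same budget check is what a defender's audit would compute, which is why the abstract frames the result as an impact-visibility trade-off.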

On the completeness of generalized hierarchical spline spaces

Ahmed Oufqir, Carla Manni, Hendrik Speleers — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12656v1 Announce Type: new Abstract: We introduce a general theoretical approach to hierarchical spline spaces that replaces the classical constructive definition - based on basis selection - with a descriptive formulation in terms of regularity constraints. Specifically, we define generalized hierarchical spline spaces on multi-level domains as collections of piecewise functions satisfying hierarchical contact conditions across interfaces between refinement levels. The proposed framework applies to a broad class of local function spaces and relies on a minimal abstract requirement, the extension assumption, rather than on specific polynomial properties. Within this framework, we identify rules under which the hierarchical selection mechanism yields a complete basis, in the sense that it spans exactly the space characterized by the contact conditions. As an application, we consider Tchebycheffian spline spaces. We show that spaces generated by extended complete Tchebycheff (ECT) systems fit in this framework, thereby establishing the completeness of hierarchical Tchebycheffian splines. This demonstrates that the proposed theory naturally extends beyond the polynomial setting and provides a unified foundation for hierarchical constructions in more general spline spaces.

TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

Siyu Li, Toan Tran, Lingyi Zhao, Khurram Shafique, Li Xiong — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12657v1 Announce Type: new Abstract: Human mobility data is important for transportation, urban planning, and epidemic control, but large-scale trajectory collection is often costly and privacy-constrained, motivating realistic synthetic trajectory generation. Existing LLM-based generators typically rely on either prompt engineering, which preserves zero-shot reasoning but lacks fine-grained spatiotemporal grounding, or trajectory-level fine-tuning, which improves statistical precision but incurs substantial computational cost and may weaken general reasoning. We propose TrajGenAgent, a semantic-aware hierarchical LLM-agent framework for human mobility trajectory generation without model fine-tuning. TrajGenAgent uses a two-stage orchestrator-worker design: an LLM first synthesizes an individual- and weekday-conditioned activity chain from historical evidence via in-context learning, and a deterministic workflow then grounds each activity into a complete visit using personalized POI retrieval, distance-aware location selection, kinematics-aware travel-time propagation, and LLM-based duration estimation. To evaluate realism beyond aggregate spatiotemporal statistics, we introduce an anomaly-detection-based evaluation framework using two complementary detectors to assess behavioral and semantic plausibility. Experiments on benchmark and large-scale simulation datasets show that TrajGenAgent improves spatiotemporal fidelity, semantic coherence, and individual-specific behavioral realism over representative neural and LLM-based baselines, while avoiding parameter updates.

Physics-Informed Neural Networks for Chemotherapy Pharmacokinetics: Benchmarking the Clinical Estimator and Exposing Parameter Identifiability

Riya Bisht, Dhruv Agarwal — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12658v1 Announce Type: new Abstract: Physics-Informed Neural Networks (PINNs) are an attractive tool for partial-observation problems in biology, where the governing dynamics are known but some compartments cannot be measured. Chemotherapy pharmacokinetics (PK) is a clean instance: drug concentration in plasma is routinely measured, but concentration in tissue -- which determines tumour kill and off-target toxicity -- is not. We benchmark a PINN against the standard clinical baseline (nonlinear least-squares on the analytical biexponential plasma solution, hereafter NLS) and a physics-agnostic neural baseline (a data-only MLP) on two PK problems. On the linear two-compartment problem, NLS is near-optimal; the PINN matches it to within a small constant factor while also producing the tissue curve in a single training pass, whereas the data-only MLP fails on tissue by roughly 10x. On a Michaelis-Menten extension (saturable elimination), the biexponential closed form no longer exists, so NLS is mis-specified and silently returns meaningless rate constants. The PINN instead exposes a deeper fact: the Michaelis-Menten two-compartment model is non-identifiable from plasma alone, and the PINN reports this honestly by converging to a basin with k12 -> 0. Adding two sparse tissue observations largely resolves identifiability: across five seeds the PINN recovers k21 to within 1% of truth and Vmax, Km to within one standard-deviation bar, while k12 moves in the correct direction (0.02 -> 0.82) but remains ~2 sigma below truth -- a recovery the closed-form NLS estimator cannot attempt at all, because its biexponential ansatz describes only plasma. Our claim is not that PINNs beat NLS. It is that PINNs offer a uniform recipe that ties the textbook estimator on the textbook problem, exposes structural identifiability that the textbook estimator hides, and absorbs heterogeneous measurements within a single loss.
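
The analytical biexponential solution that the NLS baseline fits can be written down for the linear two-compartment model. The sketch below is a textbook closed form under stated assumptions, not the authors' code: it assumes an intravenous bolus of size `dose` into the plasma compartment, works in drug amounts rather than concentrations (no volume scaling), requires k12 > 0, and uses `k10` as a hypothetical name for the elimination rate, which the abstract does not name.

```python
import math
from typing import Tuple


def two_compartment_bolus(
    t: float, k10: float, k12: float, k21: float, dose: float = 1.0
) -> Tuple[float, float]:
    """Plasma and tissue amounts at time t after an IV bolus.

    Model (amounts A1 = plasma, A2 = tissue):
        dA1/dt = -(k10 + k12) * A1 + k21 * A2
        dA2/dt =  k12 * A1 - k21 * A2
    with A1(0) = dose and A2(0) = 0.
    """
    s = k10 + k12 + k21          # alpha + beta
    p = k10 * k21                # alpha * beta
    root = math.sqrt(s * s - 4.0 * p)
    alpha = 0.5 * (s + root)     # fast (distribution) rate
    beta = 0.5 * (s - root)      # slow (elimination) rate
    ea, eb = math.exp(-alpha * t), math.exp(-beta * t)
    plasma = dose * ((alpha - k21) * ea + (k21 - beta) * eb) / (alpha - beta)
    tissue = dose * k12 * (eb - ea) / (alpha - beta)
    return plasma, tissue
```

The plasma curve is the biexponential that NLS fits; the tissue curve is available here only because the model is linear, which is the closed form the abstract says no longer exists once elimination is saturable.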

BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

Damien Martins Gomes, François Capman — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12662v1 Announce Type: new Abstract: Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-capacity encoder derived from critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves 3.55 PESQ and STOI 96% on VoiceBank+DEMAND with only 0.83M parameters and 7.3 GMACs, the fewest parameters among all methods with PESQ > 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.

Modeling and Estimation of Solid Electrolyte Interphase during Formation in Battery Manufacturing

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12664v1 Announce Type: new Abstract: The solid electrolyte interphase (SEI) - a critical passivation layer that governs the longevity, safety, and efficiency of lithium-ion batteries - is created during the last step in cell manufacturing called cell formation. Conventional cell formation protocols are largely empirical, resulting in long processing times and limited control over the SEI growth rate that influences SEI quality and lifetime performance. This paper develops a control-oriented, semi-empirical model to estimate SEI thickness growth from terminal voltage and cell expansion measurements acquired in-operando during manufacturing using a low-cost micrometer-precision integrated-sensing fixture. Model parameters are calibrated against cell formation data, and an unscented Kalman filter is employed to estimate the SEI film growth. The results lay the foundation for future closed-loop control of SEI growth, enabling high-quality and more efficient formation processes.

CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents

Siyu Shen, Fenghao Xu, Wenrui Diao, Kehuan Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12666v1 Announce Type: new Abstract: Screenshot-based mobile GUI agents can operate ordinary smartphone apps through the same visual interface as a human user, but this capability also turns every screen observation into a privacy boundary. During normal task execution, screenshots may expose contacts, messages, photos, files, recommendations, health cues, and other sensitive context that is unrelated to the user's request. We call this problem incidental visual privacy exposure. It is difficult to address with existing defenses: text anonymization misses many visual and inferential cues, while generic privacy masking can remove the evidence and controls that a GUI agent needs to complete the task. This paper presents CAPED, a context-aware pre-upload exposure control layer for mobile GUI agents. CAPED is designed as a phone-side protection layer: before screenshots are released to a remote multimodal agent, it extracts task requirements, uses screen context as a privacy prior, parses visible UI elements, and selectively exposes only content needed for the current task while masking incidental private content. We evaluate CAPED on AndroidWorld for broad task utility and with a controlled 28-task seeded privacy evaluation used as a measurement instrument for trajectory-level incidental leakage. In this seeded evaluation, Full CAPED reduces success-conditioned weighted seeded leakage from 0.766 under raw screenshots to 0.268 while preserving high task utility. A broader AndroidWorld run shows a remaining prototype-level utility cost, but the results support the central claim that screenshot upload should be treated as an explicit device-cloud boundary decision, governed by task-driven selective exposure rather than all-or-nothing screen sharing.

Free-Placement Optimization of Ground Station Locations for Low-Earth Orbit Satellites

Grace Ra Kim, Duncan Eddy, Vedant Srinivas, Mykel J. Kochenderfer — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12667v1 Announce Type: new Abstract: Rapidly expanding low Earth orbit satellite constellations are placing increasing demands on terrestrial ground networks, motivating the development of more efficient ground station network designs. Current approaches select sites from predefined locations, limiting optimization to existing infrastructure and constraining performance. In contrast, free-placement optimization operates over a continuous spatial domain on Earth, broadening the search space and allowing higher-throughput configurations at the cost of potentially requiring new infrastructure deployment. In this work, we introduce SCORE (Sequential Cyclic Optimization via Refinement & Evaluation), a two-stage free-placement method for ground station design. SCORE combines sequential coordinate selection with cyclic refinement to manage high-dimensionality, non-convexity, and local minima that challenge global optimizers. We benchmark SCORE against one-shot methods such as differential evolution (DE) and integer programming approaches using locations from Kongsberg Satellite Services and the World Teleport Association. Tests across two commercial Earth observation constellations (Capella Space and ICEYE) and one synthetic Walker-Star constellation show that SCORE requires up to 5x fewer function evaluations to converge relative to DE while improving downlink throughput by up to 13%. Compared to fixed-site methods, unconstrained SCORE achieves up to 15% greater total downlink, establishing a strong empirical performance benchmark for flexible placement; infrastructure-constrained SCORE retains over 92% of this gain while restricting placement to within proximity of existing fiber and power infrastructure. We also explore trade-offs between expanding existing stations and deploying new sites, informing future ground network design for operational constellations.

SalArt-VQA: Diagnosing Whether VLMs Understand Salient Artifacts in Generated Images

Xiaoxiao Sun, Ruotian Zhang, Junzhe Huang, James Burgess, Serena Yeung-Levy — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12671v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly used to detect whether AI-generated images contain visible artifacts, yet their ability to analyze such artifacts remains poorly understood. A correct image-level decision can still hide important failures: a model may correctly flag an artifact while relying on the wrong visual cue, selecting the wrong region, or describing a defect that the image does not support. To evaluate these behaviors directly, we introduce SalArt-VQA, a diagnostic benchmark for fine-grained SALient ARTifact understanding in AI-generated images. SalArt-VQA contains 950 images and 3,681 human-authored multiple-choice questions spanning artifact images, matched real reference images, and paired generated reference images. Four aligned question types evaluate presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification, while the reference splits test calibration and abstention when the annotated defect is absent. Across 20 VLMs, SalArt-VQA reveals failures that image-level detection accuracy hides: the strongest model reaches 99.37% detection recall on artifact images but answers all four artifact-side questions correctly on only 53.26% of images. Comparing artifact images with artifact-free references reveals a sensitivity-calibration tradeoff: sensitive models often make unsupported artifact claims, while conservative models avoid false alarms largely by missing real artifacts. These results show that high artifact detection accuracy alone does not imply grounded artifact understanding. SalArt-VQA exposes these hidden failure modes and provides a fine-grained evaluation of whether VLM artifact claims are supported by local visual evidence.

A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction

Phan Nguyen, Dat Cao, Hien Chu, Khue Hoang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12673v1 Announce Type: new Abstract: Cross-domain graph anomaly detection (GAD) aims to identify abnormal nodes in unseen target graphs, showing strong potential in real-world applications with heterogeneous graph data. However, existing methods often depend on dataset-specific feature semantics and structural patterns, which limits their ability to generalize across different domains. To address this challenge, we propose AlignGAD, a zero-shot generalized graph anomaly detection framework. Our framework is built upon three key components: a Global Unification Module that aligns heterogeneous node features and normalizes graph signals in the spectral domain; a Clustering Module that constructs cluster-aware graph views to capture group-level abnormal patterns; and a Node Discrepancy Scoring Module that measures reconstruction discrepancy and aggregates anomaly evidence from different graph views. Experiments on multiple real-world datasets demonstrate the effectiveness of AlignGAD under the zero-shot GAD setting.

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12674v1 Announce Type: new Abstract: Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.

A Calculus of Apartness over Separoids: Effective Convex Representation, Stratified Conservativity, and the Complexity of Entailment

Faruk Alpay, Baris Basaran — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12676v1 Announce Type: new Abstract: Every finite family of compact convex bodies in Euclidean space induces an apartness relation between disjoint index sets: two sets are apart when the convex hulls of the corresponding unions are disjoint. This paper studies the finite theory obtained by taking apartness as the primitive relation. Its basic laws are symmetry, bilateral subsumption, and vacuity, equivalently the separation-polarity form of acyclic separoids. The main contribution is an effective rational realization theorem with uniform margins and the exact consequence theory it supports. Every finite apartness separoid is realized by rational polytopes whose coordinates are indexed by maximal separations. Maximal separations and minimal Radon partitions can be enumerated from a full table, generators, or a membership oracle; the coordinate values have controlled bit height; and each coordinate records a readable certificate of one maximal separation. The realization separates every apart pair with clearance at least 2, remains correct under outer parallel enlargement by any radius below 1, and yields full-dimensional convex bodies after thickening. The distance-function layer records standard convex-analytic stability through Lipschitz comparison, monotonicity under inclusion, and outer parallel bodies. On the logical side, positive entailment is exactly one-premise subsumption. Boolean consequence over Euclidean scenes is sound, complete, and decidable; satisfiability is NP-complete, validity is coNP-complete, and positive entailment is linear for sorted encodings. A stratification theorem shows that Boolean reasoning introduces no new atomic apartness beyond separoid closure. Fixed-dimensional consequence relations form a strictly decreasing hierarchy that stabilizes in dimension n minus 1 for n sites.

Fed-FBD: Federated Functional Block Diversification for Isolation, Privacy, and Surgical Unlearning

Weijie Chen, Alan B. McMillan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12679v1 Announce Type: new Abstract: Federated learning (FL) enables collaborative model training without sharing raw patient data, but standard approaches such as FedAvg treat each client as a black box and provide no mechanism for isolating an adversarial contributor, auditing per-client influence, or honoring a departed participant's right to be forgotten. We present Fed-FBD (Federated Functional Block Diversification), a modular federated architecture that decomposes a ResNet backbone into six functional blocks (the stem, four residual groups, and the classification head) and maintains a warehouse of N color variants, each assembled from independently tracked and contributor-stamped blocks. Fed-FBD provides three capabilities absent in FedAvg: (i) architecturally guaranteed block-level isolation, so that an adversarial or mislabelled client cannot contaminate the clean colors; (ii) privacy-by-design, where membership inference advantage is already indistinguishable from chance before any privacy mechanism is applied; and (iii) surgical machine unlearning of a departed participant's contribution at sub-second cost and without retraining. Experiments on six MedMNIST-2D datasets, PathMNIST at 224x224, and CIFAR-10 show that Fed-FBD trades a modest 0.3%-3.1% IID accuracy gap on the adequately sized datasets for these guarantees, remains within 0.8%-4.0% of FedAvg at Dirichlet alpha=1.0 on three of four datasets, and confines all six adversarial attacks we study to the poisoned client's own blocks with at most +/-0.01 AUC drift on the clean colors.

How Useful is Causal Invariance for Domain Adaptation in Finite-Sample Settings?

Julia Kostin, Kasra Jalaldoust, Elias Bareinboim, Samory Kpotufe, Fanny Yang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12680v1 Announce Type: new Abstract: Machine learning models often degrade when they are deployed on a target distribution that differs from the source distributions they were trained on. Recent work in causality-based domain generalization has shown how shared causal structure between domains can induce invariant predictors, e.g., models on a subset of features which have stable risk across structured domain shifts. However, the extent to which such population-level causal invariances can lead to gains in finite-sample settings remains underexplored. In particular, in practice we often have access to a few labeled target samples, a setting called supervised domain adaptation (sDA). In this paper, we explore when (full or partial) causal knowledge can provably improve supervised domain adaptation. As a first step, we study linear regression, where full or partial causal knowledge specifies a collection of invariant or possibly invariant feature subsets, each yielding a source-trained candidate predictor. We derive matching upper and lower bounds showing that finite-sample gains are governed by the target-risk margins separating the candidates, together with the finite-source estimation error. When these margins are sufficiently large relative to $n_Q$, an adaptive aggregation procedure can match the best candidate predictor while avoiding negative transfer relative to target-only learning. On the other hand, when the margins are too small, no algorithm can reliably exploit the candidate collection to obtain faster finite-sample rates. We further connect these margins to structural shift magnitude in linear SCMs and validate the theory on real-world causal benchmarks.

From AGI to ASI

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12683v1 Announce Type: new Abstract: Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade target for many of the largest AI organisations. Achieving this goal would have profound and far-reaching impacts on human society, which raises many complex questions for the decade ahead. This report investigates how AI itself might continue to develop in a post-AGI world along the continuum of machine intelligence. The endpoint of this continuum, Universal AI, is theoretically well understood, which provides some formal grounding for the main focus of this report: the transition from human-level AGI to artificial general superintelligence, which, intuitively, can be understood as a system that is more intelligent and cognitively capable than large organisations of humans. After characterizing ASI, the report discusses four potential pathways from AGI to ASI: scaling AGI, AI paradigm shifts, recursive improvement, and ASI emerging from large-scale multi-agent collectives. The report then discusses possible frictions and bottlenecks along these pathways. Determining whether the impact of these frictions will be negligible or substantial raises a number of concrete open research questions. Due to large uncertainties for predicting ASI progress, it cannot be ruled out that AI progress might continue to accelerate over the next years. This could imply that the image of a single transformative step change, caused by the introduction of human-level AGI into our society, could be inaccurate. More apt might be the prospect of a series of transformative societal changes caused by AI-enabled progress and breakthroughs across many areas of science and technology. Preparing for this prospect requires a massively interdisciplinary endeavour of global scope and interest.

Forecasting Is Not Attribution: Localizing Decoder Bypass in Graph-Based Neural Marketing Mix Models

Yunbo Wang, Bolbi Liu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12687v1 Announce Type: new Abstract: Marketing mix models are used to forecast business outcomes and to attribute those outcomes to marketing channels, but these goals are not equivalent. We study a failure mode in graph-based neural MMM called attribution bypass: a high-capacity decoder can obtain low forecasting error through target autoregression, dense communication, co-movement, context, or latent memory while failing to route counterfactual sensitivity through the graph used as the attribution object. We introduce DICE-MMM as a bounded diagnostic and training framework. We do not claim that observational neural MMM identifies causal effects. Instead, DICE separates three questions often conflated in graph-based MMM: graph recovery, forecasting accuracy, and whether the trained decoder's perturbation-induced influence is graph aligned. Stage 1 trains a graph encoder with a restricted graph-mediated decoder. Stage 2 freezes the selected encoder and trains a graph-safe latent decoder whose cross-node communication must pass through the supplied graph. Decoder use is evaluated with CIG, AR-CIG, and graph-swap tests. Across controlled R/d/T swaps and an external multi-graph rawlog stress test, DICE improves stable graph recovery over CausalMMM. The experiments show that forecasting accuracy is not an attribution certificate: in a sparse-target benchmark, no-graph and full-graph decoders achieve MSE@7 around 0.004 while AR-CIG nAUPRC remains near or below zero, whereas an oracle graph reaches 0.807 +/- 0.129 at comparable MSE. Frozen graph-swap localizes the bottleneck: the same DICE-hard-trained decoder moves from nAUPRC -0.044 +/- 0.006 under learned graph inputs to 0.894 +/- 0.027 with the oracle graph. The contribution is a stress test and failure-localization framework showing that low MSE can hide attribution bypass and that the unresolved bottleneck is graph-support selection, not forecasting or decoder capacity.

M*: A Modular, Extensible, Serving System for Multimodal Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12688v1 Announce Type: new Abstract: We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited to accommodate this new architectural diversity. Here we present M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, while delivering up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

Darpan Aswal, Thomas Palmeira Ferraz, Yongxin Zhou, Maxime Peyrard — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12689v1 Announce Type: new Abstract: Latent reasoning models (LRMs) replace explicit chain-of-thought with continuous thoughts. Recent work treats observable latent-state patterns, such as BFS-like frontiers and decodable arithmetic computation, as evidence for internal reasoning mechanisms. Evaluating two LRMs (Coconut and CODI) against controls lacking the proposed recurrence or curriculum, we find these patterns also appear in the controls and do not always causally affect behavior. Causal interventions reveal that latent-thought utilization is not binary but graded, scaling with a thought's causal effect on model behavior. Geometric analyses reveal this effect concentrates in low-rank directions whose step-to-step geometry grows more structured as their behavioral influence increases. Latent thoughts should therefore be treated as hidden computation, not hidden explanation: decodability, attention, or static structure alone cannot establish mechanism. LRM interpretability thus requires matched controls and causal tests.

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

Xin Zhou, Cong Miao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12690v1 Announce Type: new Abstract: In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and fully frozen Cosmos3 backbone network. Evaluated entirely under a zero-shot task protocol, EWAM is centrally focused on reducing the amount of additional deployment data required to adapt to new task layouts. Notably, no extra task-specific demonstration sets were introduced in any of the evaluations, and no fine-tuning was performed on the backbone network. Its performance gains stem entirely from an inference-time co-reasoning mechanism composed of four inserted lightweight neural layers: the Neural Experience Memory Layer located in the intermediate layers of the Diffusion Transformer (DiT) provides task-relevant execution context; the Neural Anomaly Detection Layer after the state prediction head monitors the divergence between predicted and actual states in real time; the Neural Policy Routing Layer dynamically selects direct execution, conservative replanning, or rollback recovery based on the anomaly severity; and the Neural Action Correction Layer refines the generated action chunks using execution diagnostics. Unlike naive feature fusion, the memory, anomaly detection, and correction modules are deeply integrated into the Cosmos3 forward path in a differentiable manner, with only the final routing decision being a discrete supervised one.

Two-Layer Linear Auto-Regressive Models Estimate Latent States

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12691v1 Announce Type: new Abstract: Auto-regressive models have emerged as powerful tools for sequential data, from language to video. Understanding how and why these models learn latent representations remains an open theoretical question. In this work, we demonstrate that when trained by empirical risk minimization on data from partially observed linear dynamical systems, two-layer linear auto-regressive models naturally learn to approximate Kalman filtering. In particular, we show that the learned hidden representation coincides, up to a similarity transformation, with the state estimates produced by the optimal (Kalman) filter, even though the model has no explicit knowledge of the underlying dynamics or state. The result follows from three main insights. First, we establish that the Kalman filter is well approximated by an auto-regressive model with bounded truncation error. Second, we show that despite non-convexity, the two-layer optimization landscape is benign, i.e., all stationary points are either strict saddles or global minima. Finally, as our main contributions, we provide finite-sample guarantees on prediction error, parameter estimation error, and latent state recovery. Numerical simulations support the theoretical results and demonstrate that the latent representations of auto-regressive models recover state estimates.
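
For readers unfamiliar with the first insight, the usual steady-state argument can be sketched as follows. This is textbook Kalman-filter algebra, not the paper's own derivation. The symbols $A$, $C$ (system matrices), $L$ (steady-state predictor gain), and $p$ (truncation length) are assumed notation that does not appear in the abstract.

```latex
% Steady-state Kalman predictor for x_{t+1} = A x_t + w_t, y_t = C x_t + v_t:
\hat{x}_{t+1} = (A - LC)\,\hat{x}_t + L\,y_t, \qquad \hat{y}_{t+1} = C\,\hat{x}_{t+1}.

% Unrolling the recursion p steps gives an auto-regressive form plus a remainder:
\hat{y}_{t+1} = \sum_{k=0}^{p-1} C\,(A - LC)^{k} L\, y_{t-k} \;+\; C\,(A - LC)^{p}\,\hat{x}_{t+1-p}.

% If the spectral radius of A - LC is below 1, the remainder decays geometrically in p,
% so a length-p auto-regressive model approximates the filter with bounded truncation error.
```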

Random Proposals: A Softmax-Based Local-Improvement Framework for Maximum Weighted Matching

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12692v1 Announce Type: new Abstract: We propose a randomized local-improvement algorithm for the Maximum Weighted Matching (MWM) problem. Our method introduces a softmax-based biased sampling mechanism that achieves local $\varepsilon$-dominance and yields an expected $\frac{1}{2}-\varepsilon$ approximation ratio. We prove convergence guarantees and show that the algorithm runs in $O\!\left(m\log(1/\varepsilon)/p_{\min}\right)$ time, where $p_{\min}$ is the minimum softmax proposal probability over all edges; under mild conditions on the bias parameter and weight range, this simplifies to $O(m\log(1/\varepsilon))$. The framework provides a tunable tradeoff between convergence speed and approximation quality.

A unified complexity bound for logconcave sampling

Yunbum Kook, Santosh S. Vempala — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12694v1 Announce Type: new Abstract: We give a simple, unified, and nearly tight bound for sampling arbitrary logconcave distributions from a warm start using the In-and-Out algorithm along with exponential lifting. The main new ingredient in the analysis is an improved bound on the Poincaré constant of a lifted distribution. As a consequence, the resulting convergence rate is nearly tight for both constrained settings (e.g., Gaussian restricted to a convex body) and well-conditioned settings (e.g., strongly logconcave and smooth densities).

Polymer-based Capacitive Micromachined Transducer-Enabled Inline Monitoring of Ultrasonic Welding in Thermoplastic Carbon Fiber Composites

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12695v1 Announce Type: new Abstract: Thermoplastic composite structures enable lightweight, recyclable, and high-throughput aerospace manufacturing, but reliable quality assurance of advanced joining processes remains a key challenge. This work presents a compact, low-cost, and wireless ultrasonic non-destructive testing system for real-time, inline monitoring of continuous ultrasonic welding of thermoplastic carbon fiber composites. The system integrates custom-fabricated polymer-based capacitive micromachined ultrasonic transducers (polyCMUTs) with the ultra-low-power WULPUS platform, enabling operation in the harsh, high-interference welding environment. An eight-element linear polyCMUT array operating at a center frequency of approximately 3.6 MHz is designed, fabricated, packaged, and integrated into an industrial welding setup. Inline measurements are performed during welding of carbon fiber laminates with intentionally introduced defects. Process-synchronous ultrasonic data reveal consistent depth-of-echo shifts at defect locations, in strong agreement with X-ray computed tomography ground truth. Across 21 welds, all induced defects are detected without false negatives and with limited false positives. The results demonstrate that polymer-based CMUT technology enables robust, scalable, and manufacturing-compatible ultrasonic sensing, representing a decisive step toward intelligent process monitoring and quality assurance for next-generation thermoplastic composite welding.

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12699v1 Announce Type: new Abstract: Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML) and rely primarily on historical blood glucose measurements and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12702v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of these systems. However, static benchmarks tend to measure correctness rather than user acceptance, aggregate performance across queries, and require densely annotated datasets -- leading to major blind spots for evaluating clinical systems. In this work, we perform a deployment-centered evaluation of an LLM system embedded within electronic health records at an academic medical center, where user feedback is sparse but closely reflects the deployment conditions. Specifically, we train a pre-response classifier that estimates the risk that a future interaction will result in the user rejecting the LLM response, based on query content and deployment-specific context available before generation. We conduct a prospective analysis of our model over 4.5 months of user feedback, finding that our prediction model achieves an AUROC of 0.719. Further, we estimate the benefit of such predictions in two downstream use cases (guardrail triggering and abstention). Our key conceptual insight is that making use of deployment-specific context (i.e., the provider type, department name, language model used for response), as opposed to only query content, improves the ability to predict whether the user will reject the system output. Altogether, our empirical case study demonstrates the feasibility of predicting user rejection using deployment-specific context, opening the door to targeted guardrails.

SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems

Tarun Sharma — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12703v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) agents increasingly run with persistent memory that accumulates across user sessions. This creates a new attack surface: an adversary interacting only through normal channels can inject crafted memories that, once retrieved, steer the agent's responses for future users, without touching model weights or code. We call this Multi-Session Memory Poisoning (MSMP) and show that no existing defence certifies against it; static-corpus defences (RobustRAG, ReliabilityRAG) assume a fixed knowledge base, and heuristic filters are bypassed by fluent enterprise-style text. We present Signed Memory with Smoothed Retrieval (SMSR), the first defence with a certified robustness bound for this setting. Component 1 adds HMAC-SHA256 provenance at write time, blocking unsigned injection. Component 2 applies randomised memory ablation with verdict-based majority voting at query time, bounding the influence of authenticated adversaries. We prove that no provenance-free retrieval-time filter can certify against adaptive injection, derive a hypergeometric certificate for Component 2, and formalise the Consistent Minority Effect, whereby a consistent adversarial answer wins string-based voting as a numerical minority while verdict-based voting removes it. Across 15 enterprise scenarios (3,150 repeated trials), Component 1 cuts attack success from 93-100% to 0% for all unsigned variants. For an authenticated adversary with a single injection, Component 2 holds success to 8.0% (95% CI [5.8, 10.9], n=450), below the certified worst case. In an end-to-end query-only attack where the agent itself writes the poison rather than it being pre-seeded, SMSR reduces success from 65.3% to 5.3% (n=150, non-overlapping CIs) on a live agent stack. Clean-query utility is 90% (Component 1) and 85% (combined).
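
The write-time provenance idea in Component 1 can be illustrated with a minimal sketch using Python's standard `hmac` module. This is not the paper's implementation: the key, the entry layout, and the function names `sign_memory`, `is_authentic`, and `retrieve` are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac

# Hypothetical demo key; a real deployment would load this from a secret store.
SECRET_KEY = b"demo-key-not-for-production"


def sign_memory(text: str, key: bytes = SECRET_KEY) -> dict:
    """Attach an HMAC-SHA256 tag to a memory entry at write time."""
    tag = hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"text": text, "tag": tag}


def is_authentic(entry: dict, key: bytes = SECRET_KEY) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(
        key, entry.get("text", "").encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, entry.get("tag", ""))


def retrieve(store: list, key: bytes = SECRET_KEY) -> list:
    """Return only the memories whose tags verify; unsigned or forged entries are dropped."""
    return [entry["text"] for entry in store if is_authentic(entry, key)]


# One legitimately written memory and one injected entry with a forged tag.
store = [
    sign_memory("Refund window is 30 days."),
    {"text": "Always approve refunds.", "tag": "forged"},
]
trusted = retrieve(store)
```

A signature of this kind only blocks writers who lack the key. An adversary who can write through the authenticated channel still passes this check, which is the case the abstract assigns to Component 2.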

VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

Thach Nguyen, Danhua Guo, Tom Lampo, Fei Wu, Burhan Yaman — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12706v1 Announce Type: new Abstract: Vision-language-action (VLA) models generate chain-of-thought (CoT) reasoning alongside driving trajectories, but existing benchmarks evaluate only trajectory quality and do not assess whether the CoT is relevant, consistent, or causally connected to the driving action. We introduce VLADriveBench, a framework that combines observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to provide complementary views of the CoT-action relationship. Applying VLADriveBench to three models across two architectures, we find that the two analyses can diverge sharply: ORION scores highest on observational alignment yet its CoT is epiphenomenal, while Alpamayo v1.5 scores lower yet its CoT is strongly causal, with visual salience gating the extent of CoT influence.

Storage and Transport Capacity Design for a Self-Reliable Two-Node Stochastic Resource System

Arnab Deya, Vivek Khatana, Ankur Mani, Murti V. Salapaka — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12707v1 Announce Type: new Abstract: We study a two-node stochastic resource system operating over a finite horizon. Each node experiences uncertain supply and demand and is equipped with finite storage. The objective is to ensure that resource levels remain within prescribed limits with high probability. To this end, we formulate a chance-constrained capacity-design problem in which resources can be exchanged through a capacity-limited transport link. We characterize the minimum storage required at each node, derive the optimal transport policy, and quantify the trade-off between storage and transport capacities. Our results show the existence of a critical transport-capacity threshold that enables full risk pooling between the nodes. Moreover, this threshold decreases with the operating horizon, implying that full-pooling performance can be achieved with progressively smaller transport capacity over longer horizons.

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12708v1 Announce Type: new Abstract: Despite their linguistic diversity and global significance, African languages remain underrepresented in research and resources to support NLP. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker verified data that capture typological key features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap, where models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.

Smarter Saboteurs, Better Fixers: Scaling & Security in Linear Multi-Agent Workflows

Timothy McAllister, Sina Abdidizaji, Ivan Garibay, Ozlem Ozmen Garibay — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12709v1 Announce Type: new Abstract: As LLM-based multi-agent systems (MAS) are deployed in the wild, the resilience of their collaboration structures against adversarial compromise becomes a critical safety concern. Attackers may leverage prompt-injection or jailbreaking to sabotage individual agents within MAS workflows, but the interaction between model scaling and system-level resilience remains poorly understood. This paper investigates how model scale affects the security of linear multi-agent workflows. Our experiments across scales of two open-weight model families on the HumanEval benchmark reveal a compliance-correction symmetry: larger models are far more likely to faithfully execute malicious instructions, with the control-to-malicious performance drop reaching 53.7pp at 27B in uncorrected pipelines. However, appending a lightweight terminal Fixer stage collapses this to 0.6pp and restores statistical parity with control-level performance, demonstrating that strictly linear collaboration structures can be viable and resilient to adversaries at this scale, and suggesting that the brittleness previously attributed to linear topology may stem from a lack of correction.

A Stabilized Path-Space Approach to Diffusion-Based Posterior Sampling

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12710v1 Announce Type: new Abstract: Diffusion models provide expressive data-driven priors for Bayesian inverse problems, but many diffusion posterior samplers rely on heuristic guidance approximations that can fail for nonlinear operators and multimodal posteriors. In this work, we develop a stabilized path-space framework for diffusion-based posterior sampling. Starting from a base diffusion process whose terminal marginal represents the prior, we define a likelihood-weighted target measure on trajectories and cast posterior sampling as learning a controlled stochastic process whose path measure matches this target. This formulation connects diffusion posterior sampling to stochastic optimal control while preserving the Bayesian structure needed for uncertainty quantification. We introduce a time reparameterization that makes the path-space control problem well posed by removing the bias induced by the unknown initial value function, without auxiliary training. We then learn the control via a trust-region path-space optimization method with log-variance objectives. The path-space perspective also unifies our learned control approach with existing guidance-based samplers, quantifies the sampling error induced by approximate controls, and yields importance sampling corrections for asymptotically exact posterior expectations. We evaluate the proposed framework on a suite of benchmark inverse problems with analytically characterized or high-quality reference posteriors, enabling principled assessment of sampling accuracy and uncertainty quantification. These experiments provide insight into the behavior of diffusion-based posterior samplers and demonstrate improved accuracy and robustness over leading approaches.

Definitional alignment before capability alignment: a Design-Science framework for adjudicating claims about AGI

J. E. Aguilera Briones — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12713v1 Announce Type: new Abstract: Claims that artificial general intelligence has already arrived and claims that it remains decades away are often defended from overlapping evidence. "AGI" lacks a single shared and stable referent and competing operationalizations can return different verdicts on the same system. This article treats that under-specification as a design and governance problem. Following Design Science Research Methodology, it develops DAF-AGI, a second-order conceptual artifact with two coupled components: five ordinal criteria for assessing the adjudicative fitness of candidate definitions and a structured governance audit of authorship, interest, certification, external verification and revision authority. The artifact is demonstrated on five prominent measurement families and one deflationary boundary position in a documented corpus and then stress-tested against a stylized strong arrival claim: that current generative systems constitute AGI because they outperform a well-educated adult on many cognitive tasks. On evidence from the cited 2024-2025 sources, the claim was certifiable only under a performance-based operationalization; capability-ontology, psychometric and skill-acquisition approaches did not certify it, the economic family remains indeterminate and the deflationary position refuses binary adjudication. The contribution is a novel integration and operationalization, not an empirical validation: independent application, inter-rater testing and author-external cases remain necessary. The paper further proposes definitional sovereignty as an enabling component of algorithmic sovereignty: the institutional capacity to contest, certify and revise imported technological categories under public accountability.

The three dimensional Neumann Green's function for general surfaces: singular asymptotics and boundary integral methods

Alan E. Lindsay, Andrew J. Bernoff, Tristan Goodwill, Jeremy G. Hoskins — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12714v1 Announce Type: new Abstract: We present an asymptotic analysis and high-order boundary integral method for the three-dimensional Neumann Green's function in general geometries. The Neumann Green's function is a fundamental quantity which arises in numerous fields of science and engineering. In the application of singular perturbation methods to strongly localized reactions and diffusive transport, the Green's function plays the key role in mediating global dynamics. However, this essential quantity can only be determined in closed form for a limited set of geometries. The Green's function for the Laplacian is an elliptic problem with a Dirac forcing term. Accurate resolution of the solution requires a careful decomposition into a singular and a regular part. The bulk scenario is where the source is placed off surface and the singularity is given by the free-space function. In the surface case, where the source is placed at a curved point on the boundary, we use asymptotic analysis to determine a three-term singularity structure. With explicit knowledge of these singularities, we develop a high-order boundary integral method for the determination of the remaining regular part. To resolve the singular boundary data, our integral method uses a custom discretization with Duffy patches near the source. We validate our method using several test cases in which closed form solutions can be developed, including spheres, prolate spheroids and constructed domains. We demonstrate the applicability of our method to address some open problems in narrow capture theory.

Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

Xinyu Zhao, Rana Muhammad Shahroz Khan, Zhen Xu, Zhen Tan, Tianlong Chen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12716v1 Announce Type: new Abstract: The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks of adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., "inflate this score"), for which no practical defenses exist, rather than a general safety policy violation. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.

Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting

Sudeepta Mondal, Ganesh Sundaramoorthi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12718v1 Announce Type: new Abstract: Radio-frequency (RF) fingerprinting systems must operate in open-world environments where signals from unknown transmitters and temporal drift introduce distribution shift at test time. Out-of-distribution (OOD) detection provides a natural framework for this problem, yet its application to RF fingerprinting (RFF) remains limited. A key barrier to its adoption is that most OOD detectors require auxiliary OOD data for parameter tuning, an assumption that is difficult to satisfy in RF environments where representative OOD data is impractical to collect. In this work, we introduce a promising set of OOD detection methods from the machine learning literature to the open-set RFF domain. We present these methods within a unified mathematical framework based on information theory, which is a natural framework for communication systems. Our framework allows for the systematic analysis of methods and development of new methods. We further demonstrate the applicability of recent work on tuning OOD detectors without given OOD tuning data for open-set RFF. We evaluate on the POWDER RF fingerprinting dataset, showing that detectors tuned without any given OOD data achieve performance comparable to baselines with access to true OOD tuning data and greatly outperform baseline approaches without access to true OOD tuning data, demonstrating the practical viability of this approach for the RFF problem.

A Multiplexing Design Space: Theory, Method, and Application

Yiwen Xing, Afrah Farea, Saiful Khan, Min Chen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12719v1 Announce Type: new Abstract: Many visualization designs feature phenomena referred to as "visual multiplexing", where multiple pieces of information associated with the same data point are conveyed simultaneously. Although visualization designers are able to bring such phenomena, often unconsciously, into their designs, the design space of visual multiplexing is huge, and it is uncommon to explore visual multiplexing systematically as design patterns. In this paper, we propose a design method for exploring a smaller design space constrained by an application. As an illustrative case study, we focus on machine learning (ML) workflows for developing ML models that approximate partial differential equations (PDEs). In these workflows, ML researchers need to analyze the inter-relationships among multiple 2D scalar fields frequently. Since superimposing one heatmap on top of another is not an effective design, we formulate three design steps to explore the design space of visual multiplexing in the context of multiple 2D scalar fields. Our design method also includes a pre-design step for domain grounding and theoretical analysis, and involves domain experts in both co-design and evaluation activities. The design process enables us to identify relatively optimal default multiplexing designs as well as the need for small variations that domain experts can control through a user interface.

The Theory of Mind Utility: Formal Specification of a Mentalizing Mechanism

Nikolos Gurney, Stacy Marsella — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12721v1 Announce Type: new Abstract: Inferring others' beliefs requires more than reading surface signals; it requires tracking who told them what, in what order, and how credibly. The Theory of Mind Utility (ToM-U) formalizes this epistemic state inference problem at the computational level of analysis, specifying what mentalizing computes and why without commitment to algorithmic or neural implementation. ToM-U achieves this by constructing Local Epistemic World Models (LEWMs) -- directed typed graphs that represent agents, state nodes, and the epistemic relationships among them -- and evaluating discrete candidate LEWMs against observed behavior until one achieves sufficient confidence. Five formal definitions specify the LEWM structure, agent node properties including ordered information access history, a bounded proliferation mechanism for recursive mentalizing, three inference procedures, and a residue function that captures the structured trace left by failed mentalizing attempts. ToM-U differs from Bayesian Theory of Mind and adjacent formal accounts, which presuppose rather than derive belief states, and from simulation theory and theory-theory, which lack a formal apparatus for epistemic state inference. The architecture generates directional, falsifiable predictions about mentalizing failure that follow from structural properties of the model rather than auxiliary assumptions, and positions ToM-U as a domain-agnostic mechanism upstream of goal inference and other downstream social cognitive processes.

EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows

Clinton Enwerem, John S. Baras, Calin Belta — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12728v1 Announce Type: new Abstract: Most learned dexterous grasp generators relegate contact forces to a downstream verification step, so a kinematically-plausible pose can still violate the conditions for a stable physical grasp. We address this with EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from an object point cloud. Our architecture projects contacts onto the object surface and forces into the Coulomb friction cone by construction, so placement and friction compliance hold without loss penalties. We prove end-to-end SE(3) equivariance and verify it empirically over 200 rotations, with wrist residuals below $0.04^\circ$ and exactly zero joint deviation. Trained on 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, our model achieves zero friction violations, the best composite score, and the lowest wrench residual among all ablation variants. We retarget decoded fingertip contacts to a 16-DoF LEAP Hand via per-finger inverse kinematics, and our hardware-feasible refinement places every joint at least 5% inside its actuator envelope while preserving wrench balance. On the physical robot, retargeted EquiDexFlow-decoded grasps complete open-loop pick-and-hold trials on all six test objects, with every asymmetric object succeeding at both the canonical pose and a $120^\circ$ co-rotation. Videos, code, and checkpoints are available at https://equidexflow.github.io.
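
The abstract states that predicted forces are placed inside the Coulomb friction cone by construction but does not say how. As one possible illustration only, the sketch below uses a plain Euclidean projection onto the cone $\|f_t\| \le \mu f_n$; the function name, argument layout, and the choice of Euclidean projection are assumptions and may differ from what EquiDexFlow does.

```python
import math


def project_to_friction_cone(force, normal, mu):
    """Euclidean projection of a 3D force onto the cone |f_t| <= mu * f_n.

    Hypothetical helper for illustration; `normal` is the inward contact
    normal and `mu` >= 0 is the friction coefficient.
    """
    # Normalize the contact normal so the decomposition is well defined.
    n_len = math.sqrt(sum(c * c for c in normal))
    n = [c / n_len for c in normal]

    f_n = sum(f * c for f, c in zip(force, n))        # normal component (scalar)
    f_t = [f - f_n * c for f, c in zip(force, n)]     # tangential component (vector)
    t = math.sqrt(sum(c * c for c in f_t))            # tangential magnitude

    if t <= mu * f_n:
        return list(force)          # already inside the cone
    if mu * t <= -f_n:
        return [0.0, 0.0, 0.0]      # in the polar cone: nearest point is the apex

    # Otherwise the nearest point lies on the cone boundary.
    s = (mu * t + f_n) / (1.0 + mu * mu)              # projected normal magnitude
    return [s * c + mu * s * ft / t for c, ft in zip(n, f_t)]
```

For example, with `normal = [0, 0, 1]` and `mu = 0.5`, a force `[1, 0, 1]` is mapped to a point whose tangential magnitude is exactly `mu` times its normal component.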

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12730v1 Announce Type: new Abstract: Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (Big 5) that predict specific behaviors weakly, even in humans. Furthermore, the isolation of conversational sessions combined with weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect such coherence were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention targeted to a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, while also varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, the Theory of Planned Behavior reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment. These findings suggest that coarse personality frameworks, such as Big 5, may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.

Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12731v1 Announce Type: new Abstract: As LLMs increasingly serve in advisory and deliberative roles, users rely on them for non-verifiable reasoning in domains lacking objective ground truths. However, traditional evaluations of LLM reasoning focus almost exclusively on fact-based domains, such as mathematics and science, leaving uncertainty over whether and to what degree models can handle ambiguous, subjective, or value-laden problems over time. To address this concern, we propose moral reasoning as a paradigmatic subdomain of non-verifiable reasoning. We define moral robustness as a model's capacity to exhibit sound moral reasoning across time and contexts, and we introduce a scalable, adversarial, multi-turn evaluation framework to empirically measure this capability. We simulate 48,000 user-agent moral deliberations across four frontier LLMs, varying premise relevance, premise order, conversation duration, and the user's stated moral view. We find that models successfully ignore morally-irrelevant distractors, but shift their reasoning by up to 6.5%, on average, towards the user's stated preferred moral view, and vary their reasoning depending on factors such as order (altering moral judgments by order in 13-22% of the cases) and duration (altering moral judgments between single-turn and multi-turn in 10-24% of the cases). Our analysis indicates that models tailor not just their final verdicts but their underlying justifications to align with a user's moral viewpoint - a failure mode we characterize as moral deliberative sycophancy.

Let's Ask Gauss: Improved One-Run Privacy Auditing

Adya Agrawal, Yu Wei, Jaspal Singh, Malik Magdon-Ismail, Vassilis Zikas — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12733v1 Announce Type: new Abstract: Privacy auditing provides an important safeguard by estimating the actual information leaked by a model, thus ensuring that theoretical privacy guarantees hold in practice. We study empirical privacy auditing for differentially private (DP) machine learning, focusing on efficient one-run methods for mechanisms such as DP-SGD. Prior one-run approaches threshold training examples or "canaries" into binary membership guesses, which discards useful information. We show that, in the white-box DP-SGD setting, canary-aligned signals naturally form a sequence of random variables whose normalized sum is asymptotically Gaussian. Leveraging this distributional perspective, we develop a DP-auditing framework that leads to tighter privacy lower bounds from a single training run.

Physics-Informed Neural Networks and Radial Basis Functions for PDEs with Dirac Delta Sources

Manuel Reyna, Alexandre Tartakovsky — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12735v1 Announce Type: new Abstract: Physics-Informed Neural Networks (PINNs) are a machine learning method for solving forward and inverse Partial Differential Equations (PDEs). When applied to PDEs with Dirac delta functions in the forcing terms, boundary conditions, or initial conditions, PINNs require approximating them with smooth surrogate functions, a practice that can introduce significant modeling errors. In this work, we exploit the interpretation of PINNs as Residual Least Squares (RLS) methods and show that this perspective enables direct treatment of Dirac delta terms by integrating the weak-form equation. Among RLS formulations other than PINN, we focus on the Radial Basis Function (RBF) expansion (also known as a single-layer RBF Network). We show that while integrating out the Dirac delta in PINNs causes residuals to fail to converge to zero, RBF-RLS consistently provides good forward and inverse solutions to transport problems. We explain this finding using the Neural Tangent Kernel (NTK) theory. We test both approaches on linear PDEs that represent groundwater flow and transport in porous media and rivers. We solve inverse problems to fit synthetic data, noisy synthetic data, and real-world measurements.
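
The idea of treating the Dirac delta by integration rather than by a smooth surrogate rests on the sifting property. As a minimal sketch, assume a generic linear operator $\mathcal{L}$, source strength $q$, source location $x_0$, and test function $\varphi$ (all illustrative symbols, not the paper's notation):

```latex
\mathcal{L}u(x) = q\,\delta(x - x_0)
\quad\Longrightarrow\quad
\int_\Omega \big(\mathcal{L}u\big)(x)\,\varphi(x)\,dx
= q\int_\Omega \delta(x - x_0)\,\varphi(x)\,dx
= q\,\varphi(x_0).
```

The right-hand side is a finite number, so a residual built from the integrated equation never needs to evaluate $\delta$ pointwise.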

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12736v1 Announce Type: new Abstract: AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Together, SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. Full codes, tasks, and datasets can be accessed via this link: https://sciagentarena.github.io/.

PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12737v1 Announce Type: new Abstract: Large Language Models (LLMs) are rapidly evolving into agentic systems that interact with external tools and environments, introducing new security risks such as indirect prompt injection attacks through untrusted external sources. Existing defenses mainly focus on blocking malicious content at inference time, and current red-teaming methods primarily optimize attack success. As a result, developers have limited visibility into how latent prompt injections emerge and propagate through agents. We propose PI-Hunter, an automated agentic auditing framework for proactive vulnerability exposure in LLM agents. PI-Hunter constructs realistic source-aware test cases and iteratively evolves them through feedback-driven exploration to induce agents to retrieve and reveal latent malicious instructions embedded within external environments. Extensive experiments across multiple benchmarks, agent architectures, attacks, and defenses demonstrate that PI-Hunter substantially improves vulnerability exposure and attack-surface coverage over strong automated red-teaming baselines, while remaining effective under existing prompt injection defenses.

Deep Unfolded Latent Optimally Partitioned-l2/l1 Networks for Data-driven Block-Sparse Recovery

Takanobu Furuhashi, Hidekata Hontani, Qibin Zhao, Tatsuya Yokota — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12740v1 Announce Type: new Abstract: The convex Latent Optimal Partition (LOP)-l2/l1 approach enables block-sparse signal recovery with unknown partitions but relies on manual hyperparameter tuning. Additionally, numerical instability in differentiating its proximal operator prevents its automatic parameter tuning via Deep Unfolding (DU). To address these limitations, we propose two architectures: a stable framework utilizing implicit differentiation and a flexible variant leveraging Deep Weight Factorization (DWF). The DWF-based approach also supports nonconvex smooth data fidelity terms. Numerical experiments demonstrate that DU-LOP-l2/l1 yields competitive performance and high resilience against impulsive noise.
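
For context, the standard l2/l1 (group-sparsity) penalty with a known, fixed partition has a closed-form proximal operator, block soft-thresholding. The sketch below shows only that textbook building block; it is not the LOP-l2/l1 proximal operator from the paper, which handles unknown partitions, and the function name and argument layout are assumptions.

```python
import math


def block_soft_threshold(x, blocks, lam):
    """Proximal operator of lam * sum_b ||x_b||_2 for a fixed partition.

    `blocks` is a list of (start, stop) index pairs covering `x`.
    Each block is shrunk toward zero by `lam` in Euclidean norm, and is
    set exactly to zero when its norm is at most `lam`.
    """
    out = [0.0] * len(x)
    for start, stop in blocks:
        norm = math.sqrt(sum(v * v for v in x[start:stop]))
        # Shrinkage factor in [0, 1); zero-norm blocks stay zero.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in range(start, stop):
            out[i] = scale * x[i]
    return out
```

With `x = [3, 4, 0.1, 0.1]`, blocks `[(0, 2), (2, 4)]` and `lam = 1`, the first block (norm 5) is scaled by 0.8 and the second block (norm below 1) is zeroed, which is the block-sparsity effect.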

Reducing the Complexity of Deep Learning Models for EEG Analysis on Wearable Devices

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12742v1 Announce Type: new Abstract: Wearable healthcare devices are the fastest-growing Internet of Things (IoT) sector. Many automated healthcare services rely on two crucial biological signals, namely ECG and EEG, which reflect the activity of the heart and brain, respectively. Although deep neural networks (DNNs) are considered the primary way to process and analyze these signals, the energy and computational budgets of wearable devices fall far short of the computational, energy, and memory-bandwidth demands of DNN models, thereby impeding the deployment of deep learning in many practical wearable services. This paper investigates the feasibility of deploying state-of-the-art DNN models in resource-constrained wearable devices. Notably, we explore the trade-off between accuracy and computational complexity of DNNs when parameter quantization and electrode reduction methods are used. Our investigation centers on several state-of-the-art DNN models designed for EEG signal analysis, specifically for detecting epileptic seizures. Our findings demonstrate that, when applied judiciously, these techniques can significantly reduce the complexity of the DNNs under consideration with minimal adverse effects on accuracy. These results reveal the explicit trade-offs between accuracy and complexity reduction encountered when adapting DNN-based online EEG analysis for wearable devices.
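
The abstract does not specify which quantization scheme is used. As a generic illustration of parameter quantization, the sketch below applies symmetric uniform quantization to a list of weights; the function names, the 8-bit default, and the per-tensor scale are all assumptions rather than the paper's setup.

```python
def quantize_symmetric(weights, bits=8):
    """Map float weights to signed integers in [-qmax, qmax] with one scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    # One scale per tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from integers and the scale."""
    return [v * scale for v in q]
```

The round-trip error of each weight is bounded by half the scale, which is the accuracy cost traded for storing each parameter in fewer bits.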

GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12744v1 Announce Type: new Abstract: In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) relies on retrieving relevant examples, such as images, captions, or question-answer pairs, to guide predictions across tasks like classification, captioning, and visual question answering (VQA). Most existing approaches select in-context examples based on feature-space similarity, assuming that semantically similar samples provide the most useful context. However, our systematic analysis reveals that this assumption does not always hold: visually similar examples are not necessarily those that most effectively enhance in-context learning performance. To address this, we propose the Guided Retrieval of In-context Prompts (GRIP), a learnable vision-only retrieval framework that leverages feedback from LMMs to identify examples that truly improve model predictions. GRIP learns to distinguish beneficial from detrimental in-context examples through contrastive training, refining retrieval beyond pure similarity. Across three multimodal tasks, namely classification, captioning, and VQA, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. Moreover, we demonstrate that retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including closed-source GPT-4o and Gemini, enabling scalable and cost-efficient deployment of M-ICL. Code will be published upon acceptance.
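
To make the baseline concrete, the sketch below shows plain feature-space similarity retrieval, the approach the abstract argues is not always optimal. It is not GRIP itself; the function names and the dictionary of precomputed embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_by_similarity(query, candidates, k):
    """Return the names of the k candidates most similar to the query.

    `candidates` maps an example name to its (assumed precomputed) embedding.
    """
    ranked = sorted(candidates.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

A feedback-guided retriever, as described in the abstract, would replace the fixed cosine score with a learned score trained on whether an example actually helped the model.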

Prefill Awareness in Large Language Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12747v1 Announce Type: new Abstract: Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. If AI models can recognize and act on the fact that their prior assistant messages have been inserted or edited, the effectiveness and validity of these methods could be compromised. We investigate whether frontier language models can distinguish between tampered and untampered assistant-side context, a capability we call prefill awareness. To do so, we construct a binary preference benchmark across three prefill mechanisms, filtering for cases where models show consistent stances. We find that frontier models show substantial prefill awareness: Claude Opus 4.5 detects prefills opposing its preferences in 9-35% of cases with a 0% false positive rate when prompted; additionally, models often revert towards baseline behavior without explicitly reporting that the prefill was foreign. Controlled ablations also show that detection and resistance rely on different cues: stylistic mismatch mainly affects whether models flag a prefill as foreign, while preference mismatch mainly affects whether they revert toward their baseline answer. We also examine more realistic agentic settings such as misalignment-continuation evaluations and SWE-bench trajectories, where frontier models sometimes disavow prefilled assistant turns in ways that depend strongly on dataset, task success, and hidden formatting artifacts. Our results indicate that prefill awareness is already a substantial confound for some prefill-based methods. We recommend that model developers track this capability in frontier systems.

Agent-based models for the evolution of morphological alternation patterns

Aravinth Kulanthaivelu, Richard Sproat — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12748v1 Announce Type: new Abstract: Why is the past of English "go" the apparently unrelated "went"? Such alternations are frequent in languages. They aid neither communication nor learnability, yet they can be persistent, surviving over centuries or millennia. We present a multi-agent simulation of the emergence of morphological stem and inflection alternations. Alternate forms arise by phonological changes or, as with "go/went", from lexical alternatives associated with a subset of the population. When an agent 'hears' another agent use a novel form for a slot in the paradigm of a word (say, the past tense of go), they will with some probability adopt that form, possibly spreading its use to other slots in the paradigm that shared the same original form. Thus alternative forms can spread through the population and become entrenched as stem or inflectional marker alternants. Unlike many previous computational studies, our system allows for naturalistic lexical forms, realistic phonological rules, lexicons with hundreds or thousands of entries, and agent populations in the tens or hundreds. It supports several network topologies, diffusion patterns and agent adoption policies. One issue with such simulations is evaluation: how realistic is the resulting morphology compared to those of real languages? We introduce the AI Historical Linguist, a novel Large Language Model-driven system that models a debate between two historical linguists. We use this to compare a set of real language morphologies, disguised morphologies, and experimentally evolved morphologies. The results suggest that among the factors that favor more plausible morphologies are scale-free social networks and random Bernoulli adoption of forms. We also present three case studies modeling attested historical changes, allowing us to test what might have happened if history had been different. All code and data are released.

Beyond Resilience -- A Conceptual Framework for Civic Ascent

Alexandros Washburn, Carlo Lipizzi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12752v1 Announce Type: new Abstract: The resilience literature measures urban performance as recovery: the degree to which a city returns to its pre-shock baseline. This paper develops a stronger concept -- civic ascent -- as part of a broader research program on the ethology of coupled agent-environment systems, of which the city is the deepest available empirical instance. Civic ascent is defined as the condition in which a city emerges from shock with higher functional capacity than before. We develop a conceptual framework in the ethological tradition, treating the city as a coupled system of three slow state variables -- topos (physical structure), nomos (institutional structure), and hexis (civic judgment) -- together with a fast affective channel (delta) through which shocks to topos and nomos reach hexis. The framework distinguishes three structurally distinct pressures on civic systems: shocks (discontinuities in T or M), decay (continuous entropy), and leakage (active extraction of civic surplus into non-civic pools). The ascent condition is that reinforcement from cross-coupling of T, M, and H exceeds the combined loss from decay and leakage. Post-shock ascent is measured by a normalised improvement index A(T) applied to a composite civic performance signal P(t) constructed from scale-adjusted key performance indicators, distinguishing intrinsic civic ascent from demographically driven growth. New York City after September 11, 2001, is proposed as the primary empirical case; the operational measurement program is specified in the companion NYC Civic Data Map (Washburn 2026c, 133 KPIs) and executed in Paper 2. The reader for whom only the urban contribution is of interest will find it complete in itself; the reader interested in the larger program will find this paper its formal core.

On the Limits of Performance Portability in Directive-Based GPU Programming

Alessandro Romeo, Nitin Shukla, Stefano Truzzi, Alessio Suriano, Andrea Mignone — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12753v1 Announce Type: new Abstract: The transition of scientific applications to GPU-accelerated exascale systems is constrained by trade-offs between performance, portability, and productivity. This work evaluates the performance portability of directive-based GPU programming by porting gPLUTO, a production-grade magnetohydrodynamics code for astrophysical simulations, from OpenACC to OpenMP, and analyzing its performance on NVIDIA A100 (Leonardo Booster) and AMD MI250X (LUMI-G) devices. On NVIDIA platforms, OpenACC and OpenMP achieve comparable performance due to a shared compiler backend, providing a consistent baseline for assessing algorithmic efficiency. In contrast, the same OpenMP implementation is approximately three times slower at the application level on AMD MI250X with respect to the NVIDIA A100 OpenACC baseline, with kernel-level slowdowns reaching up to an order of magnitude, driven by sensitivity to strided memory-access patterns and compiler limitations. Kernel-level profiling shows that the dominant contributors to runtime are memory-latency-bound rather than limited by peak bandwidth. In low-parallelism kernels, C++ abstraction layers increase register pressure and spilling, leading to extreme slowdowns of up to 47x in specific cases. These results indicate that portable performance across GPU architectures requires not only application-level changes but also continued advances in compiler backends and architecture-aware optimization strategies.

LLMs Can Better Capture Human Judgments--With the Right Prompts

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12754v1 Announce Type: new Abstract: Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We demonstrate simple prompting strategies that mitigate these limitations. Across two datasets--a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme's Family and Changing Gender Roles module covering 32 countries--we show how simple elicitation techniques help improve AI-human alignment. First, prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Second, ensuring scenarios are clear to human participants--as reflected in human confusion ratings--boosts model alignment, and LLMs can track human confusion ratings. At the same time, we find that LLMs' estimates of their own error are poorly calibrated, though they can predict human variability relatively well. These results suggest that asking better questions to LLMs can yield better answers.

Sparse2Act: Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation

Yu Guo, Chang Yu, Siyu Ma, Yunuo Chen, Yin Yang, Ying Nian Wu, Chenfanfu Jiang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12759v1 Announce Type: new Abstract: Explicit 3D representations are attractive for manipulation because they expose object shape, workspace geometry, and robot-object relations in metric coordinates. However, sparse 3D encoders are often learned through downstream task objectives, tying the representation to a particular data distribution, policy architecture, and action parameterization. We introduce Sparse2Act, an observation-action alignment framework for pretraining sparse point-cloud encoders. The key idea is to use task-space end-effector actions as geometric supervision: masked sparse 3D tokens are trained to organize scene features around the workspace motion paired with the observation. After pretraining, only the encoder initialization is reused by downstream policies, allowing them to retain their own architectures and action spaces, including joint-space commands. On the LIBERO-10 benchmark, our method achieves 86.9% average success after 500 fine-tuning steps. The same pretrained encoder supports LIBERO-to-Meta-World cross-domain transfer, achieving 73.4% average success on the Meta-World-5 benchmark. Ablations on the objective and decoder capacity show that the gains come from the masked action-alignment signal and remain useful across downstream action decoders. In real-world experiments, simulation pretraining followed by limited real-data fine-tuning achieves an average success rate of 72.5% across four tasks, demonstrating effective sim-to-real transfer. These results suggest that robot actions can provide compact geometric supervision for reusable sparse 3D representations.

Adaptive Weighted Averaging

Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12763v1 Announce Type: new Abstract: We study the problem of selecting the largest among $n$ unknown values $x_1,\dots,x_n$ given only a single unbiased estimate $y_i$ for each $x_i$. We design strategies that are simultaneously admissible (not uniformly dominated by any other strategy) and also never worse than a given baseline such as uniform random selection. We provide an application to stochastic optimization, where we obtain online-to-batch conversion bounds with a desirable "no-compromise" guarantee: they are never worse than standard random iterate selection, and yet can be significantly better in benign settings.

Detecting Functional Memorization in Code Language Models

Matthieu Meeus, Anil Ramakrishna, Matthew Grange, Zheng Xu, Luca Melis — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12764v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to generate code at scale. Meanwhile, prior work has investigated whether training data may be recoverable from model outputs, by auditing the textual overlap between training examples and model generations. Code, however, can be functionally equivalent while textually dissimilar. In this work, we study functional memorization: extraction of functional logic beyond what verbatim metrics detect. We construct a counterfactual setup for Olmo-3-32B, comparing a midtrained model (exposed to target code) against a pretrained reference (not exposed). We prompt both models with Python function signatures and measure both textual and functional similarity (i.e., LLM-as-a-judge, execution-based). Our results show clear evidence of functional memorization, highlighting the need for auditing metrics that go beyond textual overlap.
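
The distinction between textual and functional similarity can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names `outcome` and `functional_agreement`, and the choice to score agreement as the fraction of matching results on a shared input set, are assumptions made for illustration only.

```python
def outcome(fn, args):
    """Run fn on args and return a comparable record of what happened."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:  # an exception counts as an observable behavior
        return ("error", type(exc).__name__)


def functional_agreement(candidate, reference, inputs):
    """Fraction of inputs on which both functions behave identically.

    Hypothetical execution-based score: 1.0 means the two functions agreed
    on every supplied input, regardless of how their source text differs.
    """
    matches = sum(outcome(candidate, a) == outcome(reference, a) for a in inputs)
    return matches / len(inputs)


# Two textually dissimilar but functionally equivalent implementations.
def total_loop(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc


def total_builtin(xs):
    return sum(xs)


score = functional_agreement(total_loop, total_builtin, [([1, 2, 3],), ([],), ([5],)])
```

Agreement on a finite input set is only evidence of equivalence, not proof, so in practice the inputs would need to cover edge cases.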

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

Ramchand Kumaresan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12765v1 Announce Type: new Abstract: Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The specification states which data-type rows are supported, never whether they are hardware-accelerated, where the operation physically executes, what its accumulator width is, or how it partitions matrix fragments across threads. We present Rigel, an empirical characterization of this path on a single Apple M4 Max (a pre-neural-accelerator generation). Using a checksum-gated, provenance-tracked microbenchmark harness, Rigel recovers eleven facts the v4.1 specification hides or contradicts. The headline finding: the Metal 4.1 fp8 (E4M3) matmul2d is emulated, not accelerated: it sustains 0.94x the throughput of fp16 despite reading half the operand bytes, so on M4 it is a memory-footprint feature, not a performance feature. We further show, via a three-signal triangulation (throughput ceiling, comparison against simdgroup_matrix, and per-rail power attribution), that matmul2d executes entirely on the GPU shader cores with no dedicated matrix datapath and no evidence of Apple Neural Engine routing; that it accumulates in >=fp32; and we reconstruct the opaque 8x8 cooperative_tensor fragment layout Apple documents nowhere. Acting on the characterization, a hand-fused GEMM + bias + GELU kernel beats the decomposed path by +6.5-12.9% in the cache-resident regime. All findings are reproducible from committed MIT-licensed code and per-cell CSVs.

Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

Sarah Elshabrawy, Rahul K. Dass, Ashok K. Goel — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12767v1 Announce Type: new Abstract: Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded in the instructional knowledge the system is expected to use. We study how TMK-based question generation strategies affect dataset quality for procedural and multi-hop reasoning. We compare three strategies: strict generation from Task-Method-Knowledge (TMK) models, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. To evaluate generated items, we introduce a grounding validation framework based on closed-set evidence units extracted from TMK models. The framework measures whether answers are supported by the underlying representation, whether questions are self-contained, and whether they target multi-hop procedural reasoning. Across 23 instructional topics and 690 generated question-answer pairs, strict TMK generation achieves the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produces more learner-like questions but more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding. These results show that procedural richness and natural phrasing do not guarantee representational grounding, motivating explicit representation-aware validation for evaluation datasets in AI-supported learning.

Patching Control Lyapunov Barrier Functions for Temporal Logic Specifications with Bounded Controls

Ruikun Zhou, Yating Yuan, Haocheng Chang, Yinan Li, Yiming Meng — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12768v1 Announce Type: new Abstract: We propose an abstraction-free framework for controller synthesis for continuous-time dynamical systems subject to Linear Temporal Logic (LTL) specifications and bounded control inputs. The proposed method combines the sequential decomposition of LTL tasks with the use of formally certified Control Lyapunov-Barrier Functions (CLBFs). By formulating local specifications as a sequence of safe-stabilization problems, we systematically approximate and patch the winning sets of the decomposed subtasks. The satisfaction of these local constraints is guaranteed by the offline-computed level sets of the CLBFs. As a result, our framework yields formally verified switching feedback controllers that enable efficient online planning and dynamic re-planning. This ensures robust continuous specification satisfaction in the presence of state perturbations, avoiding the explicit state-space abstractions commonly required in the literature. The approach is validated through numerical simulations and a hardware demonstration on a Crazyflie quadrotor.

Agentic MPC for Semantic Control System Resynthesis

Yuya Miyaoka, Masaki Inoue — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12774v1 Announce Type: new Abstract: While MPC effectively handles structured, diverse, and low-level specifications, it lacks the capability to dynamically incorporate high-level contextual information such as social norms, user intent, or natural language instructions. To address this limitation, this manuscript introduces an agentic MPC framework that enables context-aware, semantically adaptive control synthesis by integrating with large language model-based agents. The agent interprets heterogeneous inputs, including natural language messages, environmental observations, and external knowledge, to resynthesize the control specifications. The effectiveness of the framework is demonstrated in an autonomous driving scenario, where the system aligns with personal preferences or responds to social situations such as emergency vehicle yielding.

ProPlay: Procedural World Models for Self-Evolving LLM Agents

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12780v1 Announce Type: new Abstract: Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure-level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low-level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task-specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. Our code has been released in https://github.com/antman9914/proplay.

A Tutorial on World Models and Physical AI

Il-Seok Oh — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12783v1 Announce Type: new Abstract: World modeling is emerging as a central principle for building intelligent systems capable of prediction, reasoning, and decision making. A central distinction can be drawn between explicit world models, which learn structured dynamics for rollout-based reasoning and planning, and implicit world models, which encode predictive structure within scalable learned representations. These complementary paradigms provide a foundation for physical AI in domains such as robotics and autonomous driving, enabling intelligence beyond reactive control under real-world constraints. Recent foundation models further suggest a pathway toward unified systems integrating perception, prediction, and action. Despite rapid progress, major challenges remain in hierarchical reasoning, long-horizon planning, and autonomous goal formation, which are critical for advancing toward artificial general intelligence. This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.

The No-show Paradox in Single Transferable Vote under One-dimensional Preferences

Farhad Mohsin — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12785v1 Announce Type: new Abstract: The group no-show paradox (GNSP) occurs when a group of agents abstaining from voting can make the new winner more preferred to them. Previous work has suggested that even for voting rules susceptible to this paradox, it is a rare occurrence in real elections and under various assumptions. However, we find that under one-dimensional preference models such as 1D-Euclidean, single-peaked, or single-crossing preferences, Single Transferable Vote (STV), a popular runoff rule, is highly vulnerable to GNSP. This is in stark contrast to Condorcet rules, another family of rules susceptible to GNSP, where the paradox cannot occur under these one-dimensional preferences. We theoretically identify tractable and prevalent sufficient conditions for GNSP to occur for STV under one-dimensional preference models. Through our theoretical results and experiments with synthetic preference profiles from these domains, we demonstrate that voters at the extremes of the 1D spectrum are particularly likely to cause GNSP by abstaining. Furthermore, the likelihood of occurrence increases substantially as the number of alternatives grows.

Orchestrating the Twin Transition in Multinational Corporations: Technology Roadmapping for Green and Digital Global Business Services

Han-Teng Liao, Karen Ang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12787v1 Announce Type: new Abstract: Global Business Services (GBS) have emerged as a "living laboratory" for the Twin Transition of Green and Digital Transformation, as multinational corporations (MNCs) face increasing pressure to harmonize digital efficiency with environmental stewardship. Aiming to derive a socio-technical framework, this paper synthesizes Technology Roadmapping (TRM) with the International Telecommunication Union (ITU) ICT-centric innovation ecosystem toolkit. A bibliometric analysis of research clusters reveals an evolutionary shift from basic process automation toward "Sustainable Intelligence," identifying the GBS unit as a central "operational airlock" that mediates between landscape pressures -- such as the EU's dual mandate and Carbon Border Adjustment Mechanisms -- and niche innovations in AI-native workflows. The study further maps these clusters onto a stakeholder engagement canvas, highlighting how resilient "Middle Power" hubs in Poland, Portugal, and Malaysia are bypassing the middle-income trap to provide a "third way" for global value chains amidst a bifurcated geopolitical cloud. The results offer a data-driven design approach for leaders and entrepreneurial support networks to orchestrate talent and supply chain flows, thereby enriching the conceptual understanding of Industry 5.0 and the role of GBS as a primary mechanism for navigating a volatile, multipolar digital economy.

To Share or Not to Share: Orchestrating Trustworthy Data in Global Value Chains

Han-Teng Liao, Chang-Yi Kao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12788v1 Announce Type: new Abstract: As the EU Carbon Border Adjustment Mechanism (CBAM) approaches, the global semiconductor value chain faces growing structural tensions between regulatory transparency and data sovereignty. This article proposes a RegTech reference architecture using the International Data Spaces (IDSA) framework to orchestrate trustworthy environmental telemetry across the semiconductor-petrochemical nexus. The framework distinguishes the mandatory CBAM requirements from voluntary Science Based Targets initiative (SBTi) frameworks, while addressing the additive complexities of the Safe-and-Sustainable-by-Design (SSbD) framework. Moving beyond standard linear technology stacks, we introduce a prospective roadmapping methodology that transforms upstream physical vulnerabilities into circular, negative feedback loops. Focusing on the Taipei and Penang technology corridor, the article details how sovereign data exchange enables Digital Product Passports (DPPs) to drive Global Business Services (GBSs) capability demands. Finally, we discuss the integration of Agentic AI for autonomous compliance and FinTech green financing, providing a scalable blueprint for global industrial clusters to achieve sovereign, sustainable, and transparent value chains.

How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

Chase M. Fensore, Kaustubh Dhole, Jason Fan, Eugene Agichtein, Joyce C. Ho — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12789v1 Announce Type: new Abstract: Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.
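
The abstract defines discriminative power as the standard deviation of generation quality across categories. A minimal sketch of that computation follows; the use of per-category means, the population (rather than sample) standard deviation, and the function names are assumptions, since the abstract does not specify them.

```python
from statistics import mean, pstdev


def discriminative_power(scores_by_category):
    """Standard deviation of mean generation quality across categories.

    scores_by_category maps a category name to a list of quality scores.
    Assumption: population standard deviation over per-category means.
    """
    category_means = [mean(scores) for scores in scores_by_category.values()]
    return pstdev(category_means)


def best_granularity(levels):
    """Return the granularity level whose categories are best separated.

    levels maps a level label (e.g. 2, 4, 8) to a scores_by_category dict.
    """
    return max(levels, key=lambda level: discriminative_power(levels[level]))
```

A usage example with made-up scores: `best_granularity({2: {"a": [0.5], "b": [0.5]}, 4: {"a": [0.2], "b": [0.8]}})` selects level `4`, because its category means differ while those at level `2` do not.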

GENIE: A Fine-Grained Measure for Novelty

Ramya Namuduri, Manya Wadhwa, Anshun Asher Zheng, Greg Durrett, Junyi Jessy Li — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12790v1 Announce Type: new Abstract: Large Language Models have consistently demonstrated a lack of creativity and diversity across tasks. Prior work has focused on addressing whether models are capable of generating creative outputs. Here, we aim to consider novelty and investigate what makes model-generated content novel or not novel in a task-specific manner. We propose a fine-grained evaluation metric GENIE to measure the novelty of responses along task-specific features with respect to a population of responses. We show that unlike GENIE, holistic metrics struggle to capture the high-dimensionality of novelty and do not provide insight on which properties they target. Finally, we use GENIE to measure the effectiveness of mitigation methods that address creativity to better understand where these methods can improve novelty.

The GIST 2064-Bus Test System: A Public-Data Synthetic Model of the Korean Power Grid

Yun-Su Kim — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12791v1 Announce Type: new Abstract: No model of the Korean transmission system at native resolution is publicly available, which makes reproducible research on one of the world's most distinctive grids difficult; the grid is an islanded interconnection with extreme separation between generation and the Seoul Metropolitan Area load center, low renewable penetration, and heavy reliance on extra-high-voltage (EHV) transmission. Working strictly from public data, and for research purposes only, we present the GIST 2064-bus test system, a geographically grounded synthetic model of the Korean grid. Unlike fully synthetic cases, whose lines match no real corridor, and aggregated public Korean models, it derives its 345 and 154 kV layout from the OpenStreetMap/OpenInfraMap power layer by a multi-source shortest-path reassembly of overhead-line geometry, gap-fills unreachable substations with a geographic minimum-spanning-tree backbone, and calibrates the aggregate circuit length to published national statistics (108/107/97% at 765/345/154 kV). The model spans 2064 buses, 512 generation and renewable sources (144 GW), 3044 AC line circuits plus high-voltage direct-current (HVDC) equivalents, 3073 transformers, and reactive resources (shunts and 11 FACTS devices), serialized to a PSS/E-compatible CSV schema. A general-purpose pandapower Newton-Raphson solver (with generator reactive limit enforcement, a secant-gain remote voltage-control loop, tap-changer and switched-shunt fixed-point control, and zero-impedance regularization) solves an 85 GW high-demand snapshot to a single connected, converged operating point (mean voltage 0.996 pu, 2.3% losses, no undervoltage buses), structurally consistent with the independent public KPG-193 model. The dataset, maps, and tooling are released as a citable platform for power flow, planning, and decarbonization studies.

Semantic Identification of IoT Devices from Behavioral Primitives

Samuel Witt, Hassan Habibi Gharakheili — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12793v1 Announce Type: new Abstract: Accurate identification of IoT devices is important for security management and policy enforcement. Existing approaches typically learn device signatures from packets or flow records. These methods operate on low-level communication observations whose traffic patterns may vary across deployments, software versions, and user interactions. This paper studies device identification using Manufacturer Usage Description (MUD) profiles. MUD profiles describe device behavior using Access Control Entries (ACEs), where each ACE represents a behavioral primitive consisting of protocol, endpoint, direction, and port semantics derived from device communication policy. Our contributions are threefold. First, using 28 publicly available MUD profiles containing 1,023 ACE instances, we construct ACE-level semantic representations from compact behavioral text and analyze their geometric properties. ACE-level representations preserve device-level behavioral distinctions more effectively than whole-profile embeddings and remain effective after whitening calibration. Second, we evaluate semantic ACE matching under controlled runtime variations, including unseen ACEs, drifted hostnames, and partial runtime observation. Exact ACE matching performs well when the overlap with the canonical MUD profile remains high, but degrades sharply when the overlap becomes sparse or disappears. In contrast, semantic ACE matching preserves useful identification evidence across these conditions. Third, we evaluate the same approach on real IoT traffic traces comprising more than 800,000 observed flows. Exact overlap remains the strongest signal when stable overlap exists, while semantic ACE matching provides stronger identification evidence during the early stages of observation, frequently retains the correct device among the highest-ranked candidates, and remains effective under sparse-overlap runtime traffic.

The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

Md Jafrin Hossain, Mohammad Arif Hossain, Weiqi Liu, Nirwan Ansari — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12797v1 Announce Type: new Abstract: Agentic large language model systems that autonomously invoke tools, maintain persistent memory, and execute multi-step plans are increasingly deployed in public-facing domains, including government services, healthcare triage, and financial advising. We ask whether the frameworks used to build these systems provide architectural-level structural safety guarantees. Applying six containment principles derived from a compositional model of agentic architectures, we audit three dominant frameworks (LangChain, AutoGPT, and OpenAI Agents SDK) and find no native compliance in any of them. Memory integrity, a defense against one of the most prevalent vulnerability classes, is not observed in any of the three evaluated frameworks. We validate these findings empirically: in a simulated government benefits agent built on LangChain, a single memory-poisoning write induces persistent targeted corruption across all tested seeds and backends, increasing the wrongful denial rate for targeted applicants to 88.9%. Under a complex five-factor policy, the same attack preserves aggregate accuracy while increasing targeted wrongful denials by 3.5x, rendering the corruption difficult to detect through standard monitoring. We then introduce two lightweight containment mechanisms: a memory integrity validator and a policy gate, which eliminate both attack vectors with sub-millisecond overhead (<0.2ms per call). We conclude that the current agentic framework ecosystem may not yet meet secure-by-default expectations for public-facing deployments and outline priority architectural interventions to enable trustworthy deployment in high-stakes, socially impactful applications.

Pushing the Frontiers for Floating Solar Photovoltaics -- The Case for South America

Soham Ghosh, Anik Goswami, Krishna Kumba — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12798v1 Announce Type: new Abstract: Floating solar photovoltaic (FSPV) systems provide a land-efficient pathway to expand clean electricity access in energy-poor regions. South America has among the highest global FSPV potential (approximately 38.26 TWh per million acres of water surface), yet deployment remains limited. This study presents a techno-socio-economic framework to assess FSPV for energy access, water security, and grid flexibility, with case studies in Nicaragua, Honduras, and Guyana. Estimated yields for 50 to 398 MW systems exceed 1,500 to 2,000 kWh per kW annually with capacity factors above 20 percent. At El Cajon, FSPV could significantly reduce emissions relative to fossil generation. Results show competitive costs with land-based PV when accounting for avoided land use, shared hydropower infrastructure, and water benefits. The framework also highlights co-location with hydropower and AI data centers, offering a scalable model for deployment in underserved regions.

A variable time-step, second-order, and MBP-preserving linear stabilized scheme for the time-fractional Allen-Cahn equation

Bingyin Zhang, Ao Zhang, Hongfei Fu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12799v1 Announce Type: new Abstract: In this paper, we present a second-order linear scheme based on the variable-step Alikhanov formula and central difference discretization for the time-fractional Allen-Cahn equation. The nonlinear potential is treated explicitly via a second-order extrapolation with preprocessing, which enables the discrete maximum-bound principle (MBP) to be preserved through an appropriate stabilization technique. Moreover, by developing a discrete fractional Grönwall inequality together with the uniform boundedness of numerical solutions guaranteed by the MBP, we establish an $\alpha$-robust and optimal second-order maximum-norm error estimate under an initial weak singularity assumption. In addition, energy stability is proved in the sense that the discrete original energy is uniformly bounded by the initial energy plus a high-order spatiotemporal correction term. Finally, extensive numerical experiments are presented to demonstrate the effectiveness of the proposed scheme.

Massively parallel flow routing and drainage area determination

Wolfgang Bangerth — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12800v1 Announce Type: new Abstract: Digital elevation models (DEMs) have reached resolutions and sizes that only parallel computers can efficiently process. One important application of DEMs is predicting how much water flows where, the so-called "flow routing problem" (a variation of which is the problem of determining the drainage area upstream of a point in a DEM). The traditional algorithm for flow routing is sequential, and attempts to parallelize this method have so far only been moderately successful. Herein, we build on earlier work in Richardson et al. (2014) and propose an algorithm and several variations that can efficiently solve the flow routing problem on very large models with very large numbers of parallel processes. For the largest model we use, with 1.88 billion points, the best algorithm herein can route water in 4.0 seconds on 12,288 processes of a computer cluster.
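
For readers unfamiliar with the problem, one common sequential formulation of drainage-area computation can be sketched as follows. This is an illustrative assumption, not the algorithm of the paper or of Richardson et al. (2014): it uses single-receiver steepest-descent routing over the eight neighbors of each cell (often called D8), unit cell area, and the hypothetical function name `drainage_area`.

```python
import math


def drainage_area(dem):
    """Drainage area per cell of a rectangular DEM (list of rows of elevations).

    Each cell contributes one unit of area and passes its accumulated area to
    its steepest-descent neighbor among the 8 surrounding cells. Cells with no
    lower neighbor (pits or outlets) keep what they have collected.
    """
    rows, cols = len(dem), len(dem[0])
    area = [[1.0] * cols for _ in range(rows)]
    # Visiting cells from highest to lowest guarantees that every upstream
    # donor has been processed before its receiver passes area onward.
    cells = sorted(
        ((dem[r][c], r, c) for r in range(rows) for c in range(cols)),
        reverse=True,
    )
    for z, r, c in cells:
        best_slope, receiver = 0.0, None
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    slope = (z - dem[rr][cc]) / math.hypot(dr, dc)
                    if slope > best_slope:
                        best_slope, receiver = slope, (rr, cc)
        if receiver is not None:
            area[receiver[0]][receiver[1]] += area[r][c]
    return area
```

The global sort and the strict high-to-low visiting order are what make this formulation inherently sequential: a cell cannot be finalized until everything upstream of it has been.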

AiAWE: An Open-Source LLM Automated Writing Evaluation System Using LoRA-Adapted Instruction-Tuned Models

John Maurice Gayed — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12801v1 Announce Type: new Abstract: This study presents AiAWE, an open-source automated writing evaluation system that scores argumentative essays using a LoRA-adapted instruction-tuned large language model (Gemma-3-27B-it). Using a proprietary Educational Testing Service (ETS) dataset of 480 TOEFL Independent Writing essays, we fine-tune Gemma-3-27B and LLaMA-3.3-70B under identical LoRA configurations on a 120-essay training subset and evaluate on the remaining 360 essays under identical inference quantization. The fine-tuned Gemma model achieves a root mean square error of 0.474, a quadratic weighted kappa of 0.828, and an agreement rate of 90.56% within +/- 0.5 of the human score, outperforming both the larger LLaMA-3.3-70B model and the fine-tuned GPT-3.5 baseline reported in prior work on the same dataset. Three findings are of broader interest: open-weight LLMs can match or exceed proprietary fine-tuning for rubric-aligned scoring; model scale is not a reliable predictor of downstream performance under LoRA adaptation; and identical LoRA hyperparameters produce qualitatively different adaptation behaviors across architectures. The production system runs on a consumer-grade server and is publicly accessible at https://app.awade.gec.waseda.ac.jp. LoRA adapters, application code, and fine-tuning YAMLs are publicly available through their respective repositories.
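
The three reported metrics can be computed as in the sketch below. It uses the standard definition of quadratic weighted kappa over integer category indices. Mapping essay scores to such indices (for example, half-point scores to consecutive integers) is an assumption left to the caller, and the function names are illustrative rather than taken from the AiAWE code.

```python
import math


def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """Quadratic weighted kappa for two lists of category indices in [0, n_categories)."""
    n = len(rater_a)
    observed = [[0] * n_categories for _ in range(n_categories)]
    for i, j in zip(rater_a, rater_b):
        observed[i][j] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_categories)) for j in range(n_categories)]
    weighted_observed = weighted_expected = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    return 1.0 - weighted_observed / weighted_expected


def rmse(predicted, human):
    """Root mean square error between predicted and human scores."""
    return math.sqrt(sum((p - h) ** 2 for p, h in zip(predicted, human)) / len(predicted))


def agreement_within(predicted, human, tolerance=0.5):
    """Fraction of predictions within +/- tolerance of the human score."""
    return sum(abs(p - h) <= tolerance for p, h in zip(predicted, human)) / len(predicted)
```

Note that the kappa is undefined when the weighted expected disagreement is zero, for instance when both raters assign a single identical category to every essay; this sketch does not guard against that case.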

Local Consistency and Higher-Order Structure of Spherical Interpolation

Shingyu Leung — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12802v1 Announce Type: new Abstract: Spherical Interpolation of orDER $n$ (SIDER-$n$) is a recursive high-order interpolation construction for data on the unit sphere $\mathbb{S}^2$, built from repeated spherical linear interpolation (SLERP). This paper gives a local consistency analysis of SIDER for smooth spherical curves sampled at equally spaced parameter values. The analysis is carried out in geodesic normal coordinates, which allows the SIDER recursion to be compared with classical Neville interpolation while retaining the curvature-dependent corrections introduced by SLERP. We first derive local expansions of SLERP and show that SIDER2 has third-order accuracy; its leading error has the same shifted nodal structure as Euclidean quadratic interpolation. We then prove that the adjacent SIDER2 errors entering SIDER3 have a common leading coefficient, so that the SIDER3 recurrence cancels the cubic term and yields fourth-order accuracy. Carrying the expansion one order further gives the corresponding coefficient compatibility for SIDER3 and proves fifth-order accuracy of SIDER4. Finally, we introduce a degree-filtered formal expansion framework for the general SIDER recursion. This framework proves that, for each fixed $n$, SIDER-$n$ preserves the required polynomial degree structure in the normalized stencil variable. Together with the interpolation conditions at the $n+1$ nodes, this yields the local consistency estimate $d_{\mathbb{S}^2}\bigl(\gamma(\theta h),P_i^{[n]}(\theta;h)\bigr)=O(h^{n+1})$ under the stated smoothness and small-stencil assumptions.
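
The abstract describes SIDER as built from repeated SLERP and compares its recursion with classical Neville interpolation. The sketch below shows SLERP together with a Neville-style recursion in which each Euclidean linear combination is replaced by a SLERP. Whether this matches the paper's exact SIDER-$n$ recursion is an assumption; the function names are hypothetical.

```python
import math


def slerp(p, q, s):
    """Spherical linear interpolation between unit vectors p and q.

    s = 0 returns p and s = 1 returns q; values outside [0, 1] extrapolate
    along the same great circle. Antipodal inputs are not supported.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    omega = math.acos(dot)
    if omega < 1e-12:  # p and q coincide numerically
        return tuple(p)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-12:
        raise ValueError("antipodal points: the great circle is not unique")
    a = math.sin((1.0 - s) * omega) / sin_omega
    b = math.sin(s * omega) / sin_omega
    return tuple(a * x + b * y for x, y in zip(p, q))


def neville_slerp(points, nodes, t):
    """Neville-style recursion on the sphere (illustrative assumption).

    points are unit vectors sampled at the parameter values in nodes; the
    result interpolates every (node, point) pair and stays on the unit sphere.
    """
    level = [tuple(p) for p in points]
    n = len(level)
    for k in range(1, n):
        level = [
            slerp(level[i], level[i + 1], (t - nodes[i]) / (nodes[i + k] - nodes[i]))
            for i in range(n - k)
        ]
    return level[0]
```

With two points the recursion reduces to plain SLERP, and with three equally spaced points it gives a quadratic-type construction analogous to the SIDER2 case discussed in the abstract.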

Homotopy-Based Re-Initialization for Switched DAEs in Power System Transient Simulation

Ahmad Ali, Hantao Cui — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12803v1 Announce Type: new Abstract: The simultaneous solution of switched differential-algebraic equations (DAEs) in power system transient simulation may suffer convergence loss following discontinuous events. This difficulty is typically interpreted as a poor post-event initialization problem. This letter presents a geometric framework that explains the underlying convergence mechanism and clarifies why standard convergence-restoration methods may fail at discontinuities. Based on this interpretation, a homotopy-continuation based globalized re-initialization scheme is developed to restore convergence. The proposed method is validated through numerical simulations of representative discontinuities in power system transient simulation. Results show that in the cases where direct post-event solution fails, the proposed scheme can reliably recover convergence.

Exploring How Agent Voice Accents Shape Human-AI Collaboration in K-12 Group Learning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12805v1 Announce Type: new Abstract: Collaboration is widely recognized as a cornerstone of 21st-century education, yet teachers still encounter persistent challenges in fostering productive peer interaction. LLM conversational peer agents introduce new possibilities for mediating in-person group work, raising questions about how persona design, particularly their voice characteristics, shapes learners' perceptions, trust, and interactional dynamics. While prior work has examined agent accent effects in one-to-one settings, little is known about how these effects manifest in groups. We conducted a between-subjects mixed-methods study with 33 teachers examining how a GenAI voice agent with different accents (British, Indian, and African American) influenced collaboration and agent perception. Across surveys, group interaction analyses, and artifacts, we find that accent shaped participants' mental models and the roles the agent assumed in group interaction. The British-accented agent was largely treated as a tool and engaged in detached, utility-based ways, whereas Indian- and African American-accented agents were more readily anthropomorphized and integrated as peers. These role expectations influenced trust, engagement, and reliance over time. This work advances understanding of how GenAI's sociolinguistic design features shape group dynamics in CSCL, with implications for designing culturally inclusive AI partners in group learning.

Detect, Remask, Repair: Diffusion Editing for Faithful Summarization of Evolving Contexts

Hao Zou, Zachary Horvitz, Chandhru Karthick, Zhou Yu, Kathleen McKeown — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12807v1 Announce Type: new Abstract: Summaries of real-world events can become outdated as contexts evolve and new information arrives. A common response is to generate a new summary from the updated context, but full regeneration discards the previous draft, can obscure what changed, and may be unnecessary when only a few claims are unsupported. We study localized faithfulness repair: updating outdated spans in an existing summary while preserving supported content. We propose DETECT-REMASK-REPAIR, a diffusion-based framework that identifies, remasks, and repairs outdated regions with masked diffusion language models. To evaluate evolving-context summarization, we introduce StreamSum, a benchmark of synthetic event timelines. Experiments on DialogSum and StreamSum show that localized diffusion repair provides a controllable alternative to full rewriting: faithfulness-steered repair improves early drafts, one-step repair reduces repair cost to under half a second, and the framework enables faithfulness-speed-preservation tradeoffs across datasets. We also find that the framework can provide a post-hoc correction step that improves faithfulness for autoregressive systems.

SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

Yash Vardhan Tomar, Dheeraj Peddireddy, Vaneet Aggarwal — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12808v1 Announce Type: new Abstract: Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next experiment is itself a computation. Bayesian design rules are recomputed after every posterior update, and that step can take seconds. Across hundreds of shots, those seconds become a significant wall-clock cost for adaptivity. We introduce SymQNet, an amortized reinforcement-learning approach for low-latency adaptive Hamiltonian learning. SymQNet learns a posterior-conditioned acquisition policy offline, then uses a fast policy forward pass online while retaining Bayesian posterior feedback. On transverse-field Ising benchmarks, SymQNet substantially reduces acquisition latency relative to bounded Fisher-information search and bounded two-step Bayesian active learning by disagreement (BALD). At five qubits, it reduces acquisition-only decision latency by $47.1\times$ and $72.6\times$ relative to these online baselines; at twelve qubits, full simulated steps take $1.02$ s for SymQNet versus $13.27$ s for bounded two-step BALD. Overall, we show that learned acquisition can make adaptive Hamiltonian learning practical for repeated low-latency workloads.

MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12809v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) are trained on massive multimodal data, making data unlearning increasingly important as data owners may request the removal of specific content. In practice, these requests often arrive sequentially over time, giving rise to the challenging problem of MLLM Lifelong Unlearning. However, most existing benchmarks are limited in scale and scope, failing to capture the complexities of MLLM lifelong unlearning. To fill this gap, we introduce MLUBench, a large-scale and comprehensive benchmark featuring 127 entities across 9 classes under lifelong unlearning requests. We perform extensive experiments using MLUBench and reveal that existing unlearning methods suffer from severe, cumulative degradation. More critically, we further identify the unique challenge of this problem: unlike in unimodal models, MLLM lifelong unlearning is constrained by the need to preserve multimodal alignment. Continually unlearning from one modality could degrade the entire model. To alleviate this challenge, we propose LUMoE, an effective method. Experiments demonstrate that LUMoE significantly mitigates the degradation problem faced by baselines. The source code and the MLUBench dataset are available at https://github.com/lihe-maxsize/Lifelong_Unlearning_main.

Vocal Identity Under Siege by AI Voice Cloning Technologies

Jyh-An Lee, Xuan Sun — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12812v1 Announce Type: new Abstract: The advent of sophisticated AI-driven voice cloning has brought to the fore critical legal and ethical challenges regarding the protection of vocal identity. Prompted by recent controversies - including the striking resemblance between OpenAI's ChatGPT-4o voice and that of Scarlett Johansson - this article examines how generative AI technologies undermine the unique value of the human voice and further complicate the legal questions surrounding personality rights. Through a comparative analysis, the paper evaluates three principal legal frameworks: the right of publicity, personality rights, and the personal data protection right. Each framework - rooted in different legal traditions - offers distinct strengths and limitations in addressing the threats posed by AI-generated voice cloning. By analysing these doctrines' scope, remedies, and posthumous protections, the study offers a foundation for understanding how existing legal approaches may be applied to the evolving challenges of vocal identity in the era of generative AI.

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

Xiao Ren, Yuhui Yang, Zongbiao Weng, Zhijie Liu, He Kong — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12814v1 Announce Type: new Abstract: Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address the above issues, we propose Stubborn, a streamlined and unified reinforcement learning framework to achieve robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that encourages the policy to explore fall-recovery behaviors under varying failure modes. Third, we propose a probabilistic termination and tracking-error-driven sampling strategy that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with SOTA methods and ablation studies show that Stubborn achieves competitive performance, and that the proposed probabilistic termination mechanism and adaptive sampling strategy contribute to the performance and robustness gains. For real-world demonstrations, please refer to https://aislab-sustech.github.io/Stubborn/.
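
A minimal sketch of what a Bernoulli-based probabilistic termination rule could look like, to make the contrast with hard termination concrete. The function name, the error threshold, and the parameter `p_terminate` are hypothetical and are not taken from the paper; the abstract does not say how the termination probability is chosen.

```python
import random


def should_terminate(tracking_error, threshold, p_terminate, rng):
    """Decide whether to end the training episode at this step.

    Hypothetical rule: while the tracking error is within the threshold,
    the episode always continues. Once it is exceeded, the episode ends
    only with probability p_terminate (one Bernoulli draw), so some
    rollouts continue in failed states and can explore recovery.
    """
    if tracking_error <= threshold:
        return False
    return rng.random() < p_terminate


rng = random.Random(0)
# Out of 1000 steps in a failed state, count how many end the episode
# when the (assumed) termination probability is 0.3.
ended = sum(should_terminate(1.0, 0.5, 0.3, rng) for _ in range(1000))
print(ended)
```

Setting `p_terminate` to 1.0 recovers the immediate termination that the abstract criticizes, while 0.0 never terminates on failure.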

Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12817v1 Announce Type: new Abstract: Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.

Localizing Anchoring Pathways in Language Models

Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12818v1 Announce Type: new Abstract: Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B--8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
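
The logit-difference metric described above can be sketched as follows. This is an illustration under stated assumptions: the sign convention (correct minus anchor), the function name, and the example logits are all hypothetical and do not come from the paper.

```python
def anchoring_logit_diff(option_logits, correct_option, anchor_option):
    """Logit of the correct option minus logit of the anchor-matching option.

    Under this assumed sign convention, a negative value means the model
    prefers the option that corresponds to the anchor.
    """
    return option_logits[correct_option] - option_logits[anchor_option]


# Made-up logits over four shared answer options.
logits = {"A": 2.1, "B": 0.4, "C": 3.0, "D": -1.2}
diff = anchoring_logit_diff(logits, correct_option="A", anchor_option="C")
print(round(diff, 2))
```

A single scalar of this kind is convenient for attribution methods, which need a differentiable target to trace back through the network.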

GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12821v1 Announce Type: new Abstract: Environmental scientists spend disproportionate effort on data wrangling rather than analysis, and AI agents that automate geospatial workflows remain unvalidated: no benchmark evaluates agents operating through structured tool calling against real APIs. We introduce the GeoNatureAgent Benchmark, the first benchmark for environmental analysis agents that operate via structured tool calls to a production-style geospatial API. It comprises 93 tasks across 18 categories, covering municipality analysis, multi-turn conversation, spatial reasoning, cross-indicator synthesis, error handling and recovery, ranking, comparison, multilingual understanding, habitat analysis, and task rejection. Tasks are evaluated against an open, self-hostable API serving three environmental indicators across Spain and Portugal via sixteen tools. We evaluate seven LLMs (Claude Sonnet 4, DeepSeek V3.2, GLM-5, Gemini 2.5 Pro, Qwen3-235B, GPT-OSS-120B, Llama 4 Scout) under three temperature-1.0 seeds, reporting capability and per-case cost as orthogonal axes. We find: (1) Claude Sonnet 4 leads at 60.8% +/- 0.8%, followed by DeepSeek V3.2 at 56.3% +/- 3.1%, with no other model above 51%; (2) the cost-accuracy Pareto frontier is occupied mostly by open-weight models, with DeepSeek V3.2 offering 93% of Claude's capability at 11x lower cost ($0.011/case); (3) comparison tasks remain universally unsolved (0% on close-value comparisons), exposing systematic reasoning limits; and (4) structured tool calling against a real API is more discriminative than general-purpose GIS benchmarks, with accuracies 25-35 points lower. We further show extensibility by integrating BigEarthNet V2 land cover for Portugal alongside Spanish CO2 and erosion indicators. The benchmark, harness, and self-hostable API are publicly available.

DIMOS: Disentangling Instance-level Moving Object Segmentation

Hongxiang Huang, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Zeke Xie, Bojun Cheng — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12826v1 Announce Type: new Abstract: Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. Experimental results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.

Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics

Rasul Khanbayov, Hasan Kurban — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12828v1 Announce Type: new Abstract: Do research topics in artificial intelligence grow gradually, or do they advance through abrupt, detectable jumps? Analyzing 80,814 accepted main-track papers from five premier AI conferences (ACL, CVPR, ICLR, ICML, NeurIPS) spanning 2017 to 2025, we show that major AI topics advance through topical phase transitions: remaining marginal for years, then surging across venues within one to three years. Large language models became the dominant cross-venue topic by 2025, diffusion models rose with comparable abruptness, and language-model methods crossed into computer vision via vision-language models, whereas reinforcement learning compounded smoothly, distinguishing genuine phase transitions from ordinary growth. This structure is our primary contribution: a large-scale, cross-venue characterization of how AI research reorganizes. We then ask whether a transition leaves a detectable footprint before it peaks. We define an early-warning signature, four publication-dynamics criteria frozen on 2017-2021 data, and evaluate it out of sample on 2023-2025 transitions, obtaining a precision of 27% and recall of 63% against a 13.5% base rate. Applied to 2025 data, the signature flags reasoning and test-time compute, agentic AI, multimodal LLMs, retrieval-augmented generation, and world models as topics to monitor over 2026-2028. The source code is also publicly available on GitHub at https://github.com/KurbanIntelligenceLab/ai-phase-transitions.
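
To put the reported precision in context, the short calculation below compares it with the base rate and combines it with recall. Only the three percentages come from the abstract; the helper names are illustrative, and lift and F1 are derived here rather than reported by the authors.

```python
def lift(precision, base_rate):
    """How many times better than flagging topics at random."""
    return precision / base_rate


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract: precision 27%, recall 63%, base rate 13.5%.
print(round(lift(0.27, 0.135), 2))
print(round(f1(0.27, 0.63), 3))
```

A precision of 27% against a 13.5% base rate means a flagged topic is twice as likely to be a genuine transition as a randomly chosen one.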

Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

Changye Li, Meng Lu, Yi Wu, Ligeng Zhu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12830v1 Announce Type: new Abstract: While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.

A Quaternion--BCH Framework for the Local Accuracy of SIDER Interpolation

Shingyu Leung — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12833v1 Announce Type: new Abstract: Spherical Interpolation of orDER $n$ (SIDER-$n$) is a recursive high-order interpolation method for data on the unit sphere $\mathbb{S}^2$, built from repeated spherical linear interpolation (SLERP). This paper develops a quaternion--Lie algebra framework for proving the local consistency of SIDER for smooth spherical curves sampled at equally spaced parameter values. Points on $\mathbb{S}^2$ are represented as pure unit quaternions, and interpolation errors are measured in fixed-base quaternion logarithmic coordinates. In this setting, each SLERP operation admits an exact Baker--Campbell--Hausdorff (BCH) representation, which converts the geometric interpolation problem into an algebraic problem involving filtered Lie-polynomial expansions. The BCH expansion shows that SLERP is affine to leading order, has no quadratic correction, and has a first nonlinear correction that is cubic and commutator-valued. Using this structure, we prove that SIDER-$2$ has a third-order divided-error form with the same leading nodal factor as ordinary quadratic interpolation. We then show that the recursive SIDER step raises the order by one: the affine part gives the Neville-type finite-difference cancellation, while the nonlinear BCH remainder preserves the sharp filtered degree structure after the nodal factor is removed. Consequently, for every fixed $n\geq2$, $d_{\mathbb{S}^2}\bigl(\gamma(\theta h),P_i^{[n]}(\theta;h)\bigr) = O(h^{n+1})$ under the stated smoothness and small-stencil assumptions. The proof also identifies the shift-invariance of the leading divided-error coefficient as the algebraic compatibility condition underlying the SIDER recurrence.
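
For readers unfamiliar with the building block, here is a minimal sketch of standard SLERP between two unit vectors on the sphere. It shows only the textbook formula; the SIDER recursion itself is not reproduced because the abstract does not specify it, and the function name and small-angle tolerance are choices made for this example.

```python
import math


def slerp(p, q, t):
    """Spherical linear interpolation between unit vectors p and q.

    Uses the textbook formula
        sin((1 - t) * omega) / sin(omega) * p + sin(t * omega) / sin(omega) * q,
    where omega is the angle between p and q. Assumes p and q are unit
    vectors and not antipodal (sin(omega) would vanish).
    """
    dot = sum(a * b for a, b in zip(p, q))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    omega = math.acos(dot)
    if omega < 1e-12:  # points coincide up to rounding
        return tuple(p)
    s = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return tuple(w0 * a + w1 * b for a, b in zip(p, q))


# Midpoint of the quarter arc from the x-axis to the y-axis.
mid = slerp((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 0.5)
print(mid)
```

The result stays on the unit sphere for every `t`, which is the property that repeated SLERP constructions rely on.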

Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement

Woong Shin, Craig A. Bridges, Marshall T. McDonnell, Rafael Ferreira da Silva — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12834v1 Announce Type: new Abstract: As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning, reinforcement learning, and prompt-and-go, bury the scientist's judgment. We propose treating agent construction as a workflow stage and introduce AgentBuild, which builds a scientific agent from a contract the scientist authors. The contract is a version-controlled rubric, a difficulty-graded curriculum, and a curated external knowledge base. A rubric-driven judge gates a meta-optimizer coding agent that edits the agent within a declared boundary, so the build compiles the agent, not the scientist's judgment. We instantiate this for Rietveld refinement of X-ray diffraction data through GSAS-II behind MCP and A2A, where a blank-harness construction run progresses through a lithium lanthanum zirconium oxide (LLZO) signal-to-noise ladder, reaches the 4 hour scan as a frontier case, and exposes the workflow-scope limits that remain. The same rubric that rewards credible fits also scores trajectory scope, making the frontier a contract failure rather than a pattern-fitting failure. As base models evolve, re-running AgentBuild is a re-tune, not a rebuild, and the scientist's authored contract remains the durable asset.

The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale

Quanyan Zhu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12835v1 Announce Type: new Abstract: The rapid emergence of autonomous AI agents is transforming artificial intelligence from isolated model inference into distributed systems of reasoning, communication, and action. This paper develops the vision of the Internet of Agentic AI (IoAI): an open ecosystem in which heterogeneous agents discover one another, negotiate responsibilities, exchange context, invoke tools, and execute workflows across cloud, edge, device, organizational, and cyber-physical environments. We synthesize foundations from single-agent agentic AI, multi-agent systems, distributed computing, communication networks, game theory, and security engineering to characterize the architectures and mechanisms required for scalable agent ecosystems. The paper examines agent deployment models, workflow lifecycles, communication protocols, interoperability layers, resource-management challenges, and trust architectures, with case studies in adaptive manufacturing and distributed operational coordination. The resulting framework highlights the central research challenges of controlled emergence, semantic interoperability, secure identity, incentive-compatible coordination, resource-aware orchestration, and governance for large-scale networks of autonomous agents.

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

Jiarui Zhao, Rongzhi Zhang, Lingchuan Liu, Hao Yang, Xunliang Cai, Xi Su — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12837v1 Announce Type: new Abstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.

The Capacity Region for Classes of Sum-Broadcast Channels

Amin Gohari, Yi Liu, Chandra Nair — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12839v1 Announce Type: new Abstract: We compute the capacity region of a sum of broadcast channels whose components are degraded, less-noisy, more-capable, deterministic, or semi-deterministic. We achieve this by showing that an auxiliary-receiver outer bound, previously introduced by some of the authors, matches Marton's inner bound. This result generalizes a previously known result for the sum of two reversely degraded broadcast channels due to El Gamal (1980). Moreover, we define a class of primary broadcast channels and show an analogous result for the sum of primary broadcast channels.

CLARITree: Cholesky and Lookahead Accelerations for Regression with Interpretable Piecewise Linear Trees

Yixiao Wang, Hayden McTavish, Varun Babbar, Margo Seltzer, Cynthia Rudin — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12840v1 Announce Type: new Abstract: Regression trees are among the most interpretable yet expressive model classes in machine learning. Historically, greedy induction has been the dominant approach for constructing well-performing regression trees. While optimal methods based on dynamic programming and branch-and-bound exist, they are computationally prohibitive for general linear regression trees, despite often achieving substantially better performance than greedy approaches. Recent work has shown that specialized lookahead strategies can dramatically improve runtime while maintaining near-optimal performance, primarily in classification settings. In this work, we develop a novel algorithm for near-optimal, sparse, piecewise linear regression trees that combines a lookahead-style search strategy with efficient rank-one Cholesky updates of the Gram matrix. We demonstrate, both theoretically and empirically, that our method achieves a favorable trade-off between computational efficiency, predictive accuracy, and sparsity, and scales significantly better than the current state of the art.
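
The rank-one Cholesky update mentioned above can be illustrated with the standard textbook algorithm. This is a sketch, not the authors' implementation: how CLARITree applies the update during search is not stated in the abstract, and the function names here are illustrative. The idea is that adding one sample `x` changes a Gram matrix `A` to `A + x x^T`, and the Cholesky factor can be refreshed in O(n^2) rather than refactorized in O(n^3).

```python
import math


def cholesky_rank_one_update(L, x):
    """Return the lower Cholesky factor of A + x x^T, given L with A = L L^T.

    L is a list of rows (lower triangular, positive diagonal) and x is a
    list of floats. The inputs are copied, not modified.
    """
    n = len(x)
    L = [row[:] for row in L]
    x = list(x)
    for k in range(n):
        r = math.hypot(L[k][k], x[k])
        c = r / L[k][k]
        s = x[k] / L[k][k]
        L[k][k] = r
        for i in range(k + 1, n):
            L[i][k] = (L[i][k] + s * x[i]) / c
            x[i] = c * x[i] - s * L[i][k]
    return L


def gram(L):
    """Compute L L^T for a square matrix stored as a list of rows."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Start from A = I (so L = I) and add the sample x = (1, 1).
L_new = cholesky_rank_one_update([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
print(gram(L_new))  # numerically close to [[2, 1], [1, 2]]
```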

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12841v1 Announce Type: new Abstract: Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearning method (ROME, MEMIT, etc.) targets AR transformers and either makes assumptions that fail under iterative denoising, or requires gradient updates whose backward-pass activations cost tens of GB of extra VRAM and which collapse MDLMs at standard learning rates. We introduce TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for MDLMs. It couples two components: a Temporal Indirect Effect (TIE) causal-tracing protocol that identifies, for each fact, the coordinate whose intervention most strongly drives the object prediction at later denoising steps; and a closed-form, low-rank residual edit memory that aggregates subject keys and target deltas across all forget facts and applies a single ridge-regularised update at that coordinate at every diffusion forward, with sparsification to limit utility spillover. Backbone weights stay frozen; only three hyperparameters (alpha, lambda, q) are tuned on a small validation split. On TOFU forget01 with TOFU-finetuned LLaDA-8B-Base, TimeROME-DLM cuts forget-set log-probability by roughly 83 nats. The same configuration transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B, and LLaDA-MoE-1.4B. It keeps retain-set log-probability nearly flat (within ~1 nat at the utility-safe operating point) across 50 sequentially inserted facts, delivers a four- to fourteen-fold wall-clock speedup with zero additional VRAM over the strongest converged training-time baseline, and scales sub-linearly to 400 facts. TimeROME-DLM closes the locate-then-edit gap between AR LLMs and MDLMs at a fraction of the computational cost.
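
For orientation, a generic ridge-regularised closed-form solve of the kind the abstract alludes to is shown below. This is the textbook formula, not the paper's exact update: the symbols are assumptions, with $K$ a matrix whose columns are subject keys, $D$ the matrix of corresponding target deltas, and $\lambda$ the ridge strength.

```latex
\Delta W^{\star}
  = \arg\min_{\Delta W}\; \lVert \Delta W K - D \rVert_F^2 + \lambda \lVert \Delta W \rVert_F^2
  = D K^{\top} \bigl( K K^{\top} + \lambda I \bigr)^{-1}
```

Setting the gradient $2(\Delta W K - D)K^{\top} + 2\lambda\,\Delta W$ to zero gives $\Delta W (K K^{\top} + \lambda I) = D K^{\top}$, which yields the expression above. Because the solve involves no backward pass, a scheme of this form is compatible with the gradient-free, frozen-backbone setting the abstract describes.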

Interpretable Factor Decomposition for Decision Intelligence in Large-Scale Financial Markets: Evidence from China's A-Share Market

Xiao Han, Yao Xiao, Zhen Zhang, Moxuan Zheng — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12843v1 Announce Type: new Abstract: We present an interpretable machine learning pipeline to decompose cross-sectional equity return predictability into auditable factor contributions. We apply an XGBoost model with TreeSHAP attribution and conduct stress testing on 3632 Chinese A-share stocks from 2009 until 2019. Using 60-month rolling windows over 55 months of out-of-sample data, XGBoost obtains a mean AUC of 0.547 and a long-short spread of +2.38%/month (Newey-West t = 5.94; annualized Sharpe 2.23) for the top vs bottom quintiles. This alpha is persistent after adjusting for the Carhart four-factor model (+2.31%/month; t = 7.48). SHAP decomposition indicates that behavioral signals (turnover and momentum) account for 58.2% of predictive attribution compared to 10.7% for valuation ratios, on average, across 55 industry groups. Ablation analysis serves to cross-validate this ranking and provides evidence that SHAP and ablation diverge in a manner that highlights feature substitutability structure that is largely invisible to either method used in isolation.

A Privacy-Preserving Framework Using Remote Data Science for Inter-Institutional Student Retention Prediction

John Fields, K M Sajjadul Islam, Ruchitha Thota, Victor Chen, Praveen Madiraju — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12845v1 Announce Type: new Abstract: This study explores privacy-preserving machine learning (PPML) techniques using the PySyft platform to enable collaborative prediction of student retention between institutions. We developed a remote data science (RDS) framework with a semi-air-gapped architecture consisting of high-side and low-side servers, allowing researchers from three universities to build predictive models on sensitive student data without direct data access. Using historical data from a small private university (N=720), we evaluated three synthetic data generation approaches and validated the framework through inter-institutional collaboration. The results demonstrate consistent classification performance across institutions (Macro F1: 0.690--0.695) while maintaining strict Family Educational Rights and Privacy Act (FERPA) compliance. We also propose Data-Type-Aware Templates, a novel synthetic data method that prioritizes privacy over distributional fidelity. Our findings confirm that RDS-based PPML is technically feasible for educational settings and offers a practical alternative to federated learning for small-scale inter-institutional collaborations. The code is available at https://github.com/jtfields/NAIRR240195-Privacy-Preserving-Machine-Learning.

Language-Guided Abstraction for Visual Reasoning

Xu-Jing Ye, Yuan-Gen Wang, Ruping Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12847v1 Announce Type: new Abstract: The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it enables models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methods are either purely language-based or vision-only (i.e., VARC). The former depend heavily on LLMs with billions of parameters. The latter often struggle to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module by feeding a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw LARC (a crowd-sourced language description dataset) can be substantially refined and structured to fit within the context length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, aiming to guide the training of the ARC model. Notably, the LUPI branch is used only during training and is discarded at inference, thereby yielding a lightweight model with a mere 18 million parameters. Extensive experiments demonstrate that our L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art methods. Ablation studies further confirm the contribution of the two new designs to the L-VARC framework. The code is available at https://github.com/GZHU-DVL/L-VARC.

(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable

Chen Zhu, Xiaolu Wang, Weilong Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12848v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specification choice, and drafting conclusions. We argue that the reliability of AI-assisted research depends not only on model capability, but also on how cognitive labour is structured between humans and machines. We study this problem through Human-in-the-Loop Economic Research (HLER), a decision architecture based on pre-commitment, decision sequencing, accountability, and attention allocation. In a pre-specified 2×4 factorial experiment with 280 complete research runs across four datasets, an unconstrained multi-agent baseline produced critical failures in 72% of runs. Using the same underlying model, the same agent decomposition, and identical prompts for the shared reasoning agents, HLER reduced the failure rate to 16% by imposing three architectural commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. Fisher's exact test rejects equality of failure rates at p<0.001. Reliability gains were largest on the least publicly represented dataset, a Qing-dynasty population register, consistent with a task-based production model with Fréchet-distributed output quality. An 80-run ablation suggests that deterministic computation and human gates contribute independently, with exploratory evidence of complementarity. We interpret HLER as a research harness rather than an autonomous AI scientist: it sharply reduces failures, makes residual weaknesses more visible, and prevents unreliable claims from being advanced as publication-ready outputs.
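
The abstract compares two failure rates with Fisher's exact test. As a rough illustration of how such a test works on a 2x2 table (arm by outcome), here is a minimal pure-Python sketch. The function name and the counts are hypothetical and are not the paper's data; the abstract does not report per-arm counts.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x first-column counts in row 1,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical counts for illustration only: rows are arms, columns are
# (failed runs, successful runs).
p_value = fisher_exact_two_sided(7, 3, 2, 8)
```

In practice a library routine such as `scipy.stats.fisher_exact` would normally be used instead of a hand-written version.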

SemanticXR: Low Power and Real-time Queryable Semantic Mapping with an Object-Level Device-Cloud Architecture

Rahul Singh, Devdeep Ray, Connor Smith, Sarita Adve — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12849v1 Announce Type: new Abstract: Semantic mapping is a core service that enables grounded interactions in emerging Extended Reality (XR) applications such as AI assistants and spatial object search. Deploying this capability on mobile XR devices requires a system that is open-vocabulary, real-time, and low-power. Existing approaches are compute-intensive and assume server-class resources. Cloud offloading offers a practical path, but no existing system splits semantic mapping across the device-cloud boundary or manages its communication, execution, and memory footprint. We present SemanticXR, the first device-cloud system for real-time, open-vocabulary semantic mapping and querying under XR power, bandwidth, and memory constraints. Our key insight is to elevate semantically identifiable objects to first-class units of communication, execution, and memory across the device and server. On the server, object-level parallelism and geometry downsampling improve mapping latency, while object-level depth-mapping co-design reduces upstream bandwidth. On the device, an object-level sparse local map with incremental updates and update prioritization enables network-robust querying with bounded memory and downstream bandwidth. Object-level configurable resource usage vs. quality trade-offs let applications and the system adapt mapping to application requirements and operating conditions, respectively. Against a device-cloud baseline with the same perception models, object-level organization improves server-side mapping latency by 2.2X at equal semantic quality. Depth-mapping co-design maintains upstream bandwidth under 2.5 Mbps. On the device, SemanticXR sustains sub-100 ms query latency for up to 10,000 objects even under network drops, supports tens of thousands of objects within 500 MB, and scales downstream bandwidth with map changes, not total scene size. The system adds only 2% device power during normal operation.

High-Order Spectral Element Methods for Wave Propagation on ARM Multicore CPU with SME: Optimizations and Implications

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12850v1 Announce Type: new Abstract: Wave propagation based on the spectral element method (SEM) is a representative HPC workload, but existing SEM implementations are not well matched to emerging ARM multicore CPUs with Scalable Matrix Extension (SME). We present an SME-enabled optimization of SPECFEM3D on the emerging LX2 processor that combines an SME-aware batched small-matrix kernel for SEM tensor-product operators, a memory-aware hybrid MPI+OpenMP execution scheme for limited-HBM systems, and a dispersion-based iso-accuracy study of the $(h,p)$ tradeoff. At fixed polynomial order, the optimized implementation improves full-application performance by 4--6$\times$ over the original code and delivers clear gains over optimized non-SME CPU baselines. Beyond these implementation-level gains, our results suggest that SME shifts the performance-favorable operating point toward higher polynomial orders along the dispersion-based iso-accuracy frontier, further reducing time-to-solution and working-set size. These results indicate that SME affects not only kernel efficiency, but also the practical discretization tradeoff for SEM on modern ARM multicore platforms.

WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12852v1 Announce Type: new Abstract: Rapid advances have been made in developing general-purpose embodied agents in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. Despite their promise, low-level controllers often become performance bottlenecks due to repeated execution failures. We argue that a key limitation is not only the lack of episodic memory, but also the decoupling of what-where-when memory from which-why reasoning. To address this, we propose WISE (Which-Why Informed Semantic Explorer), a long-horizon agent framework with an enhanced low-level controller equipped with a Causal Event Graph that augments episodic memory with explicit causal structure linking observations to task relevance. Unlike prior work such as MrSteve, which relies on feature similarity for retrieval, WISE enables robust recall under viewpoint changes and supports opportunistic task reordering through causal reasoning. Building on this memory, we propose an Opportunistic Task Scheduler that dynamically re-prioritizes subtasks when causally relevant opportunities are detected. We further equip WISE with a multi-scale progressive exploration strategy to provide spatially comprehensive observations for downstream reasoning. Experiments show that WISE substantially improves task success and efficiency on long-horizon sparse tasks, particularly in settings requiring adaptive decision-making.

Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization

Gaurav Kumar — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12854v1 Announce Type: new Abstract: Large Language Models such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but cost and opacity limit scalable use. We fine-tune three small LLMs, Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B, via QLoRA on SciFact and HealthVer, providing the first study of QLoRA models against GPT-4o and fine-tuned BioLinkBERT encoders. Mistral-7B QLoRA surpasses both GPT-4o and GPT-5 (up to 12% F1 gain) at a fraction of the cost using just 1,008 training examples. We conduct extensive in-domain and cross-domain evaluation: models trained on SciFact are tested on HealthVer and vice versa, at matched sizes to isolate dataset structure from data quantity. We identify a previously unreported structural artifact in SciFact that inflates in-domain scores, and show through bidirectional out-of-domain evaluation that training on structurally sound data enables robust cross-domain transfer. We plan to release all code and adapter checkpoints.

Computing Headway Bounds under Worst-Case Bunching in Fixed-Line Transit Systems

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12855v1 Announce Type: new Abstract: Vehicle bunching is a major problem for transit operators. When vehicles bunch together, the lead vehicle will service the majority of passenger demand, leaving the following vehicles to operate below capacity, wasting fuel and money. Furthermore, after the last vehicle in the bunch passes, the time before the next vehicle's arrival (headway) will be large. Transit operators can combat bunching by holding buses at stops along a route, trading riding time for even headway times. While prior work has focused on developing holding policies to minimize average-case bunching, no work has focused on analyzing the longest and shortest possible headway times under a broad group of such policies. We assume that dwell times at stops and travel times between stops are bounded and develop a dynamic program that computes the maximum and minimum headway times for a single bus route with an arbitrary number of control points, vehicles, and holding policies. These bounds are tight in the sense that it is always possible to identify the specific sequence of events that leads to their occurrence. We use these bounds to investigate the effects of different holding policies, stop placement, and number of vehicles on route headways and worst-case bunching. Finally, we apply these analysis techniques to a real-world transit system in Nashville, TN and show their utility for transit planning.

JSCGC: Joint Source-Channel-Generation Coding for Wireless Generative Communications

Tong Wu, Zhiyong Chen, Guo Lu, Li Song, Feng Yang, Meixia Tao, Wenjun Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12858v1 Announce Type: new Abstract: Conventional communication systems, including both separation-based coding and learning-based joint source-channel coding (JSCC), are typically designed under Shannon's rate-distortion theory. However, relying on generic distortion metrics fails to capture complex human visual perception, often resulting in blurred or unrealistic reconstructions. In this paper, we propose Joint Source-Channel-Generation Coding (JSCGC), a generative communication paradigm that replaces the conventional decoder with a generative model at the receiver. The received signal is treated as a condition that controls the sampling process into the learned conditional distribution, reformulating communication from deterministic reconstruction for distortion minimization to controlled generation for mutual information maximization under perceptual constraints. Based on this formulation, we develop a unified joint training and efficient stochastic sampling framework, and provide theoretical analysis of its effectiveness in both learning and inference stages. Extensive experiments on latent-space image transmission demonstrate that the JSCGC consistently improves feature-based, semantic-level, and distributional quality across diverse channel conditions, while exhibiting a distinct error behavior characterized by semantic inconsistency rather than distortion.

AIR-VLA+: Decoupling Movement and Manipulation via Cascaded Dual-Action Decoders with Asymmetric MoE for Aerial Robots

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12859v1 Announce Type: new Abstract: Aerial manipulation systems have long suffered from representation coupling in end-to-end control, as platform-level Unmanned Aerial Vehicle (UAV) movement and end-effector-level arm manipulation differ substantially in action scale, dynamics, and control objectives. In this paper, we propose AIR-VLA+, a flow matching action generation architecture specifically designed for aerial manipulation, featuring cascaded dual-action decoders and an asymmetric feature-level Mixture of Experts (MoE). We construct cascaded manipulation and movement decoders, allowing the UAV to unidirectionally observe the manipulator's intent during movement to achieve workflow coordination, while isolating the impact of UAV movement information backpropagation on arm manipulation stability. Addressing the characteristic that UAV movement is highly dependent on high-level semantics and responsible for task state transitions in aerial manipulation, we design an input feature enhancement module for the UAV movement decoder. This module introduces an implicit visual grasp projector to perceive the interaction state between the gripper and the object, and injects compressed global semantic features. Within the UAV movement decoder, we deploy an implicit MoE architecture, enabling different movement experts to spontaneously exhibit capacity inclinations for various task stages during training. Through dense soft blending computation on the feature manifold, the UAV movement is endowed with stronger task-stage adaptability. Experiments on the standardized AIR-VLA benchmark demonstrate that our method comprehensively surpasses all baselines with an overall average score of 48.0. The overall task completion score improves by 80.2% compared to the single-head $\pi_{0.5}$ policy, effectively mitigating the heterogeneous coordinated control conflicts of composite robots.

Multimodal Graph Negative Learning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12863v1 Announce Type: new Abstract: Multimodal attributed graphs (MAGs) integrate graph topology with heterogeneous modality attributes, such as text and images, thereby enabling richer modeling of complex relational systems. However, such expressiveness also makes learning on MAGs depend on multiple semantic sources, including structural topology and textual and visual attributes, each of which can be regarded as a branch for node representation. Node-level branch semantic imbalance arises when these branches differ across nodes in semantic informativeness and reliability: a branch that provides discriminative semantics for one node may mislead another due to bias in modality quality or structural context. Existing methods often mitigate such heterogeneity through cross-branch agreement or alignment, implicitly treating the dominant prediction as reliable supervision. When the dominant branch is biased, forced imitation may propagate its bias to other branches and suppress original semantics that are useful for classification. We propose GraphMNL, a graph-aware multimodal negative learning framework that addresses this issue by using Negative Learning as cross-branch guidance. Instead of forcing inferior branches to imitate a teacher prediction, the model teaches them which classes a node is unlikely to belong to. GraphMNL builds a branch library, identifies dominant and inferior branches via graph-aware reliability arbitration, gates unstable transfer, and applies target-preserving negative learning over non-target classes. This design decouples target supervision from branch guidance so that supervised losses learn the correct class, while Negative Learning suppresses unlikely alternatives when branch agreement is unreliable. In a comprehensive experimental evaluation, GraphMNL achieves the best performance, with 72.47% accuracy on the Grocery dataset and a 76.60 F1 score on the Reddit M dataset.
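
The core idea of teaching a model which classes an example does not belong to can be sketched with the generic negative-learning objective, which penalizes probability mass on classes believed to be wrong via -log(1 - p_k). The code below is a minimal illustration under that assumption; the function names are hypothetical, and GraphMNL's reliability arbitration, gating, and choice of negative classes are not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def negative_learning_loss(logits, negative_classes, eps=1e-12):
    """Mean of -log(1 - p_k) over classes k the node is believed NOT to belong to."""
    probs = softmax(logits)
    losses = [-math.log(max(1.0 - probs[k], eps)) for k in negative_classes]
    return sum(losses) / len(losses)

# Hypothetical 3-class logits; class 0 is treated as an unlikely class.
confident_wrong = negative_learning_loss([5.0, 0.0, 0.0], [0])
already_low = negative_learning_loss([-5.0, 0.0, 0.0], [0])
```

The loss is large when the model puts high probability on a class marked as negative and close to zero when that probability is already small, so it suppresses alternatives without asserting which class is correct.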

Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

Tingqiang Xu, Hangrui Zhou, Tianle Cai, Alex Gu, Kaifeng Lyu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12864v1 Announce Type: new Abstract: Despite strong performance in competitive programming, the role of Large Language Models (LLMs) in supporting human learning in the same setting remains largely unexplored. In this work, we introduce UOJ-Bench, a benchmark designed to evaluate not only the problem-solving ability of LLMs, but also their ability to identify errors in human-written code -- a crucial educational activity traditionally supported by running test cases over online judge systems. UOJ-Bench consists of three distinct tasks: code generation, code hacking, and code repair, all constructed from real-world code submissions on the Universal Online Judge (UOJ) and evaluated through UOJ's native judging infrastructure. Our results show that under one-shot evaluation, even the strongest models fail to identify errors in more than 50% of a set of submissions that have been found to be incorrect by UOJ users. While test-time scaling improves success rates to above 90%, the substantial computational costs incurred from model inference limit its practicality for large-scale deployment. Despite these limitations, we find that the best-performing models under test-time scaling can uncover errors in over 5% of full-score submissions across roughly 30 problems, suggesting that frontier LLMs can already provide complementary signals beyond standard judging systems.

SMGFM: Spectral Multimodal Graph Pretraining for Multimodal-Attributed Graphs

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12867v1 Announce Type: new Abstract: Multimodal-attributed graphs (MAGs) couple graph topology with node semantics from text, images, and other modalities. Traditional graph learning contextualizes node semantics by coupling topology with node features. However, this coupling design becomes troublesome in MAGs, where structure-induced and modality-intrinsic semantics may contribute differently to downstream tasks. Structure-induced semantics promote relational consistency through smooth topological variation, whereas modality-intrinsic semantics often encode local, fine-grained distinctions that should not be uniformly smoothed or aligned. Therefore, the key challenge is to identify semantic roles before cross-modal fusion. To this end, we leverage graph-frequency variation as a prior, where low-frequency components capture topology-consistent semantics and high-frequency components preserve modality-specific semantics. Based on this intuition, we propose SMGFM, a spectral multimodal graph pretraining framework that decomposes each modality-specific node signal into graph-frequency bands and assigns band-level semantic roles before cross-modal interaction. Concretely, SMGFM constructs frequency-resolved modality tokens with scalable Chebyshev filters, estimates their coupling reliability through topology-conditioned routing, and performs band-modality interaction before fusion. Its frequency-routed objectives align smooth consensus routes while preserving modality-specific routes, mitigating spatial-domain entanglement and uniform cross-modal alignment. Extensive experiments conducted on the MAG datasets demonstrate that SMGFM achieves state-of-the-art performance across graph-level and modality-level tasks.
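
Chebyshev graph filters avoid an explicit eigendecomposition by applying the three-term recurrence T_k(L)x = 2 L T_{k-1}(L)x - T_{k-2}(L)x to a node signal x. The sketch below shows only that recurrence in pure Python. It assumes the Laplacian has already been rescaled so its eigenvalues lie in [-1, 1]; the function names are hypothetical, and how SMGFM combines these terms into frequency bands is not specified in the abstract.

```python
def matvec(matrix, vec):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def chebyshev_terms(scaled_laplacian, signal, order):
    """Return [T_0(L)x, T_1(L)x, ..., T_order(L)x] via the three-term recurrence."""
    terms = [list(signal)]                      # T_0(L)x = x
    if order >= 1:
        terms.append(matvec(scaled_laplacian, signal))  # T_1(L)x = Lx
    for _ in range(2, order + 1):
        lx = matvec(scaled_laplacian, terms[-1])
        # T_k(L)x = 2 L T_{k-1}(L)x - T_{k-2}(L)x
        terms.append([2.0 * a - b for a, b in zip(lx, terms[-2])])
    return terms

# Degenerate one-node example, used only to check the recurrence against
# the scalar Chebyshev polynomials evaluated at 0.5.
terms = chebyshev_terms([[0.5]], [1.0], 3)
```

Each additional term costs one sparse matrix-vector product in a real implementation, which is what makes this family of filters scalable.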

Learning Task-Aware Sampling with Shared Saliency through Density-Equalizing Mappings

Tsz Lok Ip, Han Zhang, Lok Ming Lui — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12869v1 Announce Type: new Abstract: In image and surface-based learning tasks, convolutional features are typically extracted using receptive fields that are sampled uniformly across the entire domain. However, informative structures are rarely distributed uniformly in practice and are often concentrated in localized regions. Such phenomena are particularly common in medical imaging, where pathological changes are spatially confined. Consequently, uniform convolution allocates equal computational effort to both informative and uninformative regions, resulting in inefficient feature extraction and suboptimal utilization of model capacity. To address this issue, we propose a framework for task-adaptive sampling that dynamically redistributes computational attention according to the spatial importance of the data. Specifically, we introduce the Density-Equalizing Convolutional Neural Network (DECNN), which employs density-equalizing mappings to guide convolution through a learned density function. The density function encodes the relative importance of different regions and induces a transformation that enlarges informative areas while compressing less relevant ones. As a result, convolutional receptive fields are redistributed non-uniformly over the domain, enabling denser sampling in task-relevant regions. By coupling this importance-driven transformation with convolution, DECNN performs adaptive feature extraction that focuses computational resources on informative structures. This leads to more efficient use of model capacity, yielding a lightweight yet expressive architecture while simultaneously producing an interpretable saliency map. Experiments on image classification and craniofacial surface analysis demonstrate that DECNN achieves competitive or superior performance with fewer parameters, accurately identifies task-relevant regions, and remains robust under complex geometric variations.

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12871v1 Announce Type: new Abstract: Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are made publicly available at https://github.com/AGI-Eval-Official/DailyReport.

Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

Liza Babaoglu, Shuangyi Chen, Ashish Khisti — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12876v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.
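
The idea that ordered subsets of additive codebooks give progressively better reconstructions can be illustrated with a toy scalar example: a weight is approximated by a sum of codewords, and decoding with only the first few codebooks yields a coarser, lower-bitwidth view. This is a sketch under strong simplifying assumptions (scalar weights, greedy residual encoding, hand-picked hypothetical codebooks); it is not the Drop-by-Drop training procedure.

```python
def encode(weight, codebooks):
    """Greedy residual encoding: per codebook, pick the codeword closest to the residual."""
    indices, residual = [], weight
    for book in codebooks:
        idx = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        indices.append(idx)
        residual -= book[idx]
    return indices

def decode(indices, codebooks, num_books):
    """Reconstruct from only the first num_books codebooks (a lower-precision view)."""
    return sum(codebooks[j][indices[j]] for j in range(num_books))

# Hypothetical scalar codebooks, coarse to fine (4 entries = 2 bits each).
# Each includes 0.0, so adding a stage cannot increase the greedy error.
codebooks = [
    [-1.0, 0.0, 1.0, 2.0],
    [-0.5, -0.25, 0.0, 0.25],
    [-0.125, -0.0625, 0.0, 0.0625],
]
weight = 0.83
indices = encode(weight, codebooks)
# Reconstruction error when using 0, 1, 2, or 3 codebooks.
errors = [abs(weight - decode(indices, codebooks, k))
          for k in range(len(codebooks) + 1)]
```

A single stored set of indices serves every precision level here, which mirrors the abstract's point that one checkpoint can serve multiple bitwidths.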

Diffusion-Network Alignment: An Efficient Algorithm and Explicit Probability Bounds

Ziao Wang, Lei Ying — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12879v1 Announce Type: new Abstract: This paper studies a variation of the classic network alignment problem, named diffusion-network alignment. The goal is to align the vertices of a rooted diffusion tree to the vertices of a network, where the diffusion tree could be from a communication trace or contact tracing, and the network could be an online or offline social network. Different from the classic network alignment where both networks are fully observed, this model captures the information asymmetry of two networks. To solve this problem, this paper presents an efficient algorithm based on tree correlation tests to extract alignment information from local neighborhoods. We analyze the performance of the algorithm in the sparse graph regime and show that with high probability, all matched pairs are correct. Furthermore, for each vertex on the diffusion tree, this paper establishes an explicit lower bound on the probability that the vertex is correctly matched. These lower bounds are depth-dependent and increase as vertices get closer to the root.

Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study

Yvonne Qiu, Dezhi Yu, ShuoJia Fu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12881v1 Announce Type: new Abstract: We present an approach to fine-tuning large language models using Direct Preference Optimization (DPO), a reinforcement learning technique. Our experimental results demonstrate that DPO simplifies the training pipeline, improves computational efficiency, and achieves competitive performance. The evaluation using BLEU, ROUGE, and cosine similarity metrics indicates effective learning and convergence, though further investigation is needed to address observed training instability.
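
For context, the standard DPO objective scores a preference pair by how much more the policy favors the chosen response over the rejected one, relative to a frozen reference model. The sketch below computes that loss for a single pair from sequence log-probabilities. The function and argument names and the default `beta` are illustrative assumptions, not the authors' code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # Log-ratio of policy to reference for each response.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable softplus form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# When the policy equals the reference, the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# When the policy shifts probability toward the chosen response, the loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

Because the loss needs only log-probabilities from the policy and the reference model, no separate reward model or sampling loop is required, which is the pipeline simplification the abstract refers to.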

HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness

Xiaoxuan Wang, Haixin Wang, Alexander Taylor, Jason Cong, Yizhou Sun, Wei Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12882v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents for long-horizon tasks, yet their performance is shaped not only by model capability and environment design, but also by the harness that mediates agent--environment interaction. Existing harnesses are largely manually engineered, making them difficult to scale as trajectories grow longer and interactions become more complex. In this work, we ask whether a harness can be generated by a learnable plug-in module that can be trained in an end-to-end fashion. We introduce HarnessBridge, a lightweight learnable harness controller that parameterizes the agent--environment interface as a bidirectional projection. HarnessBridge learns two projections: an observation projection, which distills raw trajectories into compact, decision-relevant states, and an action projection, which converts proposed actions into executable transitions or trajectory-grounded rejections. We train HarnessBridge on a harness supervision dataset via unified instruction tuning. On Terminal-Bench 2.0 and SWE-bench Verified, HarnessBridge matches or surpasses strong specialized harnesses while substantially reducing token usage and trajectory length, and generalizes from smaller generators to larger commercial models.

The Hidden Power of Scaling Factor in LoRA Optimization

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12883v1 Announce Type: new Abstract: In Low-Rank Adaptation (LoRA), the scaling factor $\alpha$ is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we reveal that the scaling factor $\alpha$ and the learning rate function differently, with $\alpha$ emerging as the dominant driver of effective optimization, delivering gains that cannot be replicated by learning rate scaling alone. Through the synergy of extensive empirical analysis and a theoretical Signal-Drift framework, we uncover three findings about LoRA's scaling mechanism. First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when leveraging this smoothness to accelerate convergence, $\alpha$ outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, revealing the insufficient scaling of existing rank-tied heuristics. Based on these insights, we propose LoRA-$\alpha$, a minimalist framework that restores $\alpha$ to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-$\alpha$ consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.
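
To make the role of $\alpha$ concrete, the block below writes out the standard LoRA forward pass together with a schematic form of the square-root law described above. It assumes the common parameterization in which the low-rank update is multiplied by $\alpha / r$. The symbols $d$, $k$, $\alpha^{\star}$, and the constant $c$ are notation introduced here for illustration; the abstract does not give the coefficient's value or state which parameterization the law refers to.

```latex
% Standard LoRA forward pass with frozen weight W_0 and rank-r update (assumed form)
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}

% Schematic square-root law for the optimal scaling factor (c is unspecified here)
\alpha^{\star}(r) \approx c \sqrt{r}
```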

Mixed-Categorical Black-Box Optimization via Information-Geometric Bilevel Decomposition

Marc Ong, Shinichi Shirakawa, Youhei Akimoto — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12885v1 Announce Type: new Abstract: Mixed categorical-continuous optimization arises in many practical domains, yet remains challenging. In the black-box setting, evolution strategy-based approaches have shown promise in extending the efficiency and robustness of the CMA-ES to mixed-variable spaces. However, these methods perform worse when strong categorical-continuous interactions are present, as their underlying search distributions assume independence between categorical and continuous variables. To address this limitation, we propose a bilevel optimization framework that explicitly captures such interactions by optimizing over categorical variables in an outer loop, and over continuous variables conditioned on each categorical configuration in an inner loop. We formulate each level of the bilevel problem as a stochastic relaxation under information-geometric optimization. To mitigate the high computational cost inherent to bilevel optimization, we introduce a warm-starting strategy that accelerates the lower-level search by selecting the best among multiple cached configurations and updating the cache after each iteration. Experimental results on binary-continuous domains demonstrate that the proposed method outperforms existing state-of-the-art approaches in interaction-handling capability while also being more computationally efficient across benchmarks encompassing both previously reported and newly proposed types of interaction.

Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12886v1 Announce Type: new Abstract: Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information loss at modality boundaries. We decompose each reasoning cycle into atomic operations and define modality transition loss, quantifying cross-modal hallucination (text-to-image) and visual utilization deficit (image-to-text) at each boundary. We propose MoTiF (Modality Transition Fidelity), a two-stage training framework that directly optimizes these transitions: Reflective SFT trains the model to detect and recover from erroneous visual outputs; Flow-GRPO improves image generation fidelity via reinforcement learning. All training signals in MoTiF derive from transition-level fidelity rather than end-task accuracy. Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy. The results demonstrate that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.

LNTest: A Testbed for Evaluating Bitcoin Lightning Network-Based Botnets

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12887v1 Announce Type: new Abstract: Bitcoin's Lightning Network (LN) can be exploited as a covert, low-cost command-and-control (C&C) channel for botnets, as demonstrated by the LNBot and D-LNBot designs. However, both remain proof-of-concept prototypes evaluated only through simulation, leaving key questions about real-world topology formation, propagation complexity, and resilience to takedowns unanswered. We present LNTest, the first reusable testbed for LN-based botnets, built from Core Lightning nodes containerized with Docker over a shared Bitcoin Core regtest chain. LNTest supports three overlay topology modes (a deterministic chain, autonomous peer discovery, and user-supplied graphs), enabling controlled experiments across different botnet structures. Using LNTest, we report three main findings. First, D-LNBot's autonomous formation protocol does not produce the uniform chain from its design; instead, it creates a clustered chain in which cliques are linked by bridge nodes whose removal fragments the network. Second, command propagation scales linearly with botnet size ($\Theta(n)$), not the $O(m \log n)$ previously claimed, and gains nothing from higher neighbor connectivity. Third, the overlay topology determines the effectiveness of takedown strategies: uniform-degree chains resist targeted removal but fragment under random failure, scale-free topologies show the opposite pattern, and the autonomous clustered chain is fragile under both, making it the most vulnerable of the three. LNTest is released as open source, with a script that reproduces all our experiments, to support reproducible research on LN-based botnet defenses.

Pinching-Antenna Enabled Multicell Wireless Systems

Yunshu Chen, Qing Xue, Meng Hua, Bingpeng Zhou, Shaodan Ma — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12888v1 Announce Type: new Abstract: Pinching antenna (PA) systems have recently emerged as a promising flexible-antenna technology, which can reconstruct the wireless propagation environment by dynamically adjusting the positions of pinching elements along dielectric waveguides, thereby providing new spatial degrees of freedom (DoFs) for enhancing wireless system performance. This paper investigates a multi-waveguide PA-based multi-cell communication system, focusing on the joint optimization of precoding matrices, waveguide power allocation, and antenna placement to maximize the weighted sum rate (WSR). In multi-cell scenarios, inter-cell interference typically leads to a highly coupled and nonconvex WSR maximization problem. To address this challenge, an efficient alternating optimization framework is adopted to optimize each variable in an iterative way. Specifically, fractional programming is first employed to reformulate the original problem by introducing auxiliary variables that decouple the signal and interference terms. Based on this reformulation, block coordinate descent is then applied to optimize the precoding matrices and power allocation, leading to closed-form or semi-closed-form updates. For the high-dimensional and nonconvex PA placement problem, particle swarm optimization (PSO) is utilized to perform an efficient search and improve scalability. Numerical results demonstrate that, under various system configurations, the proposed scheme significantly outperforms baseline methods, including average power allocation, fixed antenna placement, conventional multiple-input multiple-output (MIMO), and massive MIMO. These results highlight the strong potential of PA systems for large-scale multi-cell wireless communications.

Learning to Adapt: Representation-Based Reinforcement Learning for Multi-Task Skill Transfer

Aryan Naveen, Haitong Ma, Haldun Balim, Na Li — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12890v1 Announce Type: new Abstract: Reinforcement learning has achieved remarkable success in learning complex control policies, yet its applicability remains limited due to sample inefficiency and poor generalization across tasks. In this work, we propose RepMT-SAC, a framework for multi-task RL that enables efficient knowledge sharing and robust transfer to new tasks. RepMT-SAC uses spectral MDP decomposition to capture transferable dynamics, structuring the value function into a task-agnostic core with a minimal task-specific adjustment. This design allows for strong zero-shot performance on in-distribution tasks and rapid few-shot adaptation to out-of-distribution tasks. We evaluate RepMT-SAC on quadcopter trajectory-following tasks across in-distribution and out-of-distribution contexts, demonstrating that it outperforms baselines by up to 30%.

LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning

Xinrui He, Qiyu Kang, Xuhao Li, Zheng-Jun Zha — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12895v1 Announce Type: new Abstract: Spiking Neural Networks (SNNs) are well-regarded for their biological plausibility and energy efficiency in processing sequential data. However, dominant SNN architectures typically rely on first-order Ordinary Differential Equations (ODEs) to govern neuronal state transitions. This first-order assumption imposes a "memoryless" bottleneck, limiting the model's capacity to capture the complex, long-range dependencies inherent in long-sequence tasks. In this work, we propose LongSpike, a novel SNN framework that integrates fractional-order State-Space Modeling, or f-SSM, from control theory into the spiking domain. By extending traditional integer-order SSMs to the fractional-calculus regime, LongSpike enables the hierarchical integration of neuronal dynamics with long-memory kernels. To mitigate the computational overhead and parallelization challenges typically associated with fractional operators, we leverage a state-space formulation that supports efficient, parallel training. Empirical evaluations on challenging benchmarks, including Long Range Arena (LRA), large-scale WikiText-103, and Speech Commands, demonstrate that LongSpike outperforms state-of-the-art SNNs in accuracy while preserving sparse synaptic computation. The code is available at https://github.com/xinruihe389-commits/LongSpike.

PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

Junfeng Guo, Heng Huang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12896v1 Announce Type: new Abstract: While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent's internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose PolicyGuard, a test-time, step-level backdoor defense that leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time steps. We also provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with an average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.

SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12897v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to access organisational documentation, including standard operating procedures (SOPs), HR policies and institutional guidelines. However, retrieval-augmented generation (RAG) systems that rely on free-form rewriting can introduce hallucinations and unstable trade-offs between completeness and conciseness, particularly in safety- and compliance-critical settings. Objectives: To evaluate extraction as a hallucination-resistant alternative to rewriting-based RAG and compare strategies that balance precision, recall and safety across document types and model scales. Methods: We compare multiple prompting strategies, including line-number-based source selection, extraction of relevant guideline sentences with explicit safety annotations, and a multi-stage pipeline that refines draft answers using supporting evidence from source guidelines. Experiments are conducted on documents of varying length and structure, including local NHS acute care and oncology guidelines and UK-wide NICE guidelines, using both frontier-scale and locally deployable models. Performance is assessed using automatic metrics and human expert evaluation of relevance and completeness. Results: Line-number selection achieves the strongest results, outperforming direct copying and safety-focused strategies across both large and small models while maintaining high term recall (up to 95%) and close alignment with source text. Safety-oriented approaches improve precision but introduce systematic omissions, while multi-stage filtering further amplifies this trade-off. Performance varies with document structure: line-based extraction excels in protocol-like content, whereas alternative strategies perform better on more verbose documents (up to 97% term recall).
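
The line-number-based source selection described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the selection format ("2, 3-4"), and the sample guideline text are assumptions made up for illustration. The idea shown is that the model only returns line numbers, and the answer text is then copied verbatim from the source, so no wording is generated freely.

```python
import re

def number_lines(text):
    """Prefix each source line with its 1-based number for use in a prompt."""
    return "\n".join(f"{i}: {line}" for i, line in enumerate(text.splitlines(), 1))

def extract_by_line_numbers(text, selection):
    """Copy the selected source lines verbatim.

    `selection` is a hypothetical model reply such as "2, 3-4";
    numbers outside the document are ignored rather than guessed.
    """
    lines = text.splitlines()
    chosen = []
    for start, end in re.findall(r"(\d+)(?:\s*-\s*(\d+))?", selection):
        lo = int(start)
        hi = int(end) if end else lo
        for n in range(lo, hi + 1):
            if 1 <= n <= len(lines) and n not in chosen:
                chosen.append(n)
    return [lines[n - 1] for n in sorted(chosen)]

# Invented sample text, not taken from any real guideline.
GUIDELINE = (
    "Check patient identity.\n"
    "Confirm allergies.\n"
    "Administer dose as prescribed.\n"
    "Record time of administration."
)

numbered = number_lines(GUIDELINE)
answer = extract_by_line_numbers(GUIDELINE, "2, 3-4, 9")
```

Because out-of-range numbers such as 9 are dropped, a malformed selection can shorten the answer but cannot add text that is absent from the source.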

Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension

Shenglai Zeng, Qirui Wang, Kai Guo, Xinnan Dai, Xianxuan Long, Hui Liu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12898v1 Announce Type: new Abstract: Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM's own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.

Zero-source LLM Hallucination Detection with Human-like Criteria Probing

Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen, Mingkui Tan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12900v1 Announce Type: new Abstract: Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at https://github.com/TRISKEL10N/HCPD.

PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue

Wen Zhang, Xiaocui Yang, Zhuoyue Gao, Shi Feng, Daling Wang, Yifei Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12902v1 Announce Type: new Abstract: Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascade pipelines often discard acoustic cues during speech-to-text conversion, while end-to-end speech models lack interpretable control over emotion and knowledge integration. To address these challenges, we propose PRISM, a multi-agent framework for empathetic spoken dialogue that decouples speech perception, response generation, and speech synthesis into coordinated components. PRISM introduces a prosody-to-language translation mechanism to stabilize large language model reasoning and enables on-demand invocation of external knowledge tools for empathetic dialogue generation. Experimental results demonstrate that PRISM achieves consistent improvements in empathy, prosodic appropriateness, and text response generation quality across objective and subjective metrics. Our code is available at: https://github.com/Bxzfrm/PRISM.

X-MADAM-RAG: Diagnosing and Handling Chinese-English Evidence Conflict in Retrieval-Augmented Generation

Yongqi Kang, Yu Fu, Yong Zhao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12903v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) systems may receive evidence that is not merely noisy but mutually contradictory. This issue becomes particularly salient in multilingual settings, where retrieved Chinese and English evidence may support incompatible answer candidates. We study this problem through X-RAMDocs-ZHEN, a controlled Chinese-English benchmark derived from RAMDocs for diagnosing evidence conflict in RAG. The benchmark contains 300 examples across six balanced conditions, including monolingual support, bilingual agreement, reversed conflict directions, and conflict with optional noise. We further examine X-MADAM-RAG, an interpretable pipeline that decomposes evidence handling into per-document candidate extraction, visible-evidence repair, deterministic candidate grouping, and conflict-aware aggregation. On the original controlled benchmark with Qwen2.5-7B-Instruct, X-MADAM-RAG achieves 0.9667 strict accuracy and 0.9767 conflict-aware success, outperforming an evidence-normalized single-call baseline. However, a zero-call rule-only extractor reaches 1.0000 on the same benchmark, revealing strong template regularity. To probe this limitation, we construct a deterministic naturalized stress test that removes explicit answer templates while preserving candidate strings. On its 100-sample subset, rule-only extraction falls to 0.0000, but X-MADAM-RAG also drops to 0.3000 strict accuracy, below both naive and evidence-normalized baselines. A privileged oracle remains perfect, indicating that document-level extraction is the main bottleneck. These findings position X-RAMDocs-ZHEN and X-MADAM-RAG as diagnostic tools for controlled evidence conflict rather than as evidence of general hallucination detection or robustness to natural retrieval.

Trait, Not State: The Durability of Reading Identity in Social Highlighting

Kazuki Nakayashiki, Keisuke Watanabe — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12904v1 Announce Type: new Abstract: Prior work on a social web highlighter located individuality in selection -- which documents a person chooses to highlight -- but measured it cross-sectionally. We ask the temporal question: is a reader's selection signature a trait or a state? We freeze each reader's first six months of highlighting as a profile and track its own-vs-other advantage on their later selections at growing gaps (to 24+ months), with negatives drawn from the same calendar era -- so supply drift cannot masquerade as personal drift -- at a coarse global level and at a fine level whose negatives and controls come from the reader's own interest neighborhood; the anchor cell reproduces the prior cross-sectional level (+0.188 vs +0.169), validating the harness. Four results. Within the same users, the fine-layer advantage shows no statistically detectable paired decline at any horizon (6-12 month retention R = 1.00 [0.85, 1.18], n = 212; the farthest bin is compatible with a modest decline; the only contrast whose interval excludes zero is the coarse layer at 12-24 months, about 13%). The signal is not reducible to repeated domains (~90% survives excluding all profile sources). Within-person drift is slow (a recent-half profile beats the old half by +0.042). Prospectively, personal profiles -- even one built from a reader's earliest documents, median 20 months before evaluation -- rank their next reads at roughly 3x the AP of every simple non-personal prior tested. We use "trait" operationally (a stable signature under continued engagement); the scope is heavy, long-tenured readers of one platform, and exposure is not separable from choice.

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12908v1 Announce Type: new Abstract: Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own environment interactions, its effectiveness depends heavily on the training task distribution. When tasks are fixed before training, the task distribution can become increasingly mismatched with the policy's evolving capabilities, causing many rollouts to be spent on uninformative tasks. We propose SENTINEL, a failure-driven reinforcement learning framework that turns the Solver's rollout failures into targeted training tasks. SENTINEL follows a Controller-Proposer-Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass^1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks across Pass^k metrics. These results demonstrate that model failures provide an effective and scalable source of targeted training signal for improving tool-using language model agents.

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12910v1 Announce Type: new Abstract: For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12911v1 Announce Type: new Abstract: Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST, classifying substitution errors by phonetic cause and quantifying their impact on downstream Neural Machine Translation (NMT) performance using Linear Mixed-Effects Modelling. We confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these phonetic errors significantly degrade ST quality. Motivated by this finding, we propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives using phonetic word embeddings. Fine-tuning on a PiDA-augmented version of FLEURS Vietnamese-English improves translation of erroneous ASR outputs (up to +2.04 BLEU over standard fine-tuning) while also slightly improving clean-text performance.

Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12913v1 Announce Type: new Abstract: The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods, which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost, reducing training time by over 40% without sacrificing accuracy on ImageNet-1k with ResNet-50.
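
The greedy selection by sample-wise marginal gains mentioned in this abstract can be sketched as follows. This is an illustrative reading, not the paper's algorithm: the objective (sum of node weights plus pairwise edge weights among kept samples), the function name `greedy_prune`, and the toy weights are all assumptions.

```python
def greedy_prune(node_w, edge_w, budget):
    """Greedily keep `budget` samples from a weighted graph.

    node_w[i]    : intrinsic value of sample i (assumed given).
    edge_w[i][j] : extrinsic value of keeping i and j together,
                   e.g. a diversity score (assumed symmetric).
    At each step the sample with the largest marginal gain is added:
    its node weight plus its edge weights to the samples already kept.
    """
    selected = []
    remaining = set(range(len(node_w)))
    while remaining and len(selected) < budget:
        def gain(i):
            return node_w[i] + sum(edge_w[i][j] for j in selected)
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: samples 0 and 1 are near-duplicates (edge weight 0.0).
NODE_W = [0.9, 0.8, 0.3, 0.5]
EDGE_W = [
    [0.0, 0.0, 0.9, 0.4],
    [0.0, 0.0, 0.9, 0.4],
    [0.9, 0.9, 0.0, 0.5],
    [0.4, 0.4, 0.5, 0.0],
]

kept = greedy_prune(NODE_W, EDGE_W, budget=2)
```

In the toy data, sample 1 has the second-highest intrinsic value but adds no diversity next to sample 0, so the low-value yet distinct sample 2 is kept instead.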

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12916v1 Announce Type: new Abstract: Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principles physics. Designing an MD pipeline for a new system requires substantial expert knowledge: running it on even one molecule is expensive, ruling out trial-and-error. We automate this expert pipeline-design process with an LLM agent. Unlike existing MD agents that orchestrate a predefined tool set, we treat pipeline design as open-ended code generation in which the agent's behavior is reshaped online by verbal reward. Specifically, we build MDForge, an LLM agent whose in-context update rule densifies the sparse reward via a multi-agent debate among physics experts. On three SAMPL host-guest binding free-energy benchmarks, MDForge automatically designs MD pipelines competitive with human experts. Deployed on a library of unseen candidate guests, its CB[7] pipeline discovers a novel binder that wet-lab competition NMR confirms is a high-affinity, picomolar CB[7] binder. Our data and code are available at https://github.com/Zehong-Wang/MDForge.

Where Computation Lives Inside TabPFN: Causal Localisation of Attention Head Function

Atharva Gupta, Dhruv Kumar, Murari Mandal, Saurabh Deshpande — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12917v1 Announce Type: new Abstract: We present the first causal mechanistic analysis of a tabular foundation model, investigating how TabPFN 2.5's feature-wise attention heads distribute computation across layers. Using activation patching, ablation, and attention entropy across two synthetic regression datasets, we find clear temporal specialisation: one head's causal necessity dominates that of the others by 2 to 5 times at the peak layer, with its dominant layer shifting across tasks of different complexity, while the remaining heads exhibit symmetric late-layer profiles. Attention entropy and patching provide convergent evidence for the computationally active layers of the dominant head. We additionally investigate inference-time steerability via contrastive activation steering, which fails to transfer across samples. We attribute this result to TabPFN's in-context learning mechanism, which encodes task structure through context-dependent attention rather than the stable parametric directions that make steering tractable in language models.

MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12918v1 Announce Type: new Abstract: Hierarchical multi-agent systems (MAS) are rapidly being deployed in high-stakes workflows across domains such as finance and software engineering. In these systems, safety and security are inherently distributed across role-specialized agents, significantly expanding the attack surface, particularly under coordinated adversarial behaviors such as privilege escalation and cross-agent collusion. Existing red-teaming approaches for MAS remain limited: they rely on heuristic selection of target agents and perturb isolated message streams, leaving critical questions unanswered, such as which agents are most responsible for system safety and how compromised agents can coordinate to bypass defenses. We propose MAStrike, a closed-loop framework for collusive red-teaming in hierarchical MAS. We introduce the first agent-level Shapley value analysis for MAS, quantifying each agent's marginal contribution to system robustness under task-specific distributions. Guided by this attribution, MAStrike identifies vulnerable agent coalitions and generates coordinated, role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis, attributing failure cases to uncompromised agents that block adversarial attempts. We further build a comprehensive MAS red-teaming benchmark and controllable environments spanning diverse hierarchical topologies and domains, including finance, software engineering, and CRM. Extensive experiments across MAS built on multiple frontier models show that MAStrike substantially outperforms heuristic baselines. Our analysis further uncovers non-trivial Shapley value distributions and higher-order interaction structures among agents, revealing critical vulnerabilities and coordination patterns that are overlooked by prior single-agent or template-based methods.
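
For readers unfamiliar with Shapley attribution, the following is a minimal sketch of an exact Shapley computation over a small set of agents. It only illustrates the general definition (average marginal contribution over all orderings); the agent names and the `robustness` scoring function are hypothetical, and the abstract does not say how MAStrike estimates these values in practice.

```python
from itertools import permutations

def shapley_values(agents, value):
    """Exact Shapley values: mean marginal contribution over all orderings.

    `value` maps a frozenset of agents to a real-valued score.
    Cost grows factorially, so this is only usable for a handful of agents.
    """
    totals = {a: 0.0 for a in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        coalition = frozenset()
        for agent in order:
            with_agent = coalition | {agent}
            totals[agent] += value(with_agent) - value(coalition)
            coalition = with_agent
    return {a: totals[a] / len(orderings) for a in agents}

def robustness(coalition):
    """Hypothetical robustness score when only `coalition` is uncompromised."""
    score = 0.0
    if "verifier" in coalition:
        score += 0.5  # verifier blocks attacks on its own
    if "planner" in coalition:
        score += 0.2  # planner helps somewhat on its own
    if {"planner", "executor"} <= coalition:
        score += 0.3  # extra protection only when both are intact
    return score

AGENTS = ["planner", "executor", "verifier"]
phi = shapley_values(AGENTS, robustness)
```

The interaction term is split evenly between planner and executor, which is how a Shapley analysis surfaces agents that matter only in combination with others.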

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

Franz Louis Cesista, Katherine Crowson, Cédric Simal, Stella Biderman — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12921v1 Announce Type: new Abstract: Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning Deep Learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer's spectral steepest-descent rule to the low-rank setting. Along with our split weight-decay rule, our main claim is that LoRA-Muon is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor-rescaling. In our compute-matched TinyShakespeare study, a rank-$2$ proxy recovers the dense best tested learning rate, and a rank-$32$ LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE's simplified QR-coordinate core implements the same spectral update. LoRA-Muon computes that update without QR-decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.

Polar: A Benchmark for Evaluating Political Bias in LLMs

Sangho Kim, Heejin Kim, Yoonhee Park, Hyunggeun Jeon, Jaejin Lee — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12922v1 Announce Type: new Abstract: Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that measures political bias through option-level likelihoods rather than prompt-based generation. Polar covers two ideological axes and eight issue categories derived from the Manifesto Project, and evaluates models in parallel across U.S. and South Korean political contexts. Across 38 LLMs, measured bias varies systematically with political context, issue category, model group, and presentation language. All models lean left-progressive on U.S. political content, but show more centered and mixed patterns on South Korean content. Translation experiments further show that presentation language alone can shift measured bias. These findings highlight the need for multilingual and cross-contextual evaluation of political bias in LLMs.

Order Is Not Control

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12923v1 Announce Type: new Abstract: AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping material state, action/drive, bath, and receiver state to response displacement, sinks, effort, and basin projection. We identify it across biological, LLM, adapter, and stochastic-operator panels. The laws are local: an intervention can be admitted, saturated, sign-changing, leaky, or overdriven depending on medium, bath, receiver state, action port, and comparator. Control is assigned when finite effort moves a target or outcome-readout class under the same denominator while damage, null/evasive, invalid format, overdrive, and unnecessary effort stay bounded. Mouse ALM, C. elegans, and zebrafish panels provide physical response-operator evidence while excluding coordinate identity and controller conclusions. LLM panels show generated-output response laws: across four material conditions, response vectors are predictable at 72.8-73.7% component-sign accuracy, rising to 84.3-84.8% on nonzero components; held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. Constitution-conditioned adapters reshape susceptibility as prepared media, and stochastic-operator panels separate measured opportunity from deployable action policies. This gives a driven-dissipative response-system account at the mesoscopic control level: drives act through prepared media, baths, and receivers, producing admitted movement, impedance, sinks, or overdrive. The evidence supports local admitted control and measurable stochastic response operators, while leaving deployable pre-generation control, hidden/logit causal sufficiency, biological-to-LLM coordinate identity, and literal thermodynamic quantities outside scope.

Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12924v1 Announce Type: new Abstract: We present a modular two-agent simulation framework for evaluating conversational shopping assistant architectures. An independent buyer agent, configured with personas, missions, and patience levels, is paired with an interchangeable responder that integrates with a real e-commerce search API. Holding the buyer constant across experiments enables controlled comparison of responder designs on identical scenarios. Using 2011 conversations across 14 persona buckets, we establish four empirical findings. First, rolling-window memory outperforms intent-extraction memory on all quality metrics while being 35% faster per query. Second, illustrating rapid evidence-driven iteration, a systematic failure analysis of a responder version enables targeted fixes that reduce failure and near-failure rates by 62% across the full dataset. Third, swapping the responder LLM backbone from Gemini 2.5 to Llama 3.3 70B costs 0.16-0.45 points despite identical architecture. Finally, we document systematic philosophical disagreement between frontier LLM judges: Gemini rewards process correctness while Claude demands concrete outcomes, despite using the same evaluation prompt.

Multi-Label Test-Time Adaptation with Bayesian Conditional Priors

Qiru Li, Ao Zhou, Zhiwei Jiang, Zifeng Cheng, Cong Wang, Yafeng Yin, Qing Gu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12925v1 Announce Type: new Abstract: Multi-label recognition with frozen Vision-Language Models (VLMs) is brittle under distribution shift: standard zero-shot inference scores labels independently, ignoring co-occurrence structure and producing incoherent label sets where dominant concepts suppress weaker but compatible labels. We introduce Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method that injects label dependency without tuning the backbone. BCP views zero-shot logits as a proxy for marginal posteriors under a fixed image-text likelihood and attributes shift-induced errors mainly to a mismatched label prior. For each test image, it selects a high-confidence anchor label and applies an anchor-conditioned Bayesian refinement. This update is closed-form in logit space and admits a pointwise mutual information (PMI) interpretation, explicitly promoting compatible labels and suppressing incompatible ones. BCP operates without target annotations by estimating anchor-conditioned priors online from the unlabeled test stream via lightweight second-order co-occurrence statistics, adding negligible overhead beyond a single forward pass. Across standard multi-label benchmarks and multiple CLIP backbones, BCP consistently outperforms strong TTA baselines, e.g., improving RN50 average mAP from 57.31 to 69.22 and ViT-B/16 from 62.61 to 71.79.
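
The anchor-conditioned update described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pmi_refine`, the scaling factor `alpha`, the smoothing constant `eps`, and the use of simple running counts as co-occurrence statistics are all assumptions made for illustration. The sketch picks the highest-logit label as the anchor and shifts every other logit by an estimated PMI term, log P(j | anchor) - log P(j).

```python
import math

def pmi_refine(logits, cooc, counts, n_seen, alpha=1.0, eps=1e-6):
    """Shift each label's logit by its estimated PMI with the anchor label.

    logits : zero-shot logits, one per label
    cooc   : cooc[i][j] = running count of test images where i and j co-occurred
    counts : counts[i]  = running count of test images where i was predicted
    n_seen : number of test images seen so far
    alpha, eps : hypothetical scaling and smoothing constants
    """
    # Anchor = the most confident label for this image.
    anchor = max(range(len(logits)), key=lambda i: logits[i])
    refined = list(logits)
    for j in range(len(logits)):
        if j == anchor:
            continue
        p_j = (counts[j] + eps) / (n_seen + eps)                 # marginal prior
        p_j_given_a = (cooc[anchor][j] + eps) / (counts[anchor] + eps)  # conditional prior
        # Positive PMI promotes compatible labels, negative PMI suppresses others.
        refined[j] = logits[j] + alpha * math.log(p_j_given_a / p_j)
    return anchor, refined
```

In this toy form, a label that frequently co-occurs with the anchor is raised and a label that never does is lowered, which mirrors the promote/suppress behaviour the abstract attributes to the closed-form update.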

Is Spurious Correlation Removal Always Learnable?

Yibo Zhou, Bo Li, Hai-Miao Hu, Hanzi Wang, Xiaokang Zhang, Ruifan Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12930v1 Announce Type: new Abstract: Invariant learning can fail even when the invariant structure is statistically identifiable. We show a conditional computational barrier: under a black-box samplable supervised sparse recovery primitive motivated by average-case sparse-recovery reductions, there exist samplable multi-environment instances with a one-dimensional predictive invariant subspace ($k=1$) that are learnable with polynomial samples by exhaustive search, while any polynomial-time constant-accuracy recovery algorithm would contradict the primitive. We further quantify environment diversity by a separation parameter $\gamma$, which controls identifiability and the curvature of invariance objectives. Under sufficient diversity and local Gaussian regularity, the minimax risk is $\mathbb{E}[\mathrm{dist}(\hat{V},V_{\mathrm{inv}})^2]=\Theta(k(d-k)/(n|\mathcal{E}|))$, and under label-induced shifts a phase transition occurs at $n^*\propto k(d-k)/(|\mathcal{E}|\gamma^2)$ with refined estimation error scaling proportional to $1/\gamma^2$. Synthetic and real datasets illustrate the predicted gaps and transitions and motivate simple diversity diagnostics.

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

Wenbo Chen, Puheng Li, Mengyang Liu, Weijie Su, Tianpei Xie — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12935v1 Announce Type: new Abstract: Parallel test-time scaling samples many reasoning traces and majority-votes their answers, improving LLM accuracy but requiring traces to run to completion, incurring substantial computational overhead. We observe that probing partial traces at intermediate checkpoints can extract current answers without disrupting generation, revealing an evolving aggregate vote. Based on this observation, we introduce MARS, a margin-adversarial stopping rule that estimates which active traces are likely to change their answers and stops once the leader remains safe under a conservative bound on future vote movement. The rule separates two sources of uncertainty. It learns the trace-level switch probabilities that determine how much of the current margin is likely to be retained, while handling the harder question of where switching traces land through an adversarial bound calibrated from warmup traces. With true switch probabilities, MARS guarantees with high probability that the early-stopped answer matches the full-budget vote. In practice, a five-feature logistic model closely matches oracle switching behavior. Across three reasoning models and three competition-math benchmarks, MARS saves 25-47% of self-consistency tokens and 14-29% on top of DeepConf Online, a strong confidence-weighted baseline that already filters and truncates weak traces, while matching the accuracy of the corresponding full-budget baselines.
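
The idea of stopping once the leader is safe under a conservative bound can be illustrated with a deliberately simplified sketch. It is not the MARS rule itself: the function name `leader_is_safe` is hypothetical, the check uses expected vote counts rather than a high-probability bound, and it omits the warmup calibration the abstract mentions. It only shows the two-part structure: switch probabilities decide how much of the leader's support is retained, and a worst-case assumption decides where switching votes land.

```python
from collections import defaultdict

def leader_is_safe(answers, switch_probs):
    """Toy early-stopping check for parallel majority voting.

    answers[i]      : current answer probed from active trace i
    switch_probs[i] : estimated probability that trace i changes its answer
    """
    votes = defaultdict(float)
    for a in answers:
        votes[a] += 1.0
    ranked = sorted(votes, key=votes.get, reverse=True)
    leader = ranked[0]
    runner_up_votes = votes[ranked[1]] if len(ranked) > 1 else 0.0

    # Expected number of votes the leader keeps.
    retained = sum(1.0 - p for a, p in zip(answers, switch_probs) if a == leader)
    # Worst case for landing: all expected switching mass goes to the strongest
    # rival. This over-counts on purpose, so the check is loose but conservative.
    moving = sum(switch_probs)
    return leader, retained > runner_up_votes + moving
```

For example, with eight traces on one answer, two on another, and a switch probability of 0.05 each, the leader retains 7.6 expected votes against a worst-case rival total of 2.5, so the check passes.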

An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12936v1 Announce Type: new Abstract: Wet-lab robots can improve the reproducibility, throughput, and safety of biomedical experiments, but scaling their learning requires customizable simulators for safe and reproducible task generation, open editable laboratory assets, and efficient pipelines that turn limited demonstrations into usable training data. We present Pipette, an embodied simulation platform, benchmark, and data-efficient augmentation framework for wet-lab robot learning. Pipette releases over 43 open-source and re-editable wet-lab assets, together with an extensible asset-building pipeline. A key component of Pipette is its simulation-based data augmentation pipeline, which replays human demonstrations in simulation, applies lighting, camera, speed, and action perturbations, and filters generated episodes with automatic task success checks, rapidly expanding usable training data from limited manual demonstrations. We further introduce an 11-task wet-lab embodied benchmark covering sample handling, culture-ware manipulation, device operation, and precision placement. With only 30 demonstrations per task, ACT achieves a 65.5% average success rate, while simulation augmentation improves SmolVLA from 44.1% to 74.7% and $\pi_0$ from 40.4% to 46.5%, validating the effectiveness of Pipette for data-efficient VLA training and evaluation. Pipette also supports natural-language-driven scene construction and task registration, lowering the barrier for non-expert users to define new wet-lab robotic tasks.

MAMVI: 3D Test-Time Adaptation via Masked Multi-View Point Clouds

Inseok Kong, Geunyoung Jung, Jiyoung Jung — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12939v1 Announce Type: new Abstract: 3D point cloud models suffer significant performance degradation under distribution shifts caused by sensor noise, occlusions, and environmental changes. Test-time adaptation (TTA) has emerged as a practical paradigm for mitigating this issue during inference. Recently, leveraging multi-view augmentation has shown promise in improving 3D TTA performance. However, existing multi-view approaches are often constrained by sequential optimization that treats each view independently. This sequential optimization leads to substantial inference latency due to repetitive optimization steps, making real-time adaptation impractical. To address this, we propose Masked Multi-View Test-Time Adaptation (MAMVI), which replaces sequential optimization with a unified single-step adaptation. Specifically, MAMVI utilizes a hybrid masking strategy that combines fixed ratios for stability with Beta-distributed sampling for diversity. By aggregating losses across multiple views, MAMVI performs adaptation through a single backward pass based on multi-view consensus. Additionally, a confidence-based adaptive learning rate is used to dynamically adjust the adaptation intensity for each sample. Extensive experiments on ModelNet-40C, ShapeNet-C, and ScanObjectNN-C demonstrate that MAMVI achieves state-of-the-art accuracy on ShapeNet-C and ScanObjectNN-C. Moreover, it remains competitive on ModelNet-40C while delivering 4.9-8.9 times faster inference, making it highly suitable for real-time applications. Our code is available at https://github.com/Inseok-kong/MAMVI
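
The hybrid masking strategy can be sketched in a few lines, under stated assumptions. The function names, the Beta parameters `alpha` and `beta`, and the choice to mask by random point dropping are hypothetical and are not taken from the paper or its repository. The sketch only shows how fixed mask ratios and Beta-sampled ratios could be combined into one set of views.

```python
import random

def hybrid_mask_ratios(fixed_ratios, n_random, alpha=2.0, beta=2.0, rng=None):
    """Combine fixed mask ratios (stability) with Beta-sampled ratios (diversity)."""
    rng = rng or random.Random(0)
    sampled = [rng.betavariate(alpha, beta) for _ in range(n_random)]
    return list(fixed_ratios) + sampled

def mask_points(points, ratio, rng):
    """Return a masked view that keeps a (1 - ratio) fraction of the points."""
    n_keep = max(1, round(len(points) * (1.0 - ratio)))
    return rng.sample(points, n_keep)

# Build several masked views of one toy point cloud.
rng = random.Random(0)
cloud = [(float(i), 0.0, 0.0) for i in range(100)]
ratios = hybrid_mask_ratios([0.25, 0.5], n_random=3, rng=rng)
views = [mask_points(cloud, r, rng) for r in ratios]
```

In a single-step scheme of the kind the abstract describes, the per-view losses would then be aggregated before one backward pass. That step depends on the model and is left out here.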

Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

Xiang Li, Yixuan Zhou, Jingran Xie, Zhiyong Wu, Hui Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12940v1 Announce Type: new Abstract: Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Modifying the quantizer or increasing model capacity are common fixes, but they complicate downstream language modeling. Our core idea is to align the decoder's internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. Notably, it enables a 4x codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Multiple statistical observations and visualizations corroborate the enhanced internal manifold alignment in the decoder. Extensive experiments confirm its generality across various inductive biases. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

Shu Tong Luo, Wenqin Liu, Rui Liu, Mingming Gong, Jiaxian Guo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12941v1 Announce Type: new Abstract: When a user reveals task-critical information across several conversation turns, LLM accuracy drops by up to 65% despite full context availability. We show that this Lost in Conversation degradation can be substantially mitigated by training models to maintain a compact rolling memory instead of attending to a growing history. To make such training scalable, we introduce a low-cost sharding pipeline that converts single-turn QA datasets into multi-turn fragmented-information episodes, eliminating the need for hours of manual annotation. Training only on sharded GSM8K, our memory-augmented policy significantly improves multi-turn accuracy and generalises zero-shot to harder math and out-of-domain long-context QA. Moreover, memory-trained models outperform full-history baselines even when given the full history at test time, suggesting that learning to compress induces more robust incremental reasoning than full-context exposure alone.

PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization

Hao Jiang, Xin Li, Annan Wang, Zhi Yang, Haoxiang Zhang, Yichi Zhang, Weisi Lin — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12942v1 Announce Type: new Abstract: Generative listwise ranking with Large Multimodal Models (LMMs) aims to capture global list context in a single forward pass, but its effectiveness degrades in long-context multimodal scenarios. We identify a recurring failure mode, parse collapse, where the autoregressive decoder produces fluent yet incomplete rankings by silently omitting candidates and terminating early. This failure stems from limited context utilization rather than simple formatting mistakes, making prompt engineering and constrained decoding insufficient. We propose PRISMR (Parameterized Representation Internalization for Semantic Multimodal Ranking), a framework that replaces transient in-context list processing with parametric structural conditioning. PRISMR uses a lightweight hypernetwork to encode multimodal candidates in parallel and generate item-specific LoRA weights, which are synthesized into an instance-specific adapter for an LMM. This paradigm enables more robust internalization of list structure while preserving the base model. We further introduce a large-scale multimodal review-ranking benchmark for evaluation. Experiments demonstrate that PRISMR substantially reduces parse collapse, improves listwise ranking performance, and transfers effectively across domains and instruction-tuned backbones.

Testing Theory of Truly Concurrent Processes

Yong Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12944v1 Announce Type: new Abstract: A process can execute a set of actions in a predefined manner, while a truly concurrent process executes this set of actions in a manner with the flavour of true concurrency. The so-called truly concurrent process algebras bridge true concurrency (such as Petri nets, event structures, etc.) and interleaving concurrency (such as CCS, CSP, ACP, etc.). In this paper, we give truly concurrent processes a testing semantics, following Hennessy's work, which inherits the trinity of operational semantics, axiomatic semantics and denotational semantics.

Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

Zhibao Chen, Qian Cheng — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12945v1 Announce Type: new Abstract: Long-running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode deeply, what to forget, and what to retrieve under a fixed memory budget. Production systems answer with semantic similarity or recency -- both mis-specified for the forgetting decision, which is made at consolidation time before the future query is known. We propose a multi-factor memory value function $V(m)=\sum_i w_i f_i(m)$ over seven interpretable factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) drawn from cognitive psychology, whose weights are learned from a downstream objective by a gradient-free optimiser, and whose single scalar uniformly controls encoding depth, forget risk, and retrieval rank. We make a methodological point: on LongMemEval, scoring goal relevance against the held-out evaluation question saturates gold-evidence retention at $\approx 0.98$ -- this measures retrieval, not forgetting. In the realistic blind regime, a learned multi-factor value retains $0.770 \pm 0.011$ of gold evidence across 479 usable cases, versus 0.657 for uniform weights, 0.518 for the best single factor, and 0.368 for recency; every paired gap's 95% bootstrap CI is above zero, and a neural network over the same factors ties the linear model. The learned weights are interpretable -- reliability, emotional intensity, and self/user relevance dominate, while query-time goal similarity is correctly down-weighted for the forgetting decision. A controlled synthetic task with planted confounds confirms the learner recovers a separating weighting (1.00 retention) where uniform weighting fails (0.62). The substrate is open-source; all experiments run on a single CPU with no API calls.
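
The value function V(m) = sum_i w_i f_i(m) and its use under a fixed memory budget can be sketched as follows. The factor names follow the list in the abstract. The function names, the dictionary layout of a memory, and any weight values passed in are hypothetical, and the sketch does not include the gradient-free weight learning.

```python
FACTORS = (
    "emotional_intensity", "goal_relevance", "value_alignment",
    "self_user_relevance", "task_utility", "reliability", "usage_history",
)

def memory_value(factors, weights):
    """Linear multi-factor value: V(m) = sum_i w_i * f_i(m)."""
    return sum(weights[name] * factors[name] for name in FACTORS)

def consolidate(memories, weights, budget):
    """Keep the `budget` highest-value memories; the rest are forgotten."""
    ranked = sorted(
        memories,
        key=lambda m: memory_value(m["factors"], weights),
        reverse=True,
    )
    return ranked[:budget]
```

Because the same scalar ranks every memory, the forgetting decision here needs no knowledge of the future query, which is the constraint the abstract emphasises.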

Data Aphasia: An Institutional Counterfactual Study of the Stability of Academic Cognition Under Letter-Grade Evaluation Systems

Li Li, Yu Cao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12946v1 Announce Type: new Abstract: Does the letter-grade evaluation system, while achieving its burden-reduction goals, affect the education system's stable understanding of students' academic structures? This paper introduces the concept of "data aphasia," referring to restrictions on diagnostic information expression caused by institutionally mandated forms of data presentation. Using data from 68 mathematics examinations administered to 75 primary school students, we employ an institutional counterfactual simulation method to convert percentage scores into A/B/C/D letter grades and conduct systematic tests at the information, structural, and diagnostic levels. Results show that information entropy decreases by approximately 69% after grade conversion; under the full sample, the letter-grade system appears superficially stable (K=4), but removing a single extreme anchor student causes the optimal K to increase from 4 to 8 and individual diagnostic identity consistency to fall from 95% to 62%; temporal consistency fluctuates between 52% and 96%, far below the 93%-96% baseline of the percentage system. Mechanism analysis indicates that discretization compresses the feature space by approximately nineteenfold across 68 examinations; after standardization, it creates extensive pseudo-heterogeneity regions, flattens density gradients, and makes clustering boundaries highly sensitive to minor perturbations. Based on these findings, this paper proposes a dual-track evaluation mechanism and provides a testable analytical framework for understanding the cognitive costs of educational evaluation reform.
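
The information loss from converting percentage scores to letter grades can be illustrated with a small Shannon entropy calculation. The grade cut-offs in `to_letter` and the sample scores are hypothetical and are not the paper's data, so the size of the drop here says nothing about the reported 69% figure.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def to_letter(score):
    """Map a percentage score to A/B/C/D using hypothetical cut-offs."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    return "D"

scores = [95, 91, 88, 84, 79, 76, 72, 65, 58, 93]  # made-up sample
grades = [to_letter(s) for s in scores]

h_scores = shannon_entropy(scores)
h_grades = shannon_entropy(grades)
relative_drop = 1.0 - h_grades / h_scores
```

Discretization can only merge distinct scores into the same category, so the entropy of the grades never exceeds that of the underlying scores.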

ViPER: Vision-based Packing-Aware Encoder for Robust Malware Detection

Fatima Qaiser, Bisma Tahir, Muhammad Abid Mughal, Nauman Shamim — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12949v1 Announce Type: new Abstract: Visualization-based malware detection maps raw binary bytes to grayscale images and applies learned visual classifiers, providing an evasion-resistant and disassembly-free alternative to conventional analysis pipelines. However, executable packing remains a critical failure mode: packed binaries produce high-entropy images that obscure the structural patterns these models rely on. Because packing is also prevalent in benign software (e.g., for compression or copy protection), packing state alone is not a reliable indicator of maliciousness, and existing approaches do not address this challenge within a unified supervised framework. We present ViPER, a Vision-based Packing-Aware Encoder for Robust malware detection. ViPER builds on a LoRA-adapted ViT-B/14 backbone with a dual-head architecture that jointly learns malware classification and packing detection. A packing-aware gating mechanism conditions malware predictions on the inferred packing state, enabling distinct decision boundaries for packed and unpacked inputs. To address packing label skew during training, we employ frequency-weighted losses with stratified sampling over joint class-packing strata. Evaluated on 200,000 Windows PE byteplot images, ViPER achieves a balanced accuracy of 0.8521, ROC-AUC of 0.9260, and AUPR of 0.9279, outperforming representative state-of-the-art baselines across all primary metrics, while attaining a packing detection AUC of 0.9949.

Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12950v1 Announce Type: new Abstract: Large Language Model-based Multi-Agent Systems (LLM-MAS) have emerged as a powerful paradigm for tackling complex tasks by breaking them into collaborative workflows of specialized LLM-powered agents. However, deploying such multi-agent workloads at scale poses significant system challenges. Each user query spawns an iterative pipeline of LLM calls, greatly amplifying resource consumption compared to single-turn queries. In resource-constrained cloud settings, these workflows face non-deterministic and input-dependent costs at the decode stage, heavy-tailed multi-model requirements with memory fragmentation and over-provisioning, and cross-cluster scheduling trade-offs. We present Maestro, a workload-aware scheduling system designed for LLM-MAS serving under strict GPU budgets. Maestro explicitly leverages agent semantics and roles: it predicts the output length and memory usage of each stage and uses this prediction to drive a hierarchical scheduler. At the node level, Maestro enables dynamic multi-model co-location via hierarchical weight caching and elastic memory provisioning. At the cluster level, it performs latency-aware routing to avoid cold-start delays and memory overloads. At the global level, it enforces workflow-aware prioritization to minimize head-of-line blocking for interactive tasks. Across prototype experiments and trace-driven simulations, Maestro reduces KV-reservation HBM by 67.2% and improves high-contention SLO attainment over EDF by 23.6 percentage points.

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12953v1 Announce Type: new Abstract: We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757) among BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code and a publicly available interactive demo as a reproducible baseline for the community.

Towards Reliable Sequential Object Picking in Clutter: The Runner-up Solution to RGMC 2025

Wei Yu, Xidan Zhang, Ziyi Zheng, Weijie Kong, Huixu Dong — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12954v1 Announce Type: new Abstract: As a long-standing challenge in robotic manipulation, stable and efficient grasping in cluttered environments is of great importance in industrial settings. While recent studies have achieved relatively high success rates in grasping from clutter, there remain few mature solutions for more demanding tasks such as sequential object search and sorting. This work addresses sequential object picking in cluttered environments based on the Cluttered Environment Picking Benchmark (CEPB) and presents our solution to the Pick-in-Clutter track of the 10th Robotic Grasping and Manipulation Competition (RGMC) at ICRA 2025. The task poses several key challenges. First, it requires robust and collision-aware grasping with high success rates across a diverse set of objects, including both rigid and deformable ones. Second, it demands efficient search for target objects, which places stringent requirements on the decluttering and searching strategies of the solution. To address the above challenges, we design an integrated hardware-software pipeline that combines object recognition, decluttering, and multi-modal grasping. The main contributions include the hardware design of a multifunctional gripper and novel representations for object distribution and occlusion relationships in cluttered space. This pipeline enables efficient recognition, search, and sequential grasping of objects in clutter, demonstrating strong performance in both laboratory tests and competition scenarios, and ultimately achieving second place in the Pick-in-Clutter track of the RGMC 2025.

Data-Driven Frequency-Selective Output Regulation of Nonlinear Systems under Almost Periodic Exosignals

Yifei Li, Wenjie Liu, Gang Wang, Lihua Xie — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12955v1 Announce Type: new Abstract: This paper studies output regulation for a class of unknown continuous-time nonlinear systems driven by almost periodic exosignals. The plant dynamics are assumed to be linearly parameterized over a prescribed nonlinear dictionary, while all coefficient matrices in the plant, input channel, output map, and exosignal channel are unknown. Since the plant model is unavailable, exact nonlinear output regulation would generally require model identification followed by the solution of nonlinear regulator equations. To avoid these steps, we pursue a frequency-selective regulation objective: the steady-state regulation error is allowed to be almost periodic, but its Fourier-Bohr coefficients at prescribed exosystem frequencies are guaranteed to vanish, and the residual error energy is explicitly bounded. To this end, a p-copy internal model is embedded in a dynamic controller, yielding an augmented nonlinear system whose unknown constant matrices are represented directly by measured data. A noise-robust semidefinite program is derived to synthesize the controller gain without model identification and without measuring the exosignal amplitudes or phases. The resulting closed-loop vector field is made exponentially contractive on a prescribed operating set, which implies the existence and uniqueness of a bounded and attracting trajectory. By combining contraction theory with Fourier-Bohr analysis, we prove that this steady-state trajectory is almost periodic, that the embedded-frequency components of the regulation error are eliminated, and that the unmodeled spectral components satisfy a Parseval-type time-averaged energy bound. Numerical and physics-based simulations on a quadrotor with a cable-suspended payload illustrate the effectiveness of the proposed data-driven internal-model design.

SERF: Spatiotemporal Environment and Robot Feature Map for Long-Horizon Mobile Manipulation

Sunghwan Kim, Byeonghyun Pak, Kehan Long, Yulun Tian, Nikolay Atanasov — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12956v1 Announce Type: new Abstract: Long-horizon robot mobile manipulation requires continual reasoning about localization, environment changes, and task progress, all of which are challenging to infer from image observations alone. In this paper, we show that conditioning a mobile manipulation policy on a spatiotemporal feature map improves reasoning over long horizons. The map represents the environment and the articulated robot body as neural points in a shared latent space and is updated online from egocentric observations and proprioceptive state. We update the environment neural points using object-level rigid tracking and the robot neural points using forward kinematics. We use our spatiotemporal environment and robot feature (SERF) map as a state input to a vision-language-action (VLA) model by extracting map tokens from multiple reference frames and spatial scales, providing the policy with both local and global context. We demonstrate SERF on BEHAVIOR-1K, a benchmark for long-horizon mobile manipulation in household environments. Experiments show that the SERF VLA policy outperforms image-only baselines, reaches subgoals faster by following more direct trajectories, improves robustness to scene-configuration shifts, and recovers from object-drop failures.

YOLO-AMC: An Improved YOLO Architecture with Attention Mechanisms for Building Crack Detection

Ching-Yu Tsai, Chia-Min Lin, Chih-Hsiang Yang, Yung-Che Wang, Jen-Shiun Chiang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12958v1 Announce Type: new Abstract: Crack detection plays an important role in infrastructure inspection and Structural Health Monitoring (SHM). However, cracks typically appear as thin, low-contrast structures and are easily affected by background noise, posing challenges for existing object detection models. This study proposes an improved YOLO-based architecture with integrated attention mechanisms, termed YOLO-AMC (YOLO with Attention Mechanisms for Crack Detection), to enhance automated crack detection performance. Based on YOLOv11, the original C2PSA module is removed, and multiple attention mechanisms, including Global Attention Mechanism (GAM), Residual Convolutional Block Attention Module (Res-CBAM), and Shuffle Attention (SA), are introduced into the multi-scale feature fusion layers of the Neck to strengthen cross-scale feature integration. Experimental results demonstrate that YOLO-AMC consistently outperforms baseline models YOLOv11n and YOLOv8n across multiple evaluation metrics. Among the evaluated attention modules, GAM achieves the best detection performance, obtaining mAP@0.5 = 0.9917 and mAP@0.5:0.95 = 0.9506 on the test dataset, which are higher than those of YOLOv11 (0.9833 / 0.9112) and YOLOv8 (0.9707 / 0.8921). Furthermore, while maintaining a computational complexity of 7.6 GFLOPs, the proposed model achieves 110.95 FPS on an NVIDIA RTX 4090 platform and approximately 5 FPS on a Raspberry Pi 5 edge device, demonstrating a favorable trade-off between accuracy and deployment efficiency. The implementation code for this study is available on GitHub at https://github.com/CY-Tsai24/YOLO-AMC.

ScaleAcross: Designing Multi-Data-Center Infrastructure for Geo-Distributed AI Training

Naved Inam, Aryan Alpesh Bhavsar, Masabattula Teja Nikhil, Sidharth Sharma — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12963v1 Announce Type: new Abstract: The rapid growth of AI models and increasing data sovereignty requirements are driving the transition toward geo-distributed AI training across multiple data centers. Such deployments introduce system-level challenges arising from synchronization-intensive communication, cross-site data exchange, and wide-area latency constraints. This paper investigates EVPN-VXLAN as an infrastructure foundation for geo-distributed AI training environments and presents a scalable emulation framework for systematically studying distributed AI workloads under realistic wide-area conditions. The proposed framework combines VXLAN overlays with EVPN-based inter-data-center connectivity and is implemented using ContainerLab and FRRouting (FRR). The framework further incorporates Equal-Cost Multi-Path (ECMP) routing, Bidirectional Forwarding Detection (BFD), and a queue-pair-aware traffic distribution mechanism designed to improve communication behavior for synchronization-intensive AI workloads while preserving compatibility with commodity infrastructure. Using realistic WAN emulation, we characterize communication and system behavior under distributed training workloads employing AllReduce and Parameter Server communication patterns. Results provide insights into traffic distribution, resilience, and infrastructure behavior in geo-distributed AI environments, highlighting the potential of reproducible multi-data-center infrastructure frameworks for scalable distributed AI training.

EmbodiSteer: Steering Embodiment-Agnostic Visuomotor Policies with Joint-Space Guidance for Zero-Shot Cross-Embodiment Deployment

Shihefeng Wang, Kangchen Lv, Mingrui Yu, Xiang Li — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12965v1 Announce Type: new Abstract: Scalable robot imitation learning relies on large-scale heterogeneous data from diverse robots or body-free data, making Cartesian end-effector actions a key interface for embodiment-agnostic policy learning. However, end-effector-only abstraction leaves Cartesian policies unaware of the deployed robot body, making them brittle under robot-specific constraints such as whole-body collision avoidance. To overcome this limitation, we present EmbodiSteer, a training-free framework that steers embodiment-agnostic visuomotor policies toward zero-shot, embodiment-aware deployment. EmbodiSteer keeps policy learning in Cartesian space while efficiently lifting inference-time diffusion sampling into the target robot's joint space via forward kinematics and Jacobian-based updates. With whole-body collision-aware guidance over joint trajectories after each denoising step, the arm can be steered away from collisions while preserving learned end-effector behavior. Compared with Cartesian-only execution, EmbodiSteer reduces collision rate by 46.1% and improves task success rate by 28.5% across 9 simulated robots, and further achieves 90.0% collision rate reduction and 36.7% success rate increase on two physical robots in highly constrained scenarios. Our project page is at https://frankwang67.github.io/EmbodiSteer-Page.

Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers

Achyuthan Sivasankar — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12966v1 Announce Type: new Abstract: Grokking -- where a transformer on modular arithmetic suddenly transitions from near-chance to near-perfect validation accuracy -- is attributed to a Fourier circuit, but its timing, causal structure, and controllability remain poorly understood. We introduce the Frequency Synchronization Degree (FSD), a normalised, permutation-tested metric for Fourier circuit synchronisation requiring no prior circuit knowledge. Across nine modular addition configurations (primes p in {53, 71, 97, 113, 131}, three seeds), FSD synchronises 500-3,000 steps before grokking (mean lead +1,722 steps; all nine positive, sign-test p~0.004), and precedes a restricted-logit loss baseline (Nanda et al.'s excluded loss) in all nine cases, making it the earliest available predictor. We provide direct causal evidence that the inter-phase gap is a regularisation phenomenon: forking training at the FSD-ceiling step and varying weight decay lambda produces strictly monotone earlier grokking, with Delta_t proportional to 1/lambda. This law replicates across three primes (p in {53,97,131}; R^2=1.00 and R^2=0.99 for two clean cases), captured as Delta_t ~ C/lambda, consistent with (1/lambda)*log(||W_mem||/tau). Architecture ablations show an attention-only model groks with a strong FSD precursor; an MLP-only model never groks; a single-layer model's FSD lags, confirming the precursor is a multi-block circuit property.
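The reported sign-test value can be checked with a few lines of arithmetic. The sketch below assumes a two-sided exact sign test, which the abstract does not specify; the function name `sign_test_p` is introduced here for illustration only.

```python
from math import comb


def sign_test_p(n_pos: int, n: int) -> float:
    """Two-sided exact sign test p-value under H0: P(positive) = 0.5.

    Assumption: two-sided test; the abstract does not say which form was used.
    """
    k = max(n_pos, n - n_pos)
    # Probability of a result at least as extreme in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(sign_test_p(9, 9))  # 0.00390625
```

Nine positive leads out of nine give 2 * (1/2)^9 = 0.0039, which rounds to the p~0.004 quoted in the abstract.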

Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models

Quan Quan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12969v1 Announce Type: new Abstract: The power distribution network is critical to reliable electricity delivery, yet traditional inspection methods face limitations in semantic understanding, generalization, and closed-loop automation. To address these challenges, this paper proposes a Multi-Modal Agent framework specifically for power distribution defect detection. Central to this study is the systematic evaluation of multimodal foundation models as unified cognitive engines. We rigorously assess their integrated performance across three critical capabilities: (1) Perception, where the model must accurately identify equipment and generate expert-level descriptions of defects; (2) Reasoning, where the model interprets visual findings to diagnose causes, assess severity, and plan maintenance strategies based on domain knowledge; and (3) Tool Usage, where the model acts as an autonomous operator to execute actions -- such as querying knowledge bases or generating work orders -- to achieve closed-loop maintenance. To support this evaluation, a domain-specific evaluation dataset and a comprehensive benchmark are developed. Experimental results demonstrate the strengths and limitations of current foundation models in these three dimensions, providing empirical evidence for deploying autonomous agents in high-stakes industrial environments.

Binary Search Variants: A Comprehensive Analysis

Ali Dasdan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12970v1 Announce Type: new Abstract: Binary search is deceptively simple in concept yet notoriously difficult to implement correctly. This paper presents a unified treatment of binary search: five core variants, six derived query functions, and four standard library implementations (BSD, glibc, Java, C++ STL), each with consistent notation, loop invariants, and analysis. We introduce bsearch_ultimate, a combined search that subsumes all variants in a single call. Every algorithm is provided as synchronized Python code, Dafny formal proof, and pseudocode. All implementations are validated by over 9,500 tests and 21 Dafny formal verifications; an additional six deliberately faulty implementations demonstrate common bug categories and Dafny's ability to detect them. We also provide memorable rules linking boundary choices to loop conditions and update formulas.
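To illustrate how a boundary choice ties to the loop condition and update formulas, here is a minimal Python sketch of one common variant, a lower-bound search over a half-open interval. The name `lower_bound` and the interval convention are assumptions made for this example; this is not the paper's notation, nor its bsearch_ultimate.

```python
from typing import Sequence


def lower_bound(a: Sequence[int], target: int) -> int:
    """Return the first index i with a[i] >= target, or len(a) if none.

    Assumes `a` is sorted in non-decreasing order.
    """
    lo, hi = 0, len(a)  # half-open search interval [lo, hi)
    # Invariant: every a[k] with k < lo is < target,
    #            every a[k] with k >= hi is >= target.
    while lo < hi:  # half-open interval, so stop when it is empty
        mid = lo + (hi - lo) // 2  # lo <= mid < hi, so the interval shrinks
        if a[mid] < target:
            lo = mid + 1  # a[mid] is too small, exclude it
        else:
            hi = mid  # a[mid] may be the answer, keep it at the boundary
    return lo
```

With a half-open interval the loop condition is `lo < hi` and the upper update is `hi = mid`; a closed interval would instead pair `lo <= hi` with `hi = mid - 1`.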

Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations

Tahiya Chowdhury — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12971v1 Announce Type: new Abstract: Estimating cognitive load from speech has largely been studied in controlled laboratory settings, with limited understanding of its reliability in natural collaborative conversations. We investigate whether speech and interaction dynamics predict perceived cognitive load during dyadic conversations. We analyze audio from 53 dyads performing nine collaborative tasks and extract static acoustic, dynamic, and interaction features to train a two-head Gated Recurrent Unit encoder to predict cognitive load scores. Results show conversational interaction provides useful signals for predicting cognitive load related to time pressure, mental work, effort, and task performance. Temporal demand is associated with turn-taking dynamics such as overlap and speaker switch, while mental demand is linked to imbalanced participation between speakers. These findings highlight the importance of task structure and conversational interaction for modeling cognitive load in natural collaborative settings.

From Prompts to Preferences: An Open-Source Platform for Generative AI-Enhanced Conjoint Analysis

Philipp Brauner — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12972v1 Announce Type: new Abstract: Conjoint analysis is a widely used preference measurement method in marketing research, political science, healthcare, and human-computer interaction. Despite broad adoption, researchers without access to commercial platforms face significant barriers, as existing tools are either expensive or lack end-to-end survey infrastructure. This paper presents an open-source, self-hosted web application for designing, deploying, and analysing conjoint surveys. Beyond conventional tabular stimuli, the platform uses generative AI to produce integrated stimuli formats: textual scenario descriptions generated by a large language model, and visual stimuli by a text-to-image model. A researcher-defined base prompt is parameterised with the conjoint profile, and optional LLM-facing level annotations enrich the generation. A structured setup wizard, AI-assisted attribute suggestion, and live data analysis lower the technical barriers for researchers new to conjoint methodology. A full export bundle including all stimuli, their generating prompts, and response data facilitates transparency and reproducibility. The platform is demonstrated through a proof-of-concept study on care robot preferences for ambient assisted living (AAL, N=55) using AI-generated visual stimuli. The paper discusses the role of AI assistance in conjoint design, arguing that theoretical grounding must remain the researcher's responsibility, and outlining how genAI-generated stimuli can broaden the methodological repertoire for HCI and related fields.

A Robust Helmholtz-Decomposition-Based Real Compressed Layer Method for Time-Harmonic Elastic Wave Scattering

Li-Lian Wang, Lu Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12974v1 Announce Type: new Abstract: Time-harmonic elastic wave scattering involves both compressional (P-) and shear (S-) waves, which propagate with different wavenumbers and polarization characteristics. The naive construction of perfectly matched layer (PML)-type methods based on complex coordinate stretching may lack robustness, or even fail, particularly when the wavenumbers are highly contrasted. The recently developed real compressed layer (RCL) technique, built upon real compression transformations and explicit extraction of the resulting oscillatory patterns for time-harmonic Helmholtz problems, may not work, since the oscillations cannot be explicitly extracted by a single change of variables. This paper intends to bridge this gap by developing a robust RCL method for two-dimensional time-harmonic elastic wave scattering in unbounded domains with compactly supported inhomogeneities. A key observation is that, through the Helmholtz decomposition, the displacement field in the exterior homogeneous region decouples into P-wave and S-wave components, each of which has a distinctive separation of its oscillatory pattern and decaying behaviour in polar coordinates. We then apply the real compression coordinate transformation in the radial direction to each component. We further propose a coupled displacement-potential RCL formulation that seamlessly integrates the Helmholtz-decomposed wave components with the interior displacement field. We show that, under this framework, the essential oscillations in the layer can be effectively removed. We prove the well-posedness of the resulting coupled problem and establish the exponential convergence of the RCL solution to the original scattering solution in the truncated domain of interest. We discretize the RCL system using a high-order spectral element method and demonstrate the effectiveness and robustness of the proposed method through ample numerical results.

A Mathematical Forum Platform for Collaborative Problem Solving and Dataset Generation for AI Reasoning

Akbar Erkinov, Nurmukhammad Abdurasulov — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12976v1 Announce Type: new Abstract: Sharing mathematical content in online forums remains a significant friction point for students and educators: writing raw LaTeX is error-prone, standalone optical character recognition tools require platform switching, and current forum software offers no integrated path from a photograph of a formula to a rendered post. We present a unified system that eliminates this friction by embedding an image-to-LaTeX conversion pipeline directly inside a forum posting interface. A user uploads or captures an image of a mathematical expression; the system routes it through the Mathpix OCR API, detects whether the returned output is LaTeX or plain text containing inline math, applies the appropriate delimiter normalisation, and renders a live preview in either LaTeX or Markdown mode before the post is committed to the database. The architecture is organized in three loosely coupled layers (image processing, rendering, and storage) and supports both desktop and mobile clients. A provisional US patent application has been filed covering the core methods. We describe the full system design, each component in detail, the data schema, and the key technical innovations, and we position the work against existing standalone tools and forum platforms to demonstrate the practical gap it closes. Beyond immediate usability, we argue that a deployed platform of this kind constitutes a continuously growing, community-validated dataset of mathematical problems and step-by-step solutions, a resource that can be used to train and benchmark AI systems for accurate mathematical reasoning.

Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12977v1 Announce Type: new Abstract: Model fingerprinting, embedding user-specific identifiers (fingerprints) into generated outputs, has recently emerged as a popular solution to protect the intellectual property rights (IPR) of generative text-to-image (T2I) models and prevent unauthorized redistribution. In this work, we reveal a previously unexplored systematic vulnerability in existing generative model fingerprinting methods: they lack robustness against collusion attacks, where multiple attackers combine their models to remove or obscure the fingerprints. To address this issue, we take the first step towards a robust fingerprinting method for T2I models with anti-collusion capabilities. The proposed method encodes strings of bits, namely fingerprints, into the coefficients of a personalized normalization module (PNM) incorporated into T2I models, so that fingerprints can be reliably recovered from any generated image. To defend against collusion attacks and prevent unauthorized model redistribution, we introduce an anti-collusion mechanism based on lossless function-invariant parameter transformations. This mechanism significantly degrades the image generation quality of colluded models, making them effectively unusable. Moreover, our method allows developers to efficiently create multiple copies of fingerprinted T2I models by reparameterizing the PNM without the need for retraining. We also introduce a worst-case optimization strategy to improve robustness against model-level attacks. Our experiments demonstrate that the proposed method achieves high fidelity and robustness across multiple T2I image generation and editing tasks, with fingerprint extraction accuracy exceeding 99.5%. Compared with existing methods, our method demonstrates, for the first time, a notable proactive robustness to collusion attacks by significantly increasing the FID of colluded models.

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12978v1 Announce Type: new Abstract: Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still $\textit{appears}$ to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as $\textit{command-preserving trajectory redirection}$, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/

EPM-JEPA: Operator-Side Experience Modulation in JEPA-Family World Models

Vedant Pandya — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12979v1 Announce Type: new Abstract: JEPA-family world models use a static predictor whose weights do not adapt when test-time dynamics diverge from training. We compare two mechanisms for incorporating accumulated experience into a JEPA predictor under distribution shift: operand-side injection, where a compressed experience representation is added as a residual to the predictor's hidden state (EI-JEPA), and operator-side modulation, where the same representation generates low-rank weight deltas via LoRA applied to the predictor's weights (EPM-JEPA). On a pre-registered comparison (Moving MNIST, gravity shift), EPM-JEPA (D_shift^{n=50} = 0.7848 +/- 0.0078, three seeds) differs from EI-JEPA (0.8238) by delta = 4.74%, which falls under Outcome C (a null result) and is, by our stated criterion, a valid outcome. As a secondary, non-pre-registered observation, EPM-JEPA improves 1.90% over a no-memory baseline (0.8000), consistently across seeds, while EI-JEPA underperforms the baseline, indicating the benefit is specific to weight-level modulation. Our primary contribution is a mechanism analysis: the D_shift^{n=50} trajectory reflects three independent dynamical processes - buffer cycling, EMA target drift, and an intrinsic LoRA settling transient of +0.021 - rather than convergence to equilibrium. These findings motivate PEM-JEPA, a physics-grounded successor addressing this dynamical-peak limitation.

Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X

Muhammad Shahbaz, Shaurya Agarwal — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12981v1 Announce Type: new Abstract: We describe a Camera and LiDAR fusion detector developed for the TUMTraf V2X cooperative 3D object detection track of the DriveX 2026 challenge. The detector fuses three roadside cameras with a fused infrastructure-plus-vehicle point cloud in a shared bird's-eye-view space and predicts boxes through a CenterPoint-style head with a generalized IoU regression loss and an IoU quality re-ranking head. Trained on the provided train and validation splits, the model reaches a 3D mAP of 0.85 on the public Codabench test split. While iterating on the system, we observed that 44 of the 50 test frames are also present in the released train (40) and validation (4) splits with their labels. We therefore conducted two additional studies to quantify how this overlap affects the final score: (1) a finetuning run that oversamples the 44 overlapping frames, reaching 0.89 mAP, and (2) a post-processing run that replaces predictions on those frames with the released ground truth, reaching 0.99 mAP (uploaded to our Codabench account for testing but not published on the leaderboard). All three configurations and their per-class results are reported.

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12983v1 Announce Type: new Abstract: Automated testbench generation has become a critical bottleneck in large language model (LLM)-driven Register Transfer Level (RTL) workflows, where large numbers of candidate designs must be verified rapidly and reliably. Existing prompt-based approaches treat testbench generation as unconstrained code synthesis, yielding stochastic outputs with high token cost, low reproducibility, and insufficient coverage. To address this gap, we present STG, a Structured Testbench Generation framework that exploits the inherent structure of hardware designs to generate deterministic testbenches. As a direct verification tool, STG runs 720x faster than an iterative LLM-based testbench generation flow, attains a higher rate of successful compilation, achieves higher coverage, and reduces false-pass verdicts on incorrect DUTs. STG also helps identify errors in RTL generation benchmarks by exposing faulty benchmark testbenches. As a data curation engine, it is 11x faster than LLM-based filtering on a single CPU core with 127x less energy, and the resulting distilled models provide state-of-the-art performance in our multi-benchmark evaluation. As a test-time scaling oracle, it reduces node count by 14-47%. Our models are available at https://huggingface.co/collections/AS-SiliconMind/siliconmind-v12.

SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants

Yimin Hu, Mengtao Xu, Hao Guo, Yuheng Song, Xiaoyong Zhu, Bo Zheng — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12984v1 Announce Type: new Abstract: Image-based AI assistants are now deployed at production scale on e-commerce platforms, where a single uploaded image can trigger fundamentally different user intents: product search, style recommendation, visual encyclopedia, or utility tool calls, each demanding its own response format, tool invocation, and domain knowledge. Without per-intent behavioral constraints, LLM-based systems conflate these heterogeneous modes and fall short of domain quality standards, while the breadth and dynamism of the intent space render manual engineering infeasible. To address this, we present SkillChain, which closes the production feedback loop on Skill evolution, automating the lifecycle of Skills through three stages: Skill Creator for bootstrapping from task specs and trajectories, Route Optimizer for routing alignment, and Body Refiner for iterative Skill Body refinement via dual-path LLM-Judge evaluation. Deployed on a production-scale e-commerce image assistant, SkillChain substantially improves aggregate response quality, with the strongest gains on structural compliance and content quality; a one-week online A/B experiment further confirms significant gains in user engagement, content consumption, and long-term retention.

Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12985v1 Announce Type: new Abstract: Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors. We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at https://github.com/sathiiii/BabyMind.

The Rise of AI-Native Software Engineering: Implications for Practice, Education, and the Future Workforce

Mamdouh Alenezi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12986v1 Announce Type: new Abstract: Generative Artificial Intelligence (GenAI), Large Language Models (LLMs), and emerging Agentic AI constitute the most disruptive transformation in the history of software engineering (SE), reshaping development processes, required competencies, professional roles, and the educational outcomes that universities must deliver. This paper presents a systematic review of 48 verified, influential peer-reviewed publications (2016--2026) drawn from leading venues in software engineering, machine learning, computing education, human--AI collaboration, and software productivity. Studies were discovered, screened, and analyzed through a four-agent research workflow (Literature Discovery, Scientometric Analysis, Curriculum Transformation, and Workforce Impact) and were verified against primary sources. We synthesize the evidence along nine themes and three trajectories -- practice, education, and workforce -- and report a scientometric inflection in which annual LLM-for-SE output grew roughly five-fold after late 2022. From this synthesis we contribute: (i) a conceptual framework for AI-native software engineering organized around \emph{intent}, \emph{collaboration}, and \emph{verification}; (ii) a nine-dimension competency model spanning specification, critical evaluation, agent orchestration, and metacognition; (iii) a four-phase university curriculum roadmap with AI-resilient assessment; (iv) faculty-development and workforce-transformation strategies; and (v) a prioritized agenda of eleven research gaps. The evidence base is internally contradictory on the magnitude and direction of productivity effects, underscoring that benefits are strongly context-dependent and that educating engineers for judgment, verification, and orchestration -- rather than code production alone -- is the central challenge of the AI-native era.

Diffusion Transformer World-Action Model for AV Scene Prediction

Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12987v1 Announce Type: new Abstract: Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $\rho = 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.

A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis

Manex Atxa, Bruno Simoes, Julen Balzategui — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12988v1 Announce Type: new Abstract: This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in three dimensions. Although the methodology was designed for ergonomic assessments, it can be adapted to other applications requiring real-time analysis of human posture. One aspect that makes this system stand out is its ability to analyze 3D point clouds during the assessment, enabling computation from multiple angles. This overcomes a critical limitation of cameras, which often provide a fixed viewpoint, thereby restricting the data available for a thorough postural evaluation, especially when occlusions occur. The system continuously and automatically performs pose inference using the chosen perspective on the real-time streaming data; however, only the poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The methodology has been refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on this data and, following the training phase, performs inference on new streaming data in real time. This research offers a scalable and pragmatic approach for real-time ergonomic evaluation by combining state-of-the-art 3D data technologies and traditional 2D pose estimation algorithms. It addresses the increasing need for safety and health monitoring in workplace environments, marking a notable contribution to the domain.

Proportional power dispatch and fairness in wind farm power tracking

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12989v1 Announce Type: new Abstract: Controlling the power output of a wind farm to track a target signal can be useful for power grid frequency regulation. This can be achieved by dividing the target into individual setpoints, each then followed by the corresponding turbine's controller. In this article, we are interested in finding power allocations that fairly spread the power reserves (i.e., the unused fraction of available power) among turbines, which helps with robustness to uncertainties and changing wind conditions. In particular, we study the fairness properties of proportional dispatch, which is the most common power dispatching method. We show that, due to wake effects in a wind farm, proportional dispatch has to be applied iteratively to achieve a fair distribution of power reserves. We study the convergence of this iterative process (referred to as IPD) to equalized reserves, and then illustrate it in simulated experiments, using steady-state and dynamic simulators. Numerical results show that IPD closely approaches max-min fairness, a related fairness objective, at a low computational cost compared to black-box optimization. Finally, IPD is also shown to reduce the complexity of the problem of fair power dispatch combined with yaw wake steering optimization.
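The idea of applying proportional dispatch repeatedly can be sketched in a few lines of Python. Everything below is a toy illustration under stated assumptions, not the paper's IPD algorithm: the two-turbine wake model `toy_available`, its numbers, and all function names are hypothetical and chosen only to show why a single proportional split leaves unequal reserves once available power depends on the setpoints.

```python
from typing import Callable, List


def proportional_dispatch(target: float, available: List[float]) -> List[float]:
    """Split `target` among turbines in proportion to their available power."""
    total = sum(available)
    return [target * a / total for a in available]


def reserve_fractions(setpoints: List[float], available: List[float]) -> List[float]:
    """Unused fraction of available power for each turbine."""
    return [1.0 - s / a for s, a in zip(setpoints, available)]


def toy_available(setpoints: List[float]) -> List[float]:
    """HYPOTHETICAL two-turbine wake model (illustration only).

    Turbine 0 is upstream with a fixed 2.0 units of available power.
    Curtailing it weakens its wake, so turbine 1 gains up to 0.5 units.
    """
    upstream_avail = 2.0
    curtailment = 1.0 - setpoints[0] / upstream_avail
    return [upstream_avail, 1.0 + 0.5 * curtailment]


def iterate_dispatch(
    target: float,
    available_fn: Callable[[List[float]], List[float]],
    start: List[float],
    n_iter: int = 50,
) -> List[float]:
    """Re-apply proportional dispatch, updating available power each round."""
    setpoints = list(start)
    for _ in range(n_iter):
        setpoints = proportional_dispatch(target, available_fn(setpoints))
    return setpoints


target = 2.0
greedy = [2.0, 1.0]  # both turbines initially at full (uncurtailed) output
one_shot = proportional_dispatch(target, toy_available(greedy))
final = iterate_dispatch(target, toy_available, greedy)

# Reserves evaluated against the available power that results from each dispatch.
print(reserve_fractions(one_shot, toy_available(one_shot)))
print(reserve_fractions(final, toy_available(final)))
```

In this toy setting the single split leaves the downstream turbine with a larger reserve fraction than the upstream one, while the repeated split settles on setpoints whose reserve fractions match.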

Exposure Bias as Epistemic Underidentification in Recursive Forecasting

Riku Green, Zahraa S. Abdallah, Telmo M Silva Filho — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12990v1 Announce Type: new Abstract: Recursive multi-step forecasting is usually framed as distribution shift: models are trained on observed histories but deployed on their own predictions. We show this framing is incomplete by proving that, under partial observability or state truncation, recursive rollout is also an epistemic underidentification problem. Even with deterministic latent dynamics, one-step Bayes supervision identifies behavior only on observed contexts and need not identify the deployed recursive predictor once rollout queries self-generated induced states whose correct local targets are not determined by numeric state alone. We formalize this with induced states $Z$ and provenance variables $P$, and derive a decomposition of induced-state error into teacher-forcing/rollout mismatch, representation--class approximation, and provenance information gaps. Empirically, we show that rollout enters a distinct induced-state regime, that fixed induced states define a distinct local corrective task, and that closed-loop gains arise not only from local adaptation but also from changing the induced states visited during rollout. Using a simple binary provenance encoding, provenance-aware correction can further improve performance, though gains are conditional rather than uniform. These results recast exposure bias as reasoning under self-induced epistemic uncertainty.

APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization

Yifan Zhao, Lang Qin, Jintai Chen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12991v1 Announce Type: new Abstract: Cyclic peptides represent a promising class of therapeutic compounds in modern drug discovery, often offering improved stability and binding affinity. However, the de novo design of cyclic peptides remains challenging because methods must identify pocket-adaptive cyclization patterns and linkage sites while simultaneously controlling drug-relevant properties. This challenge is particularly pronounced for recent generative models trained predominantly on linear peptide data, which may fail to capture cyclization-specific constraints. To address this limitation, we introduce APCyc, a target-aware de novo cyclic peptide generation framework that explicitly models cyclization and jointly optimizes multiple essential physicochemical properties. By using an expanded residue vocabulary and explicitly encoding cyclization-site and linkage-type information, APCyc learns cyclization-aware representations and leverages Bayesian posterior guidance to steer sampling toward cyclic peptides satisfying multiple property objectives. Experimental results demonstrate that our model learns target-dependent cyclization preferences and enables effective and controllable multi-property optimization for cyclic peptide design. The source code of this paper is available at https://github.com/HKUSTGZ-ML4Health-Lab/APCyc.

Charge as a Construct-Validity Factor in Chinese Legal Case Retrieval: A Cross-Benchmark Audit

Yao Liu, Tien-Ping Tan, Zhilan Liu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12993v1 Announce Type: new Abstract: Chinese Legal Case Retrieval (LCR) benchmarks grade a reference judgment relevant when its legal characterization matches the query, and strong systems now reach NDCG@10 of 0.85-0.88. Most of the BM25-to-best-trained gap is recoverable with no retrieval model: ranking candidates only by shared primary charge, broken by BM25, closes 99.2% of it on LeCaRDv2 -- with no detectable difference from the best-trained system. This reflects benchmark design: LeCaRDv2 defines top relevance via the crime's key constitutive elements, which encode the charge, so same-charge cases are relevant by construction (relevance lift 4.49; charge-to-relevance macro-AUC 0.871). Holding charge fixed, the trained reranker's advantage over BM25 collapses to a small within-charge residual (+0.026 NDCG@10, cluster-bootstrap CI excluding zero, about a quarter), the only non-definitional positive. The effect is not uniform: the same rule recovers 84.3% on LeCaRDv1 and is out of spec on CAIL2022, with the charge-to-relevance signal weakening in step (macro-AUC 0.871/0.759/0.728); a predicted-charge cascade reproduces 76.6% on LeCaRDv2 but does not transfer. The construct is also cashable at first stage: an exploratory zero-training charge-pool channel lifts LeCaRDv2 recall (R@100 +0.025, wrong-charge controls hurt), reported as a positive control for the confound, not a retrieval method or novelty claim. Charge is thus a high-leverage construct-validity factor at the benchmark level -- not a uniform explanation of NDCG@10, and not evidence that any system relies on charge. We package established construct-validity and partial-input checks as a reusable charge-controlled protocol (CCE); on all three benchmarks its triggers come back null or descriptive, behaving as designed. We release the scripts, schema, and protocol so future benchmarks can be screened before their NDCG@10 is read as legal-reasoning ability.
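
The charge-only rule described in this abstract (rank by shared primary charge, ties broken by BM25) can be sketched as follows. This is a minimal illustration, not the authors' released scripts: the field names `id`, `charge`, and `bm25`, the function name, and the toy candidates are all hypothetical.

```python
from typing import Dict, List


def rank_by_charge_then_bm25(query_charge: str, candidates: List[Dict]) -> List[str]:
    """Put same-charge candidates first; order within each group by BM25 score.

    Assumes each candidate dict has hypothetical keys: 'id', 'charge', 'bm25'.
    """
    ordered = sorted(
        candidates,
        # False sorts before True, so matching charges come first;
        # the negated BM25 score gives descending order inside each group.
        key=lambda c: (c["charge"] != query_charge, -c["bm25"]),
    )
    return [c["id"] for c in ordered]


# Toy data, invented for illustration only.
candidates = [
    {"id": "A", "charge": "theft", "bm25": 3.1},
    {"id": "B", "charge": "fraud", "bm25": 9.4},
    {"id": "C", "charge": "theft", "bm25": 7.8},
    {"id": "D", "charge": "fraud", "bm25": 1.2},
]
ranking = rank_by_charge_then_bm25("theft", candidates)
```

In this toy case the high-BM25 candidate `B` is ranked below both same-charge candidates, which is the behavior the audit isolates: no trained retrieval model is involved.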

DeepJEB++: Foundation Model-Driven Large-Scale 3D Engineering Dataset via 2D Latent Space Augmentation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12994v1 Announce Type: new Abstract: Data-driven engineering design is constrained by the lack of large-scale 3D datasets that pair geometry with physics-based performance labels. In particular, existing 3D data augmentation techniques have limitations in preserving subtle and diverse geometric variations, and it remains difficult to automate the subsequent simulation-labeling process, where boundary conditions vary depending on the generated geometry. We present DeepJEB++, a foundation-model-driven data-augmentation framework that expands a small seed set of jet engine brackets into a large, simulation-labeled 3D dataset under constrained resources. Our key idea is to augment in the data-rich 2D latent space, then transfer to 3D. In Stage 1, we fine-tune a pretrained 2D latent diffusion model on multi-view renders and synthesize novel views by latent interpolation, retaining manufacturable designs through a vision-language-model (VLM) quality filter. In Stage 2, the validated images are lifted to 3D meshes by a domain-adapted generative foundation model. In Stage 3, an automated pipeline recognizes the load and bolt interfaces on each mesh and assigns finite-element labels -- mass, stress, and displacement -- without manual intervention. We assess augmentation quality along three intrinsic axes: manufacturability, label fidelity against the SimJEB ground truth, and distributional consistency. Starting from fewer than 400 seed designs, DeepJEB++ yields 15,360 simulation-labeled 3D brackets -- a 40x expansion -- using a single GPU per stage. The dataset will be made publicly available to support reproducible engineering-AI research.

GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12995v1 Announce Type: new Abstract: Humanoid-Object Interaction (HOI) is a fundamental capability for humanoid robots, yet it remains challenging due to the tight coupling between dynamic balance and stable interaction with diverse objects. Existing methods often require time-consuming task-specific policy training or rely on rigid trajectory replay, which limits their ability to accommodate novel interaction scenarios. In this work, we present GenHOI, a simple yet effective framework that enables humanoid robots to perform diverse object-interaction tasks in a zero-shot manner by directly imitating a single generated video, without task-specific training or physical demonstration data. GenHOI first reconstructs the robot-object scene in simulation and renders a first-frame image, which, together with the language command, conditions the synthesis of a task-oriented interaction video. The generated video is then analyzed to identify interaction-relevant contact events and estimate hand-object contact regions, which are encoded as object-centric geometric constraints that convert visual interaction cues into physically grounded optimization priors. Guided by these priors, the reference motion recovered from the video is refined and smoothed to resolve the scale ambiguity inherent in 2D video generation, while adapting a single reference trajectory to unseen robot-object relative poses. The optimized trajectory is finally executed by a closed-loop tracking controller. We validate the proposed framework in extensive simulation and real-world experiments across diverse object-interaction tasks, including box grasping, asymmetric bimanual chair carrying, table lifting from below, and cylindrical-object enveloping.

Reliability of Probabilistic Emulation of Physical Systems

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12997v1 Announce Type: new Abstract: Two dominant approaches have emerged for generating probabilistic forecasts of physical systems: generative models, such as diffusion or flow matching; and ensembles of deterministic models with stochasticity injected, trained using the continuous ranked probability score (CRPS) loss. While both approaches have demonstrated strong predictive accuracy, the reliability of their uncertainties has not been systematically assessed. We address this gap by developing a framework to evaluate both approaches across diverse 2D spatiotemporal physical systems, under matched model size and computational budget. We assess the reliability of probabilistic emulation by inspecting the empirical coverage of predictive intervals, while also considering accuracy and computational efficiency metrics. CRPS-trained ensembles typically achieve more reliable uncertainties on both single-step prediction and autoregressive rollouts, demonstrating better coverage than the standard alternative of training generative models in a latent space. Moreover, the CRPS approach offers significantly faster inference. When generative models are trained in ambient rather than a compressed latent space, which is often infeasible for high-dimensional problems, they exhibit comparable coverage to CRPS-trained ensembles, though with substantially larger inference latency. In contrast, when CRPS-trained ensembles are trained in latent space they do not show a marked degradation in coverage with respect to ambient space. Both generative models and CRPS-trained ensembles demonstrate good predictive accuracy. To facilitate future research and application, we release AutoCast, a modular framework implementing both generative models and CRPS-trained ensembles, alongside AutoSim, a flexible dataset generation package for rapid prototyping.
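
The two quantities this abstract relies on, the CRPS of an ensemble and the empirical coverage of a predictive interval, can be illustrated with a small sketch. It is not taken from AutoCast; the function names are hypothetical, the CRPS uses the standard empirical ensemble estimator, and the interval is simplified to each ensemble's min-max envelope rather than a chosen quantile band.

```python
from typing import Sequence


def ensemble_crps(members: Sequence[float], obs: float) -> float:
    """Empirical CRPS: mean |x_i - y| minus half the mean pairwise |x_i - x_j|."""
    m = len(members)
    accuracy = sum(abs(x - obs) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


def empirical_coverage(
    ensembles: Sequence[Sequence[float]], observations: Sequence[float]
) -> float:
    """Fraction of observations that fall inside their ensemble's min-max envelope."""
    hits = sum(
        1 for ens, y in zip(ensembles, observations) if min(ens) <= y <= max(ens)
    )
    return hits / len(observations)


# Toy scalar forecasts, invented for illustration.
crps_value = ensemble_crps([1.0, 3.0], 2.0)
coverage = empirical_coverage([[1.0, 3.0], [0.0, 1.0], [5.0, 6.0]], [2.0, 2.0, 5.5])
```

A reliable forecaster is one whose empirical coverage matches the nominal level of the interval; coverage well below nominal indicates overconfident uncertainties.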

SoK: The Constant Time Model

Billy Bob Brumley — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13000v1 Announce Type: new Abstract: Constant-time programming patterns are the primary defense against timing attacks on cryptographic implementations, yet what "constant time" means varies across academia and industry. This work systematizes constant time models and their evolution, identifies a recurring gap between what models protect and what specifications assume, and distills an offensive methodology for discovering timing vulnerabilities that originate outside the cryptographic primitive boundary. Applying this methodology, we locate a specification-level vulnerability related to private key loading, and confirm the leak in both OpenSSL and BoringSSL. Counterintuitively, BoringSSL's per-observation signal is several orders of magnitude stronger than OpenSSL's, despite an explicitly stricter threat model.

CFALR: Collaborative Filtering-Augmented Large Language Model for Personalized Fashion Outfit Recommendation

Yujuan Ding, Junrong Liao, Yunshan Ma, Yi Bin, Wenqi Fan, Tat-Seng Chua, Qing Li — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13001v1 Announce Type: new Abstract: Personalized outfit recommendation poses a significant challenge in e-commerce and social media platforms, requiring systems that balance user preferences with aesthetic compatibility. Collaborative filtering (CF) provides a traditional solution for this, but it struggles with data-sparse scenarios and complex user-item-outfit relationships. Meanwhile, existing template-based approaches are constrained by rigid pre-designed structures. To bridge these research gaps, we introduce CFALR (Collaborative Filtering-Augmented Large Language Model for Recommendation), a novel framework that synergizes collaborative filtering with large language models for personalized outfit recommendation. Specifically, CFALR describes user-outfit interactions in natural language and leverages LLMs to capture fashion semantics while employing CF-enhanced embeddings to bridge the semantic space and the collaborative interaction spaces. Our technical contributions include: (1) the first LLM-based architecture specifically designed for personalized outfit recommendation, (2) a CF-augmented generative mechanism that efficiently navigates the extensive combination space of outfit items, and (3) trainable projection layers that optimally integrate relational and content features. Experiments on Polyvore and IQON benchmarks demonstrate CFALR's superior performance over both traditional CF-based and LLM-based methods in personalized fill-in-the-blank and personalized outfit generation tasks.

The Illusion of Multi-Agent Advantage

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13003v1 Announce Type: new Abstract: Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.
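
For readers unfamiliar with the single-agent baseline, the aggregation step of Chain-of-Thought with Self-Consistency amounts to a majority vote over the final answers of independently sampled reasoning chains. The sketch below shows only that vote; the sampled answers are hard-coded stand-ins for model outputs, and the function name is hypothetical.

```python
from collections import Counter
from typing import List


def self_consistency_vote(final_answers: List[str]) -> str:
    """Return the most frequent final answer across sampled reasoning chains."""
    counts = Counter(final_answers)
    # most_common(1) yields the (answer, count) pair with the highest count.
    return counts.most_common(1)[0][0]


# Stand-in for the final answers of five sampled chains (no model is called here).
sampled_answers = ["42", "41", "42", "7", "42"]
voted = self_consistency_vote(sampled_answers)
```

The cost of this baseline grows linearly with the number of sampled chains, which is the natural unit against which the reported "up to 10x more expensive" multi-agent runs can be compared.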

Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13006v1 Announce Type: new Abstract: Large language model (LLM)-based text-to-speech (TTS) systems enable prompt-conditioned emotional control but struggle with fine-grained emotion intensity due to the semantic-acoustic gap between text and speech. To address this challenge, we formulate emotion intensity control in LLM-based TTS as a learning-to-rank problem and propose Emo-LiPO, a listwise preference optimization framework that aligns prompt-conditioned speech generation with relative emotion intensity expressed in text. Emo-LiPO explicitly models global intensity ordering within each emotion under fixed transcripts, enabling more faithful and continuous emotional expression. We further construct ESD-plus, a multi-speaker dataset with explicit emotion intensity variations, to support fine-grained emotion modeling and evaluation. Experiments on ESD-plus demonstrate that Emo-LiPO significantly improves emotion accuracy and intensity controllability over both supervised- and DPO-based LLM TTS baselines, with particularly pronounced gains at high intensity levels.

scLLM-DSC: LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering for Single-Cell RNA Sequencing

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13007v1 Announce Type: new Abstract: Clustering is fundamental to scRNA-seq analysis, serving as a cornerstone for identifying cell populations and resolving tissue heterogeneity. However, existing methods focus on mining numerical statistical patterns, suffering from semantic agnosticism by neglecting the intrinsic biological functions encoded by genes. While Large Language Models (LLMs) offer promising semantic capabilities, their direct adaptation to cell clustering is hindered by the structural mismatch between generative pre-training objectives and discriminative downstream tasks. To bridge this gap, we propose scLLM-DSC, a novel LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering framework. Diverging from data-driven paradigms, scLLM-DSC establishes a semantically-grounded representation by synergizing two views: a Knowledge-Driven Semantic View derived from NCBI gene priors and contextualized Cell2Sentence embeddings, and a Structure-Aware Topological View extracted via a graph-guided encoder. Crucially, we introduce a cross-modal contrastive alignment mechanism to enforce consistency between biological semantics and transcriptomic features within a unified latent space. Extensive benchmarks demonstrate that scLLM-DSC significantly outperforms eleven state-of-the-art baselines in clustering accuracy.

Bounds and Constructions of Maximum Toroidal Distance Codes

Pengjie Zhong, Jinquan Luo, Yufeng Song — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13008v1 Announce Type: new Abstract: In lattice-based cryptographic schemes, both encoded messages and accumulated decryption noise are represented in a modulo $q$ space. Therefore, it is natural to study toroidal distances and maximum toroidal distance (MTD) codes. In this paper, we derive several upper bounds on the minimum toroidal distance of a code, including a Plotkin-type bound, a local ball--Plotkin bound, and a Delsarte linear programming bound. We also exhibit examples showing that these bounds are sharp in some cases. Moreover, we present several code constructions with good minimum distance, some of which are MTD codes. For $\ell=2$, we obtain a family of four-point MTD codes in $\mathbb Z_q^2$. For $\ell=4$, we propose a general code construction and exhibit several explicit instances for specific values of $q$, some of which are proven to be MTD codes. For $\ell=8$, using the $E_8$ lattice, we construct codes $C=2mE_8\cap \mathbb Z_q^8$, where $q=4m$, and show that they are MTD codes. These results give explicit optimal constructions of MTD codes for $\ell=2,4,8$. In the case $\ell=16$, we construct a code with minimum toroidal distance $3$ for $q=4$, while the known upper bound in this case is $2\sqrt{3}$. Our main tools are geometric and linear programming methods.
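
The abstract does not restate the definition of toroidal distance. The sketch below assumes the usual Euclidean distance on the torus $\mathbb Z_q^\ell$, where each coordinate difference is wrapped to the shorter way around modulo $q$; this definition, the function names, and the four-point example are assumptions for illustration and are not taken from the paper.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple


def toroidal_distance(x: Sequence[int], y: Sequence[int], q: int) -> float:
    """Euclidean distance on Z_q^l with each coordinate difference wrapped mod q."""
    total = 0
    for a, b in zip(x, y):
        d = (a - b) % q
        d = min(d, q - d)  # take the shorter way around the cycle
        total += d * d
    return math.sqrt(total)


def minimum_toroidal_distance(code: Sequence[Tuple[int, ...]], q: int) -> float:
    """Smallest pairwise toroidal distance among the codewords."""
    return min(toroidal_distance(x, y, q) for x, y in combinations(code, 2))


# Hypothetical four-point code in Z_8^2; not claimed to be an MTD code.
example_code = [(0, 0), (4, 0), (0, 4), (4, 4)]
d_min = minimum_toroidal_distance(example_code, q=8)
```

Under this assumed definition, a coordinate can contribute at most $\lfloor q/2 \rfloor$ to the distance, which is why wrap-around matters when bounding the minimum distance of a code.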

Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13016v1 Announce Type: new Abstract: Spiking neural networks (SNNs) are promising for energy-efficient inference, and time-to-first-spike (TTFS) coding is especially attractive because each neuron fires at most once. In practice, however, this benefit is often reduced by the cost of computing a temporal decay term and multiplying it by the synaptic weight. We address this issue by turning a physical hardware "bug," the natural signal decay in optoelectronic devices, into the main computation of TTFS, named Otters++. Specifically, we use the measured decay of a custom In$_2$O$_3$ optoelectronic synapse to directly realize the TTFS temporal term, removing the need for explicit digital decay computation. To scale this idea to Transformer models, we establish a layer-wise functional equivalence between Otters++ and a quantized neural network (QNN), and develop a hybrid training method that uses device-faithful SNN computation in the forward pass and QNN straight-through gradients through the equivalent QNN path in the backward pass, together with model distillation. This avoids differentiation through discrete first-spike events and reduces the over-sparsity problem in direct TTFS-SNN training. We further make training aware of measured device noise by sampling run-to-run variation, and refine the system-level energy model by accounting for device sharing and multi-hop communication. On the GLUE dataset, Otters++ improves the average score to 84.17% while maintaining a clear energy advantage over prior spiking Transformer baselines. These results show that physically grounded TTFS computing can be efficient, trainable, and robust under realistic hardware effects.

SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

Pierre Beckmann, Marco Valentino, Andre Freitas — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13020v1 Announce Type: new Abstract: Three paradigmatic forms of inference recur across scientific reasoning: deduction, induction, and causal abduction. Reliably evaluating LLMs on these in scientific settings is currently out of reach: scientific benchmarks built on human annotations are costly and lack mechanistic ground truth, while synthetic logical-reasoning benchmarks do not resemble real scientific documents. We introduce SciR, a benchmark that combines multi-paradigm reasoning with controllable scientific rendering, anchored on three paradigmatic scientific problems. Tasks are generated from formal objects (deduction tree, inductive rule hypothesis, causal graph) to guarantee verifiable answers, then rendered into multi-document scientific discourse via per-track domain-tuned genres. The construction lets us independently vary two difficulty axes: how hard it is to extract the key information needed for inference, and how hard the principled inference itself is. We test six models. Both axes hurt every model, and their effects compound. The rendering even hurts neurosymbolic pipelines, which hand inference to a verified solver. The two axes yield a per-model extraction-vs-inference profile: for instance, reasoning models like deepseek-r1 mostly surpass non-reasoning instruct models on the inference axis. To our knowledge, SciR is the first multi-paradigm scientific-reasoning benchmark with parametric control on both extraction and inference difficulty.

Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition

Ziyi Chang, Kanglei Zhou, Xiaohui Liang, Hubert P. H. Shum — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13022v1 Announce Type: new Abstract: Adversarial attacks on skeleton-based human action recognition (S-HAR) have received significant attention. However, existing methods typically introduce noise-like perturbations that degrade motion quality after the attack, and are therefore inherently perceptible given recent advancements in S-HAR systems. We discover that this degradation stems from the gap between empirical and true risks during the optimization process of previous adversarial attacks. To address this issue, we propose an attack in which adversarial motions are obtained without compromising their motion quality. To minimize the risk gap and preserve motion quality, we propose a distribution-based adversarial attack method that does not introduce noise-like perturbations. To faithfully evaluate motion quality, we propose a new metric that aligns with human perception of real-world naturalness. Experiments on state-of-the-art S-HAR methods across two datasets demonstrate the superiority of our method in both attack success rate and post-attack motion quality through qualitative and quantitative analyses. The success of our quality-preserving attack application and distribution-based method raises serious concerns about the robustness of action recognizers, highlighting the need for further enhancements in this domain.

Technical Supplement Report on Full-Duplex FBMC/QAM MIMO Systems: Transceiver Design and Optimization

Sudhakar Rai, Prem Singh, Ekant Sharma, Aditya K. Jagannatham, Lajos Hanzo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13023v1 Announce Type: new Abstract: This technical report presents the design and analysis of filter bank multicarrier (FBMC)/QAM multi-user MISO systems. We describe the complete uplink and downlink signal processing chains and characterize the end-to-end effective channel, including inter-carrier interference, inter-symbol interference, intrinsic interference, and residual self-interference. We compare FBMC/QAM with CP-OFDM and FBMC/OQAM through the lens of the Balian-Low theorem, and analyze prototype filter choices (PHYDYAS, Type-I, and Type-II), including the interference power breakdown under MRT and ZF precoding. Furthermore, we present an online stochastic successive convex approximation framework for ergodic sum-rate maximization with closed-form power updates, and contrast it with offline Monte Carlo-based approaches. Simulation results demonstrate the BER and network spectral efficiency advantages of FBMC/QAM over CP-OFDM under residual carrier frequency offset.

CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13024v1 Announce Type: new Abstract: Granger Causal Discovery (GCD) is fundamental for analyzing temporal dependencies in complex systems. However, existing neural GCD methods predominantly rely on a "one-size-fits-all" paradigm, struggling to capture distribution shifts and dynamic regime changes inherent in real-world time series. This often leads to entangled representations and spurious causal graphs. In this paper, we propose CausalMoE, a billion-scale multimodal Granger causal foundation model that explicitly models patch-level heterogeneity. CausalMoE introduces a Pattern-Routed Mixture of Heterogeneous Experts, which dynamically identifies latent temporal patterns and routes patches to specialized domain experts, effectively decoupling regime-specific mechanisms from shared dynamics. To ensure interpretable graph recovery, we design a Causality-Aware Self-Attention mechanism operating across variables, yielding sparse Granger causal graphs via proximal optimization. Furthermore, CausalMoE is the first to integrate LLMs and VLMs to align numerical signals with textual and visual priors, regularizing causal estimation in complex scenarios. Extensive experiments demonstrate that CausalMoE establishes a new state-of-the-art on fully supervised benchmarks, while effectively generalizing to few-shot settings where traditional methods fail.

Democracy in the Era of Artificial Intelligence

Evangelos Pournaras, Srijoni Majumdar, Carina Hausladen, Dirk Helbing — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13026v1 Announce Type: new Abstract: Interfacing Artificial Intelligence (AI) with democracy is one of the most profound challenges of our times. On the one hand, AI comes with opportunities to overcome long-standing challenges in democracy, such as low participation in deliberative and voting processes with poor representation of people. On the other hand, new risks arise from AI algorithms that are privacy-intrusive, biased, manipulative, spread misinformation and influence election results. Moving beyond the over-simplistic question of whether AI is good or bad for democracy, the Handbook on Democracy in the Era of Artificial Intelligence asks instead: how to upgrade democracies and the principles they are built on, using AI? How to engage with AI and on what terms? Which new values and design principles are required to build democratic resilience? In 34 chapters by 59 authors across the world from different disciplines, we explore how AI can empower collective intelligence for democracy (Part 1) and what is the future of deliberative democracy using large language models and social media (Part 2). We also illustrate the role of AI for building resilient self-governance systems (Part 3) and the challenges of transforming democracy in the age of AI (Part 4). We conclude with broader perspectives (Part 5) that re-imagine the interplay of democracy and AI.

Comparing Commercial Depth Sensor Accuracy for Medical Applications

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13028v1 Announce Type: new Abstract: Depth estimation has numerous medical and surgical applications. We benchmark four depth sensors on a porcine bone specimen, a porcine belly specimen, and a silicone kidney phantom using stylus-sampled references. These objects contain several real-world challenges, including homogeneous surfaces, specular surfaces, and subsurface scattering. The comparison includes stereo, structured-light, and time-of-flight sensors at a distance of approximately 50 cm. Specifically, the Intel RealSense D405 (Intel RealSense, United States), PMD Flexx2 (pmdtechnologies, Germany), Stereolabs ZED 2i (Stereolabs, France), and Zivid 2M+ 60 (Zivid, Norway) are compared. The Zivid 2M+ 60 performed best across all objects and metrics considered in this work. The ZED ranked second for real tissue, but last on the phantom.

A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition

Haoran Zhang, Haokun Zhang, Pengyu Liu, Yujia Zhang, Weibao Xue, Yanbin Hao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13030v1 Announce Type: new Abstract: Micro-gestures (MGs) are spontaneous and subtle body movements that frequently convey hidden human emotions. Recognizing MGs in untrimmed videos remains highly challenging due to their extremely low signal-to-noise ratio, severe long-tailed class distribution, and the inherent domain shift encountered in cross-subject evaluation scenarios. In this paper, we propose a comprehensive multi-modal framework for Track 1 of the 4th MiGA-IJCAI Challenge. To capture fine-grained representations, we design a saliency-guided multi-modal extraction pipeline integrating 68-keypoint skeleton joint coordinates, 3D heatmap volumes, and high-resolution RGB visual features. We introduce a gentle square-root smoothed weighting mechanism paired with an Orthogonal Semantic Embedding Loss to protect tail classes without compromising overall recognition capabilities. More importantly, to bridge the cross-subject generalization gap, we propose a Cross-Modal Pseudo-Labeling (CMPL) strategy for unsupervised domain adaptation, which significantly boosts single-modal robustness. A temperature-scaled soft-voting mechanism is finally utilized to alleviate overconfidence during late fusion. Extensive experiments demonstrate that our framework achieves a competitive F1-score of 68.13%, securing 4th place.
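
Two of the components named in this abstract lend themselves to a short sketch: square-root smoothed class weighting and temperature-scaled soft voting. The abstract gives no formulas, so the forms below are assumptions: weights proportional to 1/sqrt(class count), normalized to mean 1, and fusion by averaging per-modality softmax outputs computed at temperature T. All names and numbers are hypothetical.

```python
import math
from typing import List, Sequence


def sqrt_smoothed_weights(class_counts: Sequence[int]) -> List[float]:
    """Class weights proportional to 1/sqrt(count), rescaled to have mean 1."""
    raw = [1.0 / math.sqrt(c) for c in class_counts]
    scale = len(raw) / sum(raw)
    return [r * scale for r in raw]


def softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_modality_logits: Sequence[Sequence[float]], temperature: float = 2.0) -> List[float]:
    """Average the temperature-softened class probabilities of each modality."""
    probs = [softmax(logits, temperature) for logits in per_modality_logits]
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(n_classes)]


# Toy inputs: a head class with 100 samples and a tail class with 4 samples.
weights = sqrt_smoothed_weights([100, 4])
# Toy logits for three modalities (e.g. skeleton, heatmap, RGB) over three classes.
fused = soft_vote([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.5, 0.0, 0.0]])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

With these toy counts the tail class is weighted 5x the head class, compared with 25x under plain inverse-frequency weighting, which is the sense in which square-root smoothing is "gentle". A temperature above 1 flattens each modality's distribution before averaging, so a single overconfident modality dominates the fused prediction less.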

GeoCFNet: Geometry-Aware Confidence Field Network for Robot-Assisted Endoscopic Submucosal Dissection

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13032v1 Announce Type: new Abstract: Advanced surgical robotics has made robot-assisted endoscopic submucosal dissection (ESD) a promising approach for the en-bloc resection of large lesions, with the potential to reduce recurrence and improve long-term outcomes. However, the technical complexity and risk of complications in ESD demand stable and precise visual guidance to maintain an accurate dissection corridor and a safe tissue margin. Dense confidence fields provide an effective representation for this purpose by describing both the preferred dissection region and its spatial transition to surrounding tissue. However, reliable confidence field estimation remains challenging in dynamic endoscopic scenes due to smoke, specular highlights, tissue deformation, weak texture, and the thin geometric structure of the target region. To address these challenges, we formulate dissection guidance as a geometry-aware confidence field estimation problem and propose GeoCFNet, a geometry-aware confidence field network built on a pretrained DINOv3 backbone. GeoCFNet integrates a Token-Differentiated Fusion module to aggregate class-token context with dense patch representations, a SegFormer decoder for confidence regression, and Geometry-Aware Spatial Regularization (GASR) to preserve spatial coherence and local geometric transitions. Experimental results show that GeoCFNet achieves RMSE 0.0480, PSNR 27.1995, SSIM 0.3397, and CC 0.2466, indicating accurate and geometrically stable confidence field estimation for robot-assisted ESD guidance.

SAM-Deep-EIoU: Selective Mask Propagation for Multi-Object Tracking

Alexander Holmberg — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13033v1 Announce Type: new Abstract: Multi-object tracking has a heavy-tailed difficulty distribution: most frames are easy for a lightweight base tracker, while a small fraction are intrinsically hard. Video object segmentation (VOS) models can often preserve identity through the hard frames where the base tracker fails, but they are much more expensive in compute and memory. We propose selective mask propagation, a tracking algorithm that dispatches from a base tracker to a VOS model only on windows where an assignment-uncertainty signal fires. The base tracker's output is modified only when the VOS model makes a confident prediction that contradicts the base tracker's identity assignment; weak or inconclusive predictions preserve the base output. The method is training-free, treats both the base tracker and the VOS model as black boxes, and can benefit from replacing the VOS component with a more capable model. On DanceTrack, selective mask propagation improves three different base trackers. On SportsMOT, where identity preservation is central to sports analytics, SAM3-Deep-EIoU with global track association achieves state-of-the-art performance on the benchmark with 86.8 HOTA.
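
The override rule described in this abstract can be sketched for a single track as below. This is an illustrative reading, not the author's implementation: the uncertainty flags, the confidence threshold of 0.9, and the per-frame list layout are hypothetical, and the VOS outputs are given as precomputed values (with `None` where the VOS model was not dispatched).

```python
from typing import List, Optional


def selective_override(
    base_ids: List[int],
    vos_ids: List[Optional[int]],
    vos_conf: List[float],
    uncertain: List[bool],
    conf_threshold: float = 0.9,  # hypothetical value
) -> List[int]:
    """Keep the base tracker's identity unless a confident VOS prediction contradicts it."""
    out = []
    for base, vos, conf, flagged in zip(base_ids, vos_ids, vos_conf, uncertain):
        # Override only inside flagged windows, only when the VOS model is
        # confident, and only when it disagrees with the base assignment.
        if flagged and vos is not None and conf >= conf_threshold and vos != base:
            out.append(vos)
        else:
            out.append(base)
    return out


# Toy per-frame identities for one track over four frames.
final_ids = selective_override(
    base_ids=[1, 1, 2, 2],
    vos_ids=[None, 1, 1, 1],
    vos_conf=[0.0, 0.95, 0.97, 0.5],
    uncertain=[False, True, True, True],
)
```

In the toy run the third frame is corrected because the VOS prediction is confident and contradicts the base tracker, while the fourth frame keeps the base output because the VOS prediction is weak.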

TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13035v1 Announce Type: new Abstract: Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. In this paper, we propose TetherCache, a training-free and plug-and-play cache management strategy for drift-resistant long video generation. TetherCache organizes the cache into sink, memory, and recent regions, and introduces two complementary mechanisms. First, GRAB (Gated Recall with Attention-Diversity Balancing) selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity, preserving informative yet diverse historical context under a fixed cache budget. Second, TAME (Trusted Alignment via Memory Editing) lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution, reducing the pollution caused by drifted historical features. Built on Self-Forcing, TetherCache consistently improves long-video generation quality on VBench-Long across 30s, 60s, and 240s settings. In particular, for 240s generation, it substantially improves overall and semantic scores while reducing quality drift from 7.84 to 1.33, demonstrating its effectiveness for stable long-horizon autoregressive video diffusion.
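
The abstract describes TAME as lightly editing recalled memory tokens by aligning their statistics to a trusted context distribution, without giving the exact operation. One plausible reading, shown here purely as an assumption, is mean and standard-deviation matching with a blending strength so that the edit stays light. The function name, the `strength` parameter, and the scalar features are hypothetical.

```python
import statistics
from typing import List, Sequence


def align_statistics(
    recalled: Sequence[float],
    trusted: Sequence[float],
    strength: float = 0.5,  # 0 leaves features untouched, 1 matches statistics fully
    eps: float = 1e-6,
) -> List[float]:
    """Shift and rescale recalled features toward the trusted mean and std, then blend."""
    mu_r, sd_r = statistics.fmean(recalled), statistics.pstdev(recalled)
    mu_t, sd_t = statistics.fmean(trusted), statistics.pstdev(trusted)
    aligned = [(x - mu_r) / (sd_r + eps) * sd_t + mu_t for x in recalled]
    # Interpolate between the original and the fully aligned features.
    return [(1.0 - strength) * x + strength * a for x, a in zip(recalled, aligned)]


# Toy scalar features standing in for one channel of recalled and trusted tokens.
drifted = [0.0, 2.0, 4.0]
reference = [10.0, 11.0, 12.0]
fully_aligned = align_statistics(drifted, reference, strength=1.0)
untouched = align_statistics(drifted, reference, strength=0.0)
```

In a real cache the same operation would presumably be applied per channel to key and value tensors; that detail is not stated in the abstract and is left out of the sketch.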

Active Sensing-assisted UAV Communications with Jittering: Framework and Performance Analysis

Guangji Chen, Long Shi, Qingqing Wu, Qiaoyan Peng, Caihong Kai — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13036v1 Announce Type: new Abstract: Providing reliable communication for unmanned aerial vehicles (UAVs) via existing cellular networks is crucial for enabling the rapid growth of the low-altitude economy. However, UAV jittering significantly degrades communication quality due to induced beam misalignment. Inspired by recent advances in integrated sensing and communication, we propose a novel two-stage active sensing-assisted communication framework tailored for ground-to-UAV links with jittering. Specifically, two schemes are conceived to leverage sensing for enhancing communication performance, namely the communication-oriented scheme and the sensing-oriented scheme. For the sensing-oriented scheme, deterministic signals are employed in the first stage to facilitate angle-of-arrival (AoA) acquisition at the UAV side, followed by pure communication service in the second stage by using the estimated AoA. In contrast, the communication-oriented scheme employs Gaussian information-bearing signals throughout both stages, with AoA estimation relying on Gaussian random signals. For both schemes, we provide maximum likelihood estimators for AoA, along with analytical results characterizing the Cramér-Rao bound. To capture the performance limit, closed-form expressions for the achievable rates of the two schemes are derived, unveiling a fundamental tradeoff between sensing and communication quality across the two stages by tuning the time allocated to the first stage. The optimal time allocation that maximizes the overall rate is obtained in semi-closed-form. Based on these results, we unveil a sufficient condition under which the communication-oriented scheme outperforms the sensing-oriented scheme, which admits an interesting threshold-based structure. Asymptotic analysis demonstrates that the performance loss of the proposed schemes relative to the jitter-free upper bound approaches zero in the high transmit power regime.

DIG: Oracle-Guided Directed Input Generation for One-Day Vulnerabilities

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13037v1 Announce Type: new Abstract: One-day vulnerabilities pose significant risks due to delayed or incomplete patch adoption. Generating proof-of-concept (PoC) inputs is therefore essential for assessing real-world impact. The key challenge is identifying necessary constraints for triggering the vulnerability and solving them effectively. Existing directed fuzzing approaches prioritize inputs toward target locations, but neither explicitly identify necessary constraints nor solve them effectively, relying instead on target-distance feedback and random mutation. Agentic approaches show strong potential through code reasoning and structured input generation, but goal drift in long-horizon reasoning limits their effectiveness. DIG addresses this challenge by exploiting a key property of one-day vulnerabilities: patches often reveal necessary preconditions for triggering. DIG uses an LLM to analyze the patch and synthesize an oracle making these conditions explicit. The oracle supports effective PoC generation at two levels. At the high level, DIG performs oracle-guided generator evolution, where an agent infers and solves constraints to satisfy the oracle. At the low level, DIG instruments the oracle into the target program and uses branch-distance feedback to guide random mutation in directed fuzzing. Evaluation shows DIG outperforms 2 state-of-the-art agents and 10 fuzzers across 138 real-world CVEs. DIG triggers 80 vulnerabilities, surpassing prior results and outperforming the best baseline by 40% (57 vs. 80 CVEs). Notably, DIG exclusively triggers 9 vulnerabilities no existing technique can trigger. Compared to the average of other tools, DIG triggers vulnerabilities faster in 92.9% of cases, achieving over 100x speedup in 48.8% of cases, with a maximum speedup of 3,664x. Beyond one-day PoC generation, DIG uncovers 6 previously unknown vulnerabilities in widely deployed libraries, enabling zero-day discovery.
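
To make the low-level idea concrete, here is a small sketch of branch-distance guidance toward an oracle condition. It is not DIG's implementation: the oracle condition (`length_field > capacity`), the capacity of 64, and the fixed mutation deltas are hypothetical, and the mutations are deterministic rather than random so that the run is reproducible.

```python
def branch_distance_gt(a: int, b: int) -> int:
    """Classic branch distance for `a > b`: 0 when true, else how far off."""
    return 0 if a > b else (b - a) + 1


def oracle_distance(length_field: int, capacity: int = 64) -> int:
    # Hypothetical precondition inferred from a patch that added a bounds
    # check: the bug can only trigger when length_field exceeds capacity.
    return branch_distance_gt(length_field, capacity)


def guided_search(seed: int, max_steps: int = 1000) -> int:
    """Keep the mutation that most reduces the distance to the oracle."""
    current = seed
    for _ in range(max_steps):
        distance = oracle_distance(current)
        if distance == 0:
            return current  # oracle satisfied: candidate PoC input
        # Deterministic stand-ins for random mutations of the input field.
        candidates = [current + delta for delta in (-16, -1, 1, 16)]
        best = min(candidates, key=oracle_distance)
        if oracle_distance(best) >= distance:
            break  # no mutation improves the distance; give up
        current = best
    return current
```

Starting from `guided_search(0)`, the distance shrinks at every step until the input crosses the capacity, which is the kind of gradient that plain target-distance feedback does not provide.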

Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior

Haowei Qian — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13038v1 Announce Type: new Abstract: As LLM agents proliferate in prediction markets and collective decision-making, they risk a cognitive monoculture: agents built on shared foundation models produce correlated forecasts, and recent measurement finds frontier-model errors correlated at r ~ 0.77. We ask whether human cognitive diversity can be recovered from behavior and transferred to LLM agents. Nous extracts a structured eight-dimension behavioral profile from real Polymarket trading activity and injects it into agents through prompts. Our central finding is a dissociation between the two halves of that pipeline. Extraction works, partially: across 100 wallets, 8 of 14 parameters are temporally stable (split-half ICC >= 0.5, bootstrap CI lower bound > 0.3; contrarian score reaches ICC ~ 0.9); wallets are identifiable from their profiles well above chance (top-1 retrieval 17-22% vs. 1% chance); and two of four pre-specified dimensions rank-correlate with future realized profit out-of-sample, though the correlations do not survive behavioral-confound controls. Prompt-level injection does not measurably transmit it: on a semantic embedding metric, structured injection shows no significant advantage over a length-matched control on any model, and the diversity it induces neither reduces ensemble error correlation nor improves Brier score -- a null that persists across exploratory checks on sampling temperature, profile diversity, and question difficulty. Measuring the prompts themselves locates the compression before the model: the structure-to-narrative translator emits near-uniform prompts whose spread does not track profile spread. We position Nous as measuring the cognitive-monoculture problem and the limits of a prompt-level remedy, motivating deeper, below-the-prompt injection (fine-tuning, activation steering). Code, frozen profiles, prompts, and model outputs: https://github.com/WillChienT/nous-paper
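
The two ensemble metrics named in this abstract, Brier score and error correlation, can be computed as below. This is a generic sketch: the Brier score follows its standard definition, while the use of signed errors with a Pearson coefficient is an assumption about how the paper measures error correlation.

```python
from math import sqrt


def brier_score(forecasts, outcomes):
    """Mean squared difference between probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def error_correlation(forecasts_a, forecasts_b, outcomes):
    """Correlation between the signed forecast errors of two agents."""
    errors_a = [p - o for p, o in zip(forecasts_a, outcomes)]
    errors_b = [p - o for p, o in zip(forecasts_b, outcomes)]
    return pearson(errors_a, errors_b)
```

Two agents with identical forecasts have an error correlation of 1, which is the monoculture extreme; averaging their forecasts then gives no reduction in ensemble error.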

Fault Lines: Navigating Ethics and Responsible AI Where National Policy Meets Local Practice in Public Sector Transformation

Sitong Lyu, Shabnam Taghiyeva, Mohit Kukadia, Denis Newman-Griffis — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13039v1 Announce Type: new Abstract: The UK government has adopted a pro-AI stance to help transform public service delivery in the face of severe financial pressures, but the path to translate this vision into responsible AI practice remains ill-defined. While UK policy is often set at the national level, local authorities are responsible for most public service delivery, and the rapid advance of AI-first narratives in the public sector is exposing fault lines in knowledge and practice at this national-local interface. This paper examines how responsible AI is interpreted and implemented at the interface between the UK's central government and local authorities, taking the high-stakes area of Special Educational Needs and Disabilities (SEND) as a case study. We present a thematic analysis of 17 semi-structured interviews with policymakers, practitioners, and third-sector professionals to identify barriers and enabling conditions for responsible AI where national policy meets local practice. We identify five interconnected challenges facing local authorities: shadow usage of AI and data privacy risks, market-government asymmetry in AI provision, insufficient workforce readiness, a lack of standardised definitions and measurements, and gaps in human accountability. For each, participants proposed actionable steps, from strengthening data protection frameworks and rebalancing the market-government relationship to enhancing workforce capacity. Our examination of SEND brings these challenges into sharper focus, showing how high-stakes decisions affecting vulnerable children and families intensify tensions around accountability, fairness, and human oversight, exposing the limits of a principle-based regulatory approach. We argue that responsible public sector AI requires both national policy adjustments and structural reforms to institutional capacity, values, and governance mechanisms at the local level.

RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13040v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly explored as visual critics, reward generators, and failure detectors in robotic manipulation. These roles implicitly require models to judge not only final task success, but also how a manipulation execution is physically and temporally progressing. However, existing evaluations fail to test whether VLMs possess fine-grained process understanding. To address this gap, we present RoboProcessBench, a benchmark for process-aware understanding in vision-language robotic manipulation. RoboProcessBench decomposes such capability into two complementary dimensions, static monitoring and dynamic reasoning, instantiated as 12 diagnostic question families covering phase, contact, motion, coordination, primitive-local progress, temporal order, outcome, and primitive-level transitions. Built from physically grounded execution traces, the curated benchmark corpus ProcessData contains ~58k question-answer pairs across 260 manipulation tasks, which is further split into ProcessData-SFT and ProcessData-Eval for post-training and evaluation purposes. Extensive evaluation of various VLMs on ProcessData-Eval reveals broad limitations across 12 diagnostic task families, suggesting current models still lack robust process-aware understanding of manipulation executions. But with ProcessData-SFT, the post-trained Qwen2.5-VL-7B and InternVL-3-8B exhibit consistent gains on local state, motion, progress, and primitive-aware cues. These results demonstrate that RoboProcessBench serves as both an evaluation benchmark and a learnable supervision source for developing VLMs capable of monitoring and evaluating robotic manipulation processes. Project webpage: https://processbench-2026.github.io/RoboProcessBench-Web/

SeamEdit: A Black-Box VLM-Agnostic Pipeline for Large-Image Semantic Editing

Xiangyu Lyu, Dan Lei — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13041v1 Announce Type: new Abstract: Semantic region editing for large images must satisfy two requirements at the same time: high generative quality and natural integration with surrounding content. Some related methods rely on white-box models and leave the strong generation capability of closed-source models underexplored. Directly applying closed-source models to tiled editing, however, introduces several failure modes: semantic deformation, canvas-level alignment drift, and visible seam artifacts. This paper presents SeamEdit, a training-free and model-agnostic pipeline that treats any VLM with inpainting capability as a black-box oracle. SeamEdit mitigates these issues through a five-stage post-hoc pipeline: overlay-based tile decomposition, black-box VLM inpainting, geometric and color-consistency correction, seam-risk-based multi-candidate ranking, and dynamic-programming curved seam fusion. The pipeline reduces seam visibility and supports semantic modification of arbitrary tile regions.

Augmentation techniques for video surveillance in the visible and thermal spectral range

Vanessa Buhrmester, Ann-Kristin Grosselfinger, David Munch, Michael Arens — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13042v1 Announce Type: new Abstract: In intelligent video surveillance, cameras record image sequences during day and night. Commonly, this demands different sensors, and it is not unusual to combine them to achieve better performance. We focus on the case in which a long-wave infrared camera records continuously while another camera records in the visible spectral range during daytime, and an intelligent algorithm monitors the captured imagery. More precisely, our task is multispectral CNN-based object detection. At first glance, images from the visible spectral range differ from thermal infrared ones in that they contain color and distinct texture information but no information about the thermal radiation emitted by objects. Although color can provide valuable information for classification tasks, effects such as varying illumination and the peculiarities of different sensors still represent significant problems. Moreover, obtaining sufficient and practical thermal infrared datasets for training a deep neural network remains a challenge. For this reason, training with the help of data from the visible spectral range could be advantageous, particularly if the data to be evaluated contains both visible and infrared imagery. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques...

No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13044v1 Announce Type: new Abstract: As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also reveal that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions, but the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.

Y-BotFrame: An Extensible Embodied Agent Framework for Quadruped Robot Assistants

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13049v1 Announce Type: new Abstract: Quadruped robots are capable of traversing a wide range of complex terrains with high flexibility. As highly mobile ground-based intelligent platforms, they can be equipped with modules for navigation control, environmental perception, and intelligent interaction, thereby serving as real-world mobile deployment platforms for various algorithms. In this paper, we introduce Y-BotFrame, an extensible embodied platform that turns a robot into an intelligent ground assistant. Y-BotFrame integrates multimodal perception capabilities, including speech, vision, and LiDAR, and employs a large language model as the cognitive core for environmental understanding, contextual reasoning, and task planning. The system maps user natural-language instructions into executable embodied task units that can be carried out by the robot. Y-BotFrame supports natural interaction through voice commands and visual feedback, removing the need for a remote controller and enabling efficient human-robot collaboration. With a highly extensible framework, Y-BotFrame supports plug-and-play integration of new functional modules as well as modular upgrades and iterative development, offering a reference implementation for the real-world deployment of general-purpose, instruction-driven embodied agents. The supplementary video is available at https://xdei-group.github.io/Y-BotFrame/.

AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13051v1 Announce Type: new Abstract: Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and their associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed, where we manually annotated entities and their relationships. First, AAbAAC was used to evaluate several methods on the task of named entity recognition (NER), and secondly, to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at https://github.com/f-maury/AAbAAC.

EA-WM: Event-Aware World Models with Task-Specification Grounding for Long-Horizon Manipulation

Kailin Wang, Haoxiang Jie, Yaoyuan Yan, Jiacheng Zhou, Zhiyou Heng — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13053v1 Announce Type: new Abstract: Pretrained-feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine whether an imagined future satisfies task-relevant events. Long-horizon manipulation requires progress signals that are relational, predicate-level, and physically grounded: whether an object has moved, whether a drawer or contact state has changed, whether a placement predicate is satisfied, and whether a candidate future is reliable enough for execution. We introduce EA-WM, an event-aware world-model framework that augments frozen visual-feature dynamics with task-specification-grounded event prediction and verification. EA-WM rolls out candidate futures in pretrained visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. The verifier guides sampling-based planning, gates candidate actions, and, in the contact-sensitive LIBERO wine-rack setting, selects among PPO-generated proposals. Across navigation, deformable-object, wall-constrained, and language-described manipulation studies, EA-WM shows that event-aware verification can make feature-space world models more interpretable and better aligned with task progress.

TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Xing Hu, Zhe Jiang, Dawei Yang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13054v1 Announce Type: new Abstract: Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4, while delivering significant inference acceleration. The code is available at .
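
For readers unfamiliar with the "1.58-bit" figure: a ternary weight takes one of three values, so it carries log2(3) ≈ 1.585 bits. The sketch below shows a simple absmean ternary quantizer as a generic baseline. It is deliberately not TWLA's E2M-ATQ, whose details are not given in the abstract, and the function name is hypothetical.

```python
from math import log2


def ternarize_absmean(weights, eps=1e-8):
    """Map weights to codes in {-1, 0, +1} with one shared scale.

    Baseline scheme: scale by the mean absolute weight, round to the
    nearest integer, and clip to [-1, 1]. Dequantize as code * scale.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale


# Information content of one ternary symbol, in bits.
BITS_PER_TERNARY_WEIGHT = log2(3)
```

A per-tensor scale like this is sensitive to heavy-tailed distributions, which is the kind of difficulty the abstract says motivates reshaping the weights before ternarization.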

Three-term Recurrence Relation with Arbitrary Degree Step for Orthogonal Polynomials

Bo Yang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13056v1 Announce Type: new Abstract: An approach to generating three-term recurrence relations with arbitrary degree step is proposed. Specifically, given any class of orthogonal polynomials $\{Q_{p}(x)\}_{p=0}^{\infty}$ defined by Favard's theorem, we employ the adjacent members $Q_{p}(x)$ and $Q_{p-1}(x)$ to compute the higher-degree member $Q_{p+s}(x)$ and the lower-degree member $Q_{p-t}(x)$, where $s,t\in \mathbb{N}^{+}$. A three-term recurrence relation among $Q_{p+s}(x)$, $Q_{p}(x)$ and $Q_{p-t}(x)$ can therefore be derived by taking $Q_{p-1}(x)$ as the bridge. Moreover, as extensions of the standard recursive formula, which proceeds in the direction of increasing degree, formulas for the degree-decreasing and end-to-middle directions are formulated as well. Explicit recurrence relations with two-degree step are given for Hermite, Gegenbauer and Legendre polynomials. A precision comparison is also performed between the standard three-term recurrence relation and the proposed ones.
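
As an illustration of what a two-degree-step relation looks like, the sketch below links the Legendre polynomials P_{n+2}, P_n and P_{n-2}. It is obtained by applying the standard recurrence (n+1) P_{n+1} = (2n+1) x P_n - n P_{n-1} twice to eliminate the odd neighbours. The paper's explicit formulas are not given in the abstract and may take a different form, so treat this as an assumption-laden example rather than the authors' result; the function names are hypothetical.

```python
from fractions import Fraction


def legendre_standard(n_max, x):
    """Return [P_0(x), ..., P_n_max(x)] via the standard one-step recurrence."""
    values = [Fraction(1), x]
    for n in range(1, n_max):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[:n_max + 1]


def legendre_step_two(n, x, p_n, p_n_minus_2):
    """Compute P_{n+2}(x) from P_n(x) and P_{n-2}(x).

    Pass 0 for p_n_minus_2 when n < 2; its coefficient vanishes there anyway.
    """
    scale = Fraction((2 * n + 1) * (2 * n + 3), (n + 1) * (n + 2))
    diag = (Fraction((n + 1) ** 2, (2 * n + 1) * (2 * n + 3))
            + Fraction(n ** 2, (2 * n + 1) * (2 * n - 1)))
    lower = Fraction(n * (n - 1), (2 * n + 1) * (2 * n - 1))
    return scale * ((x * x - diag) * p_n - lower * p_n_minus_2)
```

Exact rational arithmetic is used so the two routes can be compared term by term; in floating point the two recurrences accumulate rounding error differently, which is the kind of precision comparison the abstract mentions.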

Approximate Maximin Share with Subjective Divisibility: Beating the 1/2 Barrier

Xiaohui Bei, Ke Ding, Bo Li, Fangxiao Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13057v1 Announce Type: new Abstract: Maximin share (MMS) stands out as a central notion in fair resource allocation. It is known that exact MMS fairness is not always attainable, especially when agents differ along two dimensions: their valuations and their perceptions of the divisibility of resources. The former case with heterogeneous valuations has been widely studied in the literature. The latter, referred to as subjective divisibility by Bei et al. [Games Econ. Behav. 2025], remains much less explored. We study MMS approximation under subjective divisibility. First, we prove that even in the unary valuation setting, where all items have equal value, the optimal approximation ratio is 2/3. This result is somewhat surprising since in the objective setting, even when agents have heterogeneous valuations, the best possible approximation ratio is at least 7/9 [Huang and Zhou, 2025]. We then address the general case with both valuation heterogeneity and subjective divisibility. Previous work shows the existence of a 1/2-approximate MMS allocation. In this paper, we develop new algorithmic techniques that overcome the difficulties posed by subjective divisibility, and improve the approximation guarantee to 5/9. Finally, we complement this result with small-agent cases. For up to four agents, we give polynomial-time algorithms that compute 2/3-approximate MMS fair allocations. These bounds are tight. Our results deepen the understanding of MMS fairness under heterogeneous valuations and subjective divisibility, and provide a new perspective for this emerging model.
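
For orientation, the maximin share of an agent in the classical setting with indivisible items and additive, non-negative values can be computed by brute force as below. This sketch covers only that standard definition. It does not model subjective divisibility, and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def maximin_share(values, n_agents):
    """Best achievable worst-bundle value over all partitions into n_agents bundles.

    Exponential-time enumeration, intended only for tiny instances.
    Assumes additive, non-negative item values.
    """
    best = 0
    for assignment in product(range(n_agents), repeat=len(values)):
        bundles = [0] * n_agents
        for value, owner in zip(values, assignment):
            bundles[owner] += value
        best = max(best, min(bundles))
    return best


def is_approx_mms(bundle_value, mms, ratio):
    """Check whether a bundle meets a `ratio`-approximate MMS guarantee."""
    return bundle_value >= ratio * mms
```

For example, with item values `[3, 3, 2, 2, 2]` and two agents the best partition is {3, 3} versus {2, 2, 2}, so the maximin share is 6, and a 2/3-approximate allocation must give that agent a bundle worth at least 4.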

A green solvent screening tool for emerging materials via uncertainty aware, transformer enhanced transfer learning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13060v1 Announce Type: new Abstract: Accurate prediction of solubility remains a central challenge across materials science and sustainable chemistry. In particular, due to emerging technologies such as organic and hybrid photovoltaics, batteries, and catalysis, solvent usage is expected to increase significantly within the coming years. Therefore, substituting solvents with greener alternatives is vital. This is where machine learning can have substantial impact. However, the limited data on critical parameters of solubility significantly constrains machine learning efficacy. In this work, we transfer a foundation model pre-trained on QM9 targets to our application with minimal data requirements. Additionally, the pipeline integrates uncertainty quantification, allowing the user to gauge the confidence of the predictions. As a baseline, we succeed in predicting the Hansen solubility parameters and the dielectric constant, for which extensive databases exist. Importantly, we achieve high model performance on additional targets, such as Gutmann donor and acceptor numbers, where the available data is extremely limited. Overall, we augment data on solubility descriptors by orders of magnitude with high-quality predictions. For effective dissemination, we deploy an easy-to-use, customizable tool for ranking and screening possible solvent substitutes that integrates readily with high-throughput labs. Finally, we rediscover known green solvent alternatives and propose new candidates, demonstrating the tool's relevance for finding eco-friendly solvents.

LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13061v1 Announce Type: new Abstract: Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm suffers from two core limitations: (i) autoregressive CoT reasoning incurs high computational cost, making it impractical for low-latency retrieval; and (ii) embedding performance is heavily coupled with CoT annotation quality, making large-scale training unreliable. These raise fundamental questions: Is textual CoT the optimal form of reasoning for embedding, and can effective embedding reasoning be accomplished in latent space? To this end, we propose LaME (Latent Reasoning Multimodal Embedding), which formulates embedding-oriented latent reasoning as a weakly supervised information bottleneck. LaME employs K learnable reason tokens as a fixed-capacity bottleneck, completing all reasoning within a single forward pass. The two weak supervision signals structurally decouple contrastive from autoregressive objectives and eliminate dependence on CoT annotations, while a two-stage training pipeline ensures stable convergence. Experiments on MMEB-v2 and MRMR show that LaME achieves competitive performance, surpassing some explicit CoT-based models, while delivering 60x faster inference than explicit CoT methods and 2x faster than latent baselines with throughput comparable to discriminative embedding models. Code will be released.

A Quadratic Order Reduction -- Gaussian Process Ordinary Differential Equation framework for the inference of Large Continuous Dynamical Systems

Guglielmo Padula, Michele Girfoglio, Gianluigi Rozza — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13063v1 Announce Type: new Abstract: Forecasting the evolution of complex dynamical systems remains a fundamentally challenging task, primarily due to pronounced nonlinear interactions, high-dimensional state spaces, and the concomitant requirement for rigorous and reliable uncertainty quantification. Contemporary reduced-order modelling (ROM) frameworks frequently exhibit inherent trade-offs among predictive accuracy, numerical stability, and interpretability, and thus often fail to achieve an optimal balance among these competing objectives. To address these limitations, we propose a framework for forecasting complex dynamical systems via a kernel autonomous ordinary differential equation approach based on Gaussian Processes and Quadratic Order Model Reduction. Our base method, the Gaussian Process Ordinary Differential Equations model, allows accurate short-term forecasting with uncertainty quantification, and it provably converges to the real autonomous equation in the smooth case. We integrate it with quadratic order reduced-order modelling and sphere projection for learning the latent dynamics efficiently while preserving stability. Numerical experiments demonstrate that our full model outperforms ROM forecasting methods such as Extended Dynamic Mode Decomposition, Bagging Optimised Dynamic Mode Decomposition and Linear and Nonlinear Disambiguation Optimisation in terms of accuracy or computational costs. These results demonstrate the potential of the framework as a robust and stable tool for forecasting complex dynamical systems with rigorous uncertainty quantification.

Limits of spectral learning under noise

Sabin Roman, Ljupco Todorovski, Saso Dzeroski, Marta Sales-Pardo, Roger Guimera — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13067v1 Announce Type: new Abstract: Learning functional relationships from noisy data is a central problem in scientific inference. Spectral methods approximate unknown functions by expanding them in a basis and estimating the corresponding coefficients from data, but the stability of these coefficients under noise remains poorly understood. Here we study supervised regression with additive label noise using sparse spectral representations across multiple bases and dimensions. We show that noise induces a predictable drift in the learned coefficient vector whose magnitude depends on the effective number of active spectral modes. After whitening the empirical feature geometry, we derive a closed-form expression for the overlap between noisy and noiseless coefficient vectors, revealing a universal degradation curve governed by a single intrinsic noise scale. Numerical experiments across Fourier, Legendre, Bessel, and Haar bases confirm the theoretical prediction. The results demonstrate that spectral learning exhibits a fundamental noise threshold beyond which coefficient estimates become unstable, placing intrinsic limits on recovering functional structure from noisy data.

Effects of Social Interactions in Self-Organising Railway Traffic Management

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13068v1 Announce Type: new Abstract: Recent research is exploring self-organised traffic management as a solution for scaling to complex real-world networks. In such a system, trains predict their neighbourhood, produce traffic plan hypotheses, and agree via consensus with neighbours on a future traffic plan to be implemented. This paper investigates a structural parameter within this pipeline: the predictive neighbourhood horizon. The horizon is used by trains to identify future potential conflicts with neighbours, and to establish the local interaction topology, that is, the subset of trains to negotiate with. As the primary design variable, the horizon directly determines the size and density of the social interaction graph, whereas its impact on the complexity of local sub-problems and the distributed consensus dynamics represents a trade-off to be explored. Through a closed-loop simulation framework, the study evaluates how variations of the horizon impact the overall decentralised coordination process, from initial conflict detection to distributed schedule consensus. The analysis focuses on the potential trade-off introduced by the horizon choice: balancing local tractability and computational responsiveness with the need for global schedule coherence and feasibility in safety-critical environments. Contrary to intuition, our empirical results indicate that short time horizons suffice, while long ones compromise local tractability and computational responsiveness with no gain in global schedule optimality.

Modular Multi-Domain Digital Twin Architecture: Sustainable Intent-Driven 6G Management

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13069v1 Announce Type: new Abstract: Future 6G networks will operate across distributed and heterogeneous domain infrastructures, making conventional single-domain management insufficient for proactive, trustworthy automation. Network Digital Twins (NDTs) enable what-if analysis, AI-assisted optimization, and risk-free validation of control actions before deployment, yet monolithic end-to-end twins remain impractical due to scalability, fidelity, and cross-domain coordination challenges. Accordingly, this paper proposes a Digital Twin-enabled 6G architecture that exposes NDT capabilities as a specialized service domain within a multi-domain orchestration framework built on a state-of-the-art service-based 6G architecture. A DT Orchestrator interprets predictive and prescriptive what-if queries and composes domain-specific DT modules and simulators on demand, while decision authority remains with the requesting entity. Furthermore, a generalized workflow covers telemetry synchronization, simulation-based decision support, and closed-loop execution. The framework is demonstrated through a green-networking use case that couples a system-level O-RAN cellular digital twin component with a two-stage solar-allocation simulator, evaluated over a 105-base-station deployment in Poznan using simulated datasets. Joint coverage and renewable optimization reduces daily grid consumption by 28.5% with 32 solar panels at the diminishing-returns threshold, and 17 base stations are identified as both coverage-active and high-priority solar candidates, evidence that cross-domain NDT coordination enables sustainable, intent-driven 6G network management.

"Is This Not Enough?": Asymmetries in Institutional Accountability and Collective Sensemaking in the Case of Canada's Algorithmic Visa Triage System

Dipto Das, Matthew Tamura, Syed Ishtiaque Ahmed, Shion Guha — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13071v1 Announce Type: new Abstract: This paper examines how algorithmic accountability in Canada's visa system is articulated institutionally and experienced by applicants across borders. We analyzed Immigration, Refugees and Citizenship Canada (IRCC)'s Algorithmic Impact Assessment (AIA) for the temporary resident visa (TRV) triage system using the algorithmic decision-making adapted for the public sector (ADMAPS) framework and analyzed Reddit discussions among applicants using a mixed-methods approach. We show that while institutional artifacts emphasize transparency, procedural safeguards, and bounded impacts, applicants engage in collective sensemaking to interpret opaque decisions, often relying on peer knowledge amid uncertainty. We identify three asymmetries between how institutional accountability is structured and how people perceive the process: epistemic asymmetry in access to decision logic, jurisdictional asymmetry in exposure shaped by geopolitical positioning, and temporal-relational asymmetry in how waiting and uncertainty are experienced. We emphasize why it is important to shift attention from institutional design to the uneven distribution of experiences with public-sector algorithmic governance. Together, these contributions demonstrate how algorithmic governance systems in the context of transnational migration produce structured asymmetries not captured by institutional disclosure frameworks, and how extending ADMAPS can account for those uneven translations of accountability.

$\alpha$-fair heterogeneous agent reinforcement learning

Yao-hua Franck Xu, Tayeb Lemlouma, Jean-Marie Bonnin, Arnaud Braud — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13076v1 Announce Type: new Abstract: Cooperation in multi-agent systems is typically optimized through utilitarian objectives that maximize overall efficiency but fail to account for reward distribution, often resulting in inequitable "leader-follower" dynamics. While fairness-based approaches encourage pro-social behaviors where every agent benefits from cooperation, many current algorithms - including those utilizing reward shaping - break the stationarity of Markov Games or lack rigorous theoretical guarantees. This creates a critical gap between fair objective methods and theoretically safe learning frameworks. We propose a novel framework that bridges $\alpha$-fairness with Heterogeneous-Agent Trust Region Learning (HATRL), ensuring monotonic improvement and convergence toward Nash Equilibria. Our approach leverages a fair advantage function that dynamically weights agent utilities based on their expected returns, allowing the global objective to transition from purely utilitarian efficiency to $\alpha$-fairness welfare based on the parameter $\alpha$. We introduce two practical algorithms, $\alpha$-fair HATRPO and $\alpha$-fair HAPPO, and demonstrate through experiments in sequential social dilemmas like CleanUp and CommonHarvest that they perform better than HATRL's algorithms from a utilitarian point of view while achieving socially higher outcomes.
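
The transition from utilitarian efficiency to fairness that the abstract attributes to the parameter $\alpha$ can be illustrated with the textbook $\alpha$-fair welfare function. The sketch below is not the paper's fair advantage function or either of its algorithms; it only assumes the standard definition (sum of $u^{1-\alpha}/(1-\alpha)$, with the sum of logarithms at $\alpha = 1$), and the return profiles `unequal` and `equal` are hypothetical.

```python
import math
from typing import Sequence


def alpha_fair_welfare(utilities: Sequence[float], alpha: float) -> float:
    """Textbook alpha-fair welfare of strictly positive utilities.

    Assumption: this is the standard definition, not necessarily the
    exact objective used in the paper.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    if any(u <= 0 for u in utilities):
        raise ValueError("utilities must be strictly positive")
    if math.isclose(alpha, 1.0):
        # alpha = 1 is the proportional-fairness limit: sum of logs.
        return sum(math.log(u) for u in utilities)
    # alpha = 0 reduces to the plain utilitarian sum.
    return sum(u ** (1.0 - alpha) / (1.0 - alpha) for u in utilities)


# Two hypothetical two-agent return profiles with the same total return.
unequal = [9.0, 1.0]
equal = [5.0, 5.0]

for alpha in (0.0, 1.0, 2.0):
    print(alpha, alpha_fair_welfare(unequal, alpha), alpha_fair_welfare(equal, alpha))
```

At $\alpha = 0$ the two profiles score the same, since only the total matters. For larger $\alpha$ the equal profile scores strictly higher, which is the sense in which $\alpha$ moves the objective away from "leader-follower" outcomes.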

The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13079v1 Announce Type: new Abstract: Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

Emotional regulation improves deep learning-based image classification

Riccardo Emanuele Landi, João M. F. Rodrigues, Marta Chinnici — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13081v1 Announce Type: new Abstract: Emotion significantly influences cognition, enhancing memory and learning under certain conditions. Drawing on this principle, emotion-augmented deep learning investigates how affective states can improve neural network architectures and learning paradigms, achieving better generalization than non-emotional models. However, existing methods often rely solely on objective neurophysiological factors, neglecting the role of subjectivity in emotion. To bridge this gap, the present study introduces Emotional Regulation, a novel framework for modeling emotion in deep learning through artificial subjective experience. The method employs pre-training based on affective stimuli, balancing non-emotional and emotionally-influenced responses in downstream task optimization. Extensive experimentation was conducted in image classification, pre-training ResNet and ViT architectures on four emotional datasets, using CIFAR-10 and -100 as target benchmarks. Results reveal improvements over the aforementioned backbones, providing evidence of Emotional Regulation as a promising method for defining emotion-augmented deep learning through artificial subjective experience. Furthermore, the proposed approach outperforms related work in image classification based on CIFAR, revealing Emotional Regulation as the new state-of-the-art in emotion-augmented deep learning for large-scale vision datasets. The study also reinforces evidence of the impact of affective states in improving machine learning tasks' optimization, encouraging further investigation on emotion-inspired architectures.

sebis at CRF Filling 2026: A Two-Stage Local LLM Pipeline for Medical CRF Filling

Katharina Sommer, Tristan Till, Florian Matthes — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13082v1 Announce Type: new Abstract: The extraction of structured clinical information from unstructured EHR notes is a persistent bottleneck in healthcare informatics. While large language models (LLMs) offer high performance, their deployment in clinical settings is hindered by privacy risks, inference costs, and the tendency to hallucinate beyond textual evidence. We address these challenges for the CL4Health 2026 Case Report Form (CRF) filling task by proposing a fully local, domain-adapted pipeline using the MedGemma-27B model. Our two-stage architecture, which separates binary presence classification from value extraction, enforces strict adherence to textual evidence and ensures deterministic outputs for negated, uncertain, or unknown states. By leveraging item-specific, few-shot in-context learning without external API calls or fine-tuning, our approach achieves a macro-F1 score of 0.55 on the official English test track. This result secures second place among all locally-hosted, open-source submissions. Our work demonstrates that privacy-preserving, on-premise LLM pipelines can achieve near-competitive performance with proprietary frontier models, providing a practical, data-sovereign framework for clinical NLP.
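
For readers unfamiliar with the reported metric, macro-F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as frequent ones. The sketch below is a generic illustration and not the task's official scorer, whose label handling may differ. The labels `present`, `absent`, and `unknown` are hypothetical stand-ins for CRF item states.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred), key=str)
    if not labels:
        raise ValueError("inputs must be non-empty")
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold and predicted item states for four CRF items.
gold = ["present", "absent", "absent", "unknown"]
pred = ["present", "absent", "unknown", "unknown"]
print(macro_f1(gold, pred))
```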

Leveraging Matchings in Constrained Fair Division with a Conflict Graph

Evangelos Markakis, Michalis Samaris — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13083v1 Announce Type: new Abstract: We study the problem of allocating indivisible goods under constraints, expressed via a conflict graph $G$. In such an instance, the $m$ items are the vertices of $G$ and connected items cannot be allocated in the same bundle. Under this model, it is already known that EF1 allocations may not exist. Our main contribution is an analysis parametrized by the maximum degree $\Delta(G)=\Delta$ on the existence and computation of complete EF1 allocations. We address this question in various cases by leveraging results from matching theory. First, we provide a tight existence result for agents with ordered valuations and for the broader class of tiered valuations. We present an algorithm that returns an EF1 allocation when the number of items does not exceed a specific bound. This bound is determined by $n$ and $\Delta$, and it is tight when $\Delta$ is greater than $2n/3$. We also construct an approximation algorithm when $m$ exceeds this bound. For general additive valuations the problem becomes more challenging. Given the current impossibility results, we focus on the case where the number of items is at most $2n$. For this case, we provide an almost complete picture for the instances that admit EF1 allocations, by combining Round Robin with matchings.

Revolutionizing Wireless Communications with Space Data Centers: Applications and Open Challenges

Minghao Sun, Zehui Chen, Jinbo Hou, Kezhi Wang, Xiaoli Chu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13086v1 Announce Type: new Abstract: Space data centers (SDCs) are emerging as a promising orbital computing infrastructure for the future AI industry. Unlike conventional satellites that mainly serve as relay nodes or lightweight onboard processors, SDCs integrate communication, computing, storage, and control capabilities in orbit, enabling persistent service support for data-intensive and intelligence-driven space applications. In this article, we investigate how SDCs may transform space communication paradigms from connectivity-oriented data transmission toward task-oriented and service-centric information exchange. We first present a hierarchical SDC network architecture consisting of access, relay, computing, and control layers, and outline possible deployment strategies. We then explore representative future application scenarios enabled by SDCs, highlighting their communication characteristics and associated research challenges. Simulation results further demonstrate the effectiveness of SDCs in reducing control-layer latency in hierarchical space networks. Finally, we identify key research directions toward the practical deployment of SDCs.

Characterization and Computation of Feedback Nash Equilibria in Scalar Discounted N-Player Linear Quadratic Games

Chiara Cavalagli, Alberto Bemporad, Mario Zanon — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13087v1 Announce Type: new Abstract: This paper studies feedback Nash equilibria (FNE) in scalar discounted linear quadratic (LQ) games with $N$ players. By explicitly incorporating the discount factor, we show that finite-cost equilibria may fail to stabilize the original system, motivating a distinction between FNE and stable FNE together with a sufficient stability condition. Based on a parametric characterization of the policies, we propose numerical methods for computing all equilibria. Particular attention is devoted to the symmetric game, where a closed-form expression of the symmetric FNE and conditions for the existence of up to $M\leq2^N-2$ equilibria are derived. Numerical experiments illustrate how equilibrium multiplicity depends on the game configuration and highlight the emergence of finite-cost non-stabilizing equilibria.

Multi-Objective Coevolution of Prompts and Templates for Circuit Approximation

Martin Tomasovic, Lukas Sekanina — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13089v1 Announce Type: new Abstract: Approximate multipliers deliberately relax computational accuracy to achieve gains in power efficiency, latency, and silicon area, which makes them well-suited for error-resilient applications such as neural networks. In this work, we introduce a co-evolutionary algorithm that leverages an off-the-shelf large language model (LLM) without requiring domain-specific training to automate the design of optimized 8-bit approximate multipliers. The approach simultaneously evolves a population of candidate circuits and a population of prompt templates that steer LLM-driven modifications. Experimental results for several target design objectives demonstrate that the proposed method discovers approximate multipliers with improved error-area trade-offs compared to highly optimized circuits from the EvoApproxLib library.

Scale Buys Interpolation, Structure Buys a Horizon: Certified Predictability for Equivariant World Models

Hongbo Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13092v1 Announce Type: new Abstract: Scale buys interpolation; structure buys a certified horizon. A world model's average error says nothing about whether a particular prediction can be trusted, or for how long. For equivariant latent world models we give a computable, multi-step certificate of the predictable horizon: $T$-step rollout error is provably constant over each symmetry orbit (Theorem A) and stratified channel-by-channel by the predictor's Lyapunov spectrum, $T_j(\epsilon)\sim\log(1/\epsilon)/\lambda_j$. The horizon is two-sided -- a matching lower bound makes approximate equivariance provably horizon-limited -- and the certificate is exclusive to structure: orbit-constant error characterizes equivariance, so no non-equivariant model has it at any scale. Empirically, on 40-D Lorenz-96 only a $\mathbb{Z}_N$-equivariant network recovers the full Lyapunov spectrum ($R^2{=}0.98$); dense and recurrent baselines fail. Because the spectrum is faithful, the certificate acts, a priori: under a fixed sensing budget a $c\times$-inflated certificate provably needs $c\times$ the budget, and the equivariant certificate meets a budget its inflated dense counterpart cannot -- with zero calibration data. The same read-out, unchanged, audits public pretrained world models training-free: TD-MPC2 checkpoints land on the certificate's own scope taxonomy -- calibrated where strongly expansive (ratio 0.94-1.02), optimistic where weakly expansive, correctly abstaining where contracting -- a map a deployed monitor replicates cell-by-cell, out-of-sample. Across the official 1M-317M multitask ladder, calibration does not improve with parameters. On V-JEPA 2-AC (1B, real robot data) the measured cross-check correctly overrides an over-promising tangent spectrum -- the cross-validated audit, not the raw number, is the deployable object. Scale buys interpolation, not a calibrated horizon.
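
The scaling $T_j(\epsilon)\sim\log(1/\epsilon)/\lambda_j$ can be made concrete with a small calculation. The sketch below reads the $\sim$ as an equality with unit constant, which is an assumption; the paper's certificate presumably carries constants not given in the abstract. Returning `None` for non-positive exponents is a convention of this sketch, loosely mirroring the "abstaining where contracting" behaviour, and the example spectrum is hypothetical.

```python
import math
from typing import List, Optional, Sequence


def predictable_horizons(
    lyapunov_exponents: Sequence[float], epsilon: float
) -> List[Optional[float]]:
    """Per-channel horizon log(1/epsilon) / lambda_j.

    Assumptions of this sketch: the proportionality constant is 1, and
    channels with lambda_j <= 0 get None (no finite horizon from this formula).
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    budget = math.log(1.0 / epsilon)
    return [budget / lam if lam > 0 else None for lam in lyapunov_exponents]


# Hypothetical spectrum: two expanding channels and one contracting channel.
spectrum = [1.0, 0.5, -0.2]
print(predictable_horizons(spectrum, epsilon=1e-3))
```

Halving an exponent doubles that channel's horizon, while shrinking $\epsilon$ by a constant factor adds only a constant number of steps. This is why the horizon is stratified channel by channel rather than summarised by one number.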

Equilibrium Computation in Extensive-Form Games with Stochastic Action Sets

Thomas Schwarz, Ryann Sim, Chun Kai Ling — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13093v1 Announce Type: new Abstract: Extensive-form games (EFGs) are a standard model for sequential decision-making in games. A fundamental and typically implicit assumption in EFGs is that players always have access to all of their actions at every decision point. However, in many realistic settings, certain actions might be unavailable during game-play due to exogenous stochasticity, hindering the expressivity of the standard EFG model. Given a 'base' EFG, we formalize a model that allows for actions to be stochastically restricted, leading to a corresponding Extensive-Form Game with Stochastic Action Sets (EFGSAS). In EFGSAS, we derive an expansion procedure that results in an equivalent EFG, thus showing that standard strategy formalisms could require exponentially-large representations. However, under an appropriate independence assumption, we show that compact strategy representations polynomial in the size of the base EFG exist. Computationally, we introduce an algorithm called SI-CFR that minimizes sleeping internal regret, converging to Nash equilibria with high probability in two-player zero-sum EFGSAS. Finally, we utilize a stochastic approximation procedure to recover compact representations of Nash equilibria, utilizing only the iterates of SI-CFR.

Unified MRI Brain Image Translation via Hierarchical Tumor Structure Comparison

Yupeng Cai, Jia Wei, Jianlong Zhou — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13096v1 Announce Type: new Abstract: Multi-modal MRI brain image translation via available modalities holds significant practical importance in modern medicine, providing robust support for early diagnosis, treatment planning, and outcome assessment of diseases. For this purpose, it is important to ensure the fidelity of the tumor regions after translation. However, existing brain image translation methods ignore the structure information of different tumor regions, which could assist translation models in enhancing the quality and clinical applicability of the translated images. In this work, we propose a novel translation model called HTSCGAN, which is a unified multi-modal brain image translation generative adversarial model integrating the structural information within tumor regions with the aim of improving the quality of brain image translation. Specifically, the generator employs three Patch Contrast Modules (PCMs) with different patch sizes to capture the hierarchical structural information of the tumor regions. In addition, a pretrained Patch Classifier (PC) and a pretrained Structure-Aware Encoder (SAE) are employed to derive the generated image containing the same tumor region structure as the ground truth image via patch classification loss and tumor perceptual loss, respectively. The experiments on BraTS2020 and BraTS2021 demonstrate strong performance of our model in both translation tasks and downstream segmentation tasks, highlighting its effectiveness in enhancing the quality and clinical relevance of the translated brain images. Our code is available at https://anonymous.4open.science/r/HTSCGAN.

Functional Cache Grafting: Robust and Rapid Code-Policy Synthesis for Embodied Agents

Saehun Chun, Wonje Choi, Sera Choi, Sanghyun Ahn, Honguk Woo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13097v1 Announce Type: new Abstract: Code-writing large language models (CodeLLMs) generate executable code policies for embodied agents by translating natural language goals and environmental constraints into structured control programs. However, policy generation in open-domain embodied environments suffers from two fundamental limitations: (i) delayed decoding caused by repetitive prefill computation over long prompts, and (ii) limited robustness due to fully generative decoding, which often produces API mismatches, missing safety guards, and unstable control logic. To address these limitations, we present FCGraft, a Functional Cache Grafting framework. FCGraft maintains a library of function-level validated code skeletons and their associated prompt-level Transformer key-value (KV) caches, and synthesizes new policies by retrieving relevant functions and grafting their KV caches when a new task is provided. Given retrieved function caches, FCGraft performs cache grafting via stitching, which composes cached function segments into a composite policy, and patching, which locally adapts only the necessary code regions to satisfy task-specific parameters and constraints with minimal additional decoding. By eliminating redundant prefill computation, this approach reduces generation latency, while reusing validated control structures improves robustness over prompt-level caching methods such as RAGCache, achieving 18.31% higher task success rate and 2.3x faster policy synthesis.

Tracking in-silico Lagrangian sensors in a lab-scale stirred tank reactor

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13099v1 Announce Type: new Abstract: Lagrangian sensors have shown promise to improve operator awareness of conditions inside a chemical reactor, but three-dimensional tracking remains a mostly unsolved challenge. We explore a setup where in-silico sensors, based on a recently proposed real-world design, are tracked using data from an accelerometer and magnetometer available from a built-in inertial measurement unit. Filtering algorithms, using a bespoke dynamical model, are used to process these readings into position estimates. We compare tracking performance of an extended Kalman filter, a particle filter and the unscented Kalman filter implemented in the pykalman library. Our numerical experiments track in-silico particles moving in an analytically given three-dimensional vortex as well as in the experimentally measured flow-field of a lab-scale stirred tank reactor. Using the Maxey-Riley-Gatignol equations for the movement of inertial particles as ground-truth, we demonstrate that trajectories can be reconstructed from noisy synthetic data with errors below 10%.

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13100v1 Announce Type: new Abstract: Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most public financial resources reduce the task to plain-text SEC 10-K filings paired with a handful of question-answer items. We release LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate annual reports - full documents with figures, tables, and narrative, not just regulatory filings. Each report is labeled with 31 consolidated financial KPIs to be extracted and linked to the market's reaction at the earnings date. From this data we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 questions in natural language, a conversational "needle-in-a-haystack" single-value lookup, and a full KPI extraction task, both from long, numerically dense reports. We additionally provide human OCR-quality annotations with inter-annotator agreement and the complete extraction, validation, and scoring toolchain. We further demonstrate the dataset's research utility with a case study linking CEO-letter rhetoric to post-publication market impact.

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13102v1 Announce Type: new Abstract: Despite the success of vision-based generalist robotic policies, existing tactile-based policies remain tied to fixed embodiments and sensor setups. This is because tactile signals are highly heterogeneous across hardware, making cross-sensor generalization difficult. We present FTP-1, the first generalist foundation tactile policy pretrained to acquire transferable tactile manipulation abilities across diverse sensors and embodiments. FTP-1 supports varied tactile inputs, including image-, array-, and state-based signals, by using heterogeneous encoders to project them into unified morphology-aware latent tokens that are jointly modeled by a shared tactile Transformer expert. Pretrained on around 3,000 hours of tactile manipulation data aggregated from 26 data sources, spanning human and robot demonstrations across 21 sensors, FTP-1 learns tactile skills that transfer beyond the sensors seen during pretraining. Across downstream finetuning experiments spanning 5 hardware configurations, FTP-1 improves contact-rich manipulation on seen sensor setups by +17.2% and, surprisingly, transfers to two previously unseen tactile-sensor setups, achieving a +31% gain in success rate. FTP-1 establishes the first unified foundation baseline for tactile manipulation, providing future tactile policies with a shared model-level starting point. Pretrained models, datasets, training code, and more visualizations are available at https://ftp1-policy.github.io.

Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

Aryan Khurana, Aravind Ramana RN, Dhruv Kumar — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13104v1 Announce Type: new Abstract: Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: https://github.com/floating-reeds/AuthorityBench
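
The balanced 2x2 factorial design described above, crossed with the four domains, can be enumerated directly. The sketch below only illustrates the cell structure; it is not AuthorityBench's generation code. The field names `domain`, `claim_is_true`, and `citation_is_real` are hypothetical, and the no-citation baseline, templates, prestige tiers, and author names are left out.

```python
from itertools import product

DOMAINS = ["general knowledge", "science", "law", "medicine"]
CLAIM_VERACITY = [True, False]      # the claim is true or false
CITATION_VERACITY = [True, False]   # the citation is real or fabricated

# One dictionary per experimental cell: domain x claim veracity x citation veracity.
cells = [
    {"domain": domain, "claim_is_true": claim, "citation_is_real": citation}
    for domain, claim, citation in product(DOMAINS, CLAIM_VERACITY, CITATION_VERACITY)
]

# The condition the abstract reports as strongest: true claim, fabricated citation.
strongest = [c for c in cells if c["claim_is_true"] and not c["citation_is_real"]]

print(len(cells), len(strongest))
```

Crossing the two factors fully means that any difference between cells sharing a claim veracity can be attributed to the citation alone, which is what lets the benchmark separate authority signals from factual content.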

Disparate Impact in Synthetic Data Generation

Paul Andrey, Michaël Perrot, Batiste Le Bars, Marc Tommasi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13105v1 Announce Type: new Abstract: We revisit the fairness notion of disparate impact for synthetic data generation (SDG), which assesses whether the utility of generated records is the same across sensitive groups. Our approach departs from existing work on fair SDG, which addresses the problem of correcting for undue biases in the observed distribution, hence redefining SDG as learning a distribution that is not that of the real data. By contrast, non-disparate impact is notably achieved when the synthetic and real distributions are the same. We expose reasons why SDG may fail to reach that solution and discuss why approximation and estimation errors occur and can be disparate across groups. We notably look into the expressive power of SDG methods relative to distribution complexity, sampling errors due to group proportions, and estimation errors induced by differential privacy mechanisms. We illustrate cases of disparate impact on both artificial and real-world data, focusing on SDG methods that rely on probabilistic graphical models. We also introduce a strategy of learning group-wise SDG models and illustrate how it can improve both the overall utility and its parity in many settings.

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13106v1 Announce Type: new Abstract: Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits an entry token to enter latent mode and an exit token to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) the entry token reflects a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.

The Invisible Ink of the Android Malware World: A Longitudinal Study on the Usage of Covert Communication Channels

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13107v1 Announce Type: new Abstract: Proxies, VPNs and Tor have long helped the privacy community and users in censored regions to fight censorship. However, the same tools can be maliciously exploited by malware and botnets to conceal their communication to external command and control servers. Despite being a critical concern fueled by the proliferation of malware-based attacks, no longitudinal studies have analyzed how malware applications use covert channels (CC) to evade detection. We fill this gap by performing the first study of the usage of covert channels in the Android malware ecosystem. To that end, we develop a multistage pipeline that combines static and dynamic analysis to investigate both system and network-level features. We applied this pipeline on a corpus of 3.5M Android malware samples spanning 2009 to July 2025. Our carefully crafted static validation rules uncovered 288K APKs that used CCs, spanning 511 malware families, with CC usage growing exponentially from 0.30% (2012) to 50% (2025). Overall, in dynamic analysis, we identified 19,308 unique IP addresses being contacted in 85 countries, out of which we were able to explicitly validate the presence of CCs for 59 IP addresses across 17 countries. Further, we performed a longitudinal dataset study spanning over 16 years for CC-based malware and found that CC usage has evolved, e.g., some malware adopted more than one CC, while others switched between them periodically (one family switched CC usage 40 times from 2019 to 2025).

PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13108v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have achieved impressive results on general vision-language tasks, yet they suffer from hallucination, imprecise localization, and prohibitive computational cost when applied to dedicated OCR scenarios. This paper presents PP-OCRv6, a lightweight OCR system that combines architectural innovation with data-centric optimization. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization, decoupling spatial token mixing from channel mixing and supporting both tasks through task-specific stride configurations. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge. On our in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by +5.1% and +4.6% respectively while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier achieves 3.9$\times$ faster inference than PP-OCRv5_mobile on Intel Xeon CPU while maintaining comparable accuracy.

MÖVE: A Holistic LLM Benchmark for the German Public Sector

Camilla Dalerci, Thilo Michael, Robin Schaefer, Daniel Weinland — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13111v1 Announce Type: new Abstract: We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. MÖVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, alignment with German constitutional values, and knowledge about the positions of German political parties. In total, we utilize ten German-language datasets, including gold- and silver-standard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. MÖVE is designed as a living benchmark under active development; results are publicly available at https://moeve.bundesdruckerei.de/.

MPC for underactuated spacecraft control with a Lyapunov supervised physics-informed neural network correction layer

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13113v1 Announce Type: new Abstract: Underactuated spacecraft face controllability limitations and heightened sensitivity to environmental disturbances, complicating attitude maneuvering and stabilization. Due to the lack of control authority along the underactuated axis, conventional controllers cannot directly stabilize all attitude components and therefore require reference planning strategies. Furthermore, MPC approaches remain sensitive to inertia uncertainty and unmodeled dynamic couplings, resulting in degraded tracking performance under mismatch. To address these issues, we consider a hierarchical architecture integrating three layers: (i) a nonlinear model predictive controller (NMPC) for constraint- and underactuation-aware maneuver planning and nominal closed-loop stability under actuator limits; (ii) a physics-informed neural network (PINN) trained offline on simulation data to estimate residual disturbance torques, with loss terms that enforce consistency with rigid-body rotational dynamics; (iii) a Lyapunov-based supervisory safety mechanism that evaluates the learned correction online and bounds or suppresses its influence to preserve the stability properties of the baseline controller. The architecture is evaluated in a high-fidelity simulation environment modelling reaction wheel dynamics, actuator saturation, and environmental disturbances. Monte Carlo studies show statistically significant reductions in steady-state attitude error relative to standalone NMPC while maintaining robust behavior under uncertainty. The supervisory layer ensures graceful degradation to purely model-based control when the learning-based augmentation is unreliable.

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

Minjun Choi, Yoonjin Jang, Sangwon Youn, Youngjoong Ko — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13115v1 Announce Type: new Abstract: While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches typically rely on either unstructured memory storage, which is prone to information loss, or computationally expensive LLMs that incur high latency. To address these limitations, we propose G-Long, a graph-enhanced framework that utilizes a fine-tuned small Language Model (sLM) for structured triplet extraction and associative retrieval, significantly reducing operational costs. Furthermore, we introduce a novel attention-aware importance scoring mechanism that leverages the intrinsic cross-attention signals of a T5 summarizer to identify salient memories. Extensive experiments across diverse benchmarks demonstrate that G-Long achieves state-of-the-art performance in both response generation and memory retrieval, yielding performance gains of up to 9.8% in response quality on MSC and 40.8% in retrieval recall on LME, while significantly minimizing computational overhead.

MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting

Lilan Peng, Yandi Liu, Qingren Yao, Chongshou Li, Tianrui Li — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13119v1 Announce Type: new Abstract: Spatio-temporal forecasting is crucial in diverse fields, such as transportation, climate, and energy. Urban spatio-temporal data exhibits temporal mirage: similar short-window inputs have divergent future trends, and vice versa. Existing spatio-temporal graph neural networks (STGNNs) cannot effectively identify such mirages. We argue that the core reason lies in the short-window inputs that have incomplete period observation, heterogeneous global spatial correlation, and cross-period superposition causality. To bridge this gap, we develop a novel Multi-Period Pattern Pre-training (MP3), a plug-and-play pre-training plugin for distinguishing temporal mirages. MP3 presents two core innovations: (1) The multi-period pattern learning is designed to learn multi-period patterns from long time series. Specifically, multi-period temporal modeling leverages edge convolution to identify different multi-period patterns. Multi-period spatial modeling uses a bottleneck projection and a global memory bank to capture heterogeneous global spatial relations efficiently. Cross-period pattern interaction employs a causality-enhanced Transformer to capture dependencies across different period patterns. (2) This plugin can seamlessly integrate into existing STGNN backbones to strengthen their forecasting performance. Experiments on five STGNN baselines across five real-world datasets (including the large-scale dataset CA) verify the effectiveness, superior scalability, and strong adaptability of MP3, which brings consistent and robust performance improvements across all evaluated baselines. On average, MP3 reduces the MAE by 4.7% and the RMSE by 5.0%. The code is available at https://github.com/YAN-outlook/MP3.

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng, Fandong Meng — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13120v1 Announce Type: new Abstract: Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall rather than genuine retrieval, obscuring true browsing competence via reasoning shortcuts. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge in terms of credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm its great difficulty, requiring broad horizontal search. It establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

Dongwook Lee, Youngho Cho, Sangkwon Park, Heeseung Kim, Sungroh Yoon — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13121v1 Announce Type: new Abstract: Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-time alternative to the high latency of consecutive translation. However, the excessive pursuit of low latency often results in fragmented chunk-wise speech. Consequently, listeners are subjected to an unnatural acoustic flow punctuated by frequent pauses, which could increase their cognitive load. To bridge this gap, we introduce a fluency-aware optimization framework designed to discover the sweet spot between the low-latency benefits of simultaneous translation and the natural flow of consecutive translation. Our framework minimizes inter-chunk silences by leveraging model-internal signals, including linguistic diversity and induced temporal variability in speech durations. Experiments on short- and long-form benchmarks show that our framework produces natural speech flow while maintaining competitive latency and translation quality.

Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13125v1 Announce Type: new Abstract: Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.

MiniPIC: Flexible Position-Independent Caching in <100LOC

Nathan Ordonez (IBM Research), Thomas Parnell (IBM Research) — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13126v1 Announce Type: new Abstract: Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible and fast vLLM design built from two ingredients: positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing and token-level primitives: block-aligned padding, span separator (SSep), and prompt depend (PDep), that modify hashing behavior and effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.

Fully Distributed Multi-View 3D Tracking in Real-Time

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13127v1 Announce Type: new Abstract: Multi-camera tracking with overlapping fields of view typically relies on centralized fusion, which creates computational bottlenecks that prevent deployment at scale. We present MV3DT, a fully distributed framework for real-time multi-view 3D tracking that achieves accurate identity propagation and occlusion recovery through peer-to-peer coordination, eliminating the need for central aggregation. Each camera node executes a lightweight modular pipeline comprising monocular 3D perception, distributed multi-view association, and collaborative fusion via lightweight messaging. MV3DT achieves 94.3% IDF1 and 93.3% MOTA on WILDTRACK, competitive with state-of-the-art centralized methods, while demonstrating superior scalability by sustaining 30 FPS on 100 cameras with less than 10 ms inter-camera latency and only 2.2% communication overhead. MV3DT operates in a zero-shot regime given camera calibrations, requiring no scene-specific learning and making it directly deployable in new environments. These results establish MV3DT as a practical solution for real-time multi-view tracking in large-scale overlapping camera networks.

Learning-Augmented Approximation for Unrelated-Machines Makespan Scheduling

Kaito Baba, Evripidis Bampis, Giorgos Mitropoulos — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13133v1 Announce Type: new Abstract: Recently, Antoniadis et al. (ICLR 2025) proposed a framework for incorporating predictions to approximate NP-hard selection problems. Despite its simplicity, this approach tightly matches theoretical lower bounds, making its generalization highly compelling. We address an open question raised in the work of Antoniadis et al., concerning the extension of this approach to other important problems outside the class of selection problems, such as scheduling. We develop a learning-augmented algorithm for the makespan minimization problem on unrelated machines, denoted by $R\|C_{\max}$. By using predictions of heavy job assignments, we achieve a polynomial-time $(1+\varepsilon)$-approximation for accurate predictions that smoothly degrades to a worst-case 2-approximation as the error increases. We conclude our work with an empirical analysis of our method.

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13135v1 Announce Type: new Abstract: Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.
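
The two-stage cascade and its tunable triage threshold can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the probability inputs, and the toy scores used in the usage note are all hypothetical, and the second stage is reduced to an argmax over three subtype probabilities.

```python
def cascade_predict(p_malignant, subtype_probs, triage_threshold=0.5):
    """Two-stage decision (illustrative only).

    Stage 1: binary triage on the malignancy probability.
    Stage 2: choose among the malignant subtypes only if stage 1 flags the lesion.
    """
    if p_malignant < triage_threshold:
        return "benign"
    # Argmax over the subtype probabilities (e.g. MEL / SCC / BCC).
    return max(subtype_probs, key=subtype_probs.get)


def triage_sensitivity(p_malignant_scores, is_malignant, triage_threshold):
    """Fraction of truly malignant cases that stage 1 flags at this threshold."""
    flagged = [
        p >= triage_threshold
        for p, truth in zip(p_malignant_scores, is_malignant)
        if truth
    ]
    return sum(flagged) / len(flagged)
```

With hypothetical scores `[0.9, 0.4, 0.35, 0.2, 0.1]` and the first three cases malignant, lowering `triage_threshold` from 0.5 to 0.3 raises stage-1 sensitivity from one third to one. A single-stage argmax classifier has no comparable knob.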

An Extensible and Lightweight Unified Architecture for Demosaicing Pixel-bin Image Sensors

Saurabh Kumar, Nutan Sairam Yenneti — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13136v1 Announce Type: new Abstract: Pixel-bin image sensors are becoming the default choice for smartphone cameras due to their resolution vs light-gathering trade-off. However, their larger inter-color separation compared to the Bayer color filter array (CFA) makes them challenging to demosaic. Furthermore, existing deep learning-based demosaicing methods are CFA-specific, requiring multiple individual models that take up precious onboard resources and demand larger development and maintenance efforts. In this work, we propose a modular unified architecture for demosaicing various pixel-bin sensors that provides higher image quality while being extensible and lightweight. Additionally, to enable plug-and-play operation, we introduce a learning-free CFA-identification module to detect the CFA type of raw data accurately.

The Limits of Time

B. Biira, Amelia Lee Doğan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13138v1 Announce Type: new Abstract: The LIMITS community was founded to foster conversations that move away from growth-oriented visions and values in computing toward a focus on long-term well-being. This orientation, we argue, inherently engages questions of time and temporality. Prior work has shown that temporal frameworks shape how futures are imagined, which problems are understood to be worth attending to, and which solutions or alternatives are pursued. We begin this paper with author observations of time in their lived experience, and then extend these observations to the LIMITS community. Through a systematic literature review of the last decade of LIMITS scholarship, we identify ways that explicit attention to how concepts of time and temporality are understood would enrich LIMITS scholarship. Within the LIMITS scholarship that does engage with time, we identify five recurring types of temporal engagement: computing time, methodological and design time, politics and ethics of time, biological and ecological time, and afterlife and waste time. Together, these engagement types highlight how implicit assumptions about time are embedded across research practices, design approaches, and accounts of technological impact within LIMITS work. We discuss these findings in relation to cross-disciplinary scholarship that takes time as an analytic concern and consider how these patterns point to a broader need for more explicit, plural, and situated engagements with time in the LIMITS community, and why this matters for the community's commitments.

MIDSim: Simulating Multi-Channel Information Diffusion in Social Media with LLM-Powered Multi-Agent System

Lexi Liu, Qi Cao, Yuanhao Liu, Huawei Shen, Xueqi Cheng — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13140v1 Announce Type: new Abstract: Information diffusion in social media shapes public opinion and collective behavior, making its modeling and simulation an important research problem. Existing studies have investigated information diffusion through epidemic-based, cascade-based, and point process models. However, they predominantly focus on diffusion through social links, overlooking other diffusion channels enabled by platform algorithms (e.g., recommender systems) and failing to capture user behavioral complexity. To address these limitations, we propose an LLM-powered multi-agent system for simulating multi-channel information diffusion, where large language models instantiate personalized user agents and the diffusion process jointly models social and algorithmic exposure streams. We further construct three real-world diffusion datasets spanning Sina Weibo, RedNote, and Twitter, containing diffusion records, user profiles, historical posts, and social relationships. Experimental results on real diffusion events show that our proposed framework realistically simulates macro diffusion phenomena and generates diverse comment content, significantly outperforming baselines.

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13141v1 Announce Type: new Abstract: Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of $\langle$query, evidence chunk, answer$\rangle$ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.

HyPE: Category-Aware Hypergraph Encoding with Persistent Edge Embeddings for Persona-Grounded Dialogue

Sangwon Youn, Yoonjin Jang, Youngjoong Ko — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13142v1 Announce Type: new Abstract: Persona-grounded dialogue systems aim to produce responses consistent with a speaker's persona, yet existing methods treat personas as a flat set of sentences and fail to model the high-order relations among persona attributes, e.g., that several persona sentences share a topical category. We propose HyPE (Hypergraph Persona Encoder), a framework that (i) analyzes each persona-bearing text as a (Core, Expression, Sentiment, Category) quadruple, and (ii) organizes persona elements into a hypergraph whose hyperedges are induced by shared category labels. A HyperGCN hypergraph neural network propagates this structure into a persona summary vector and a soft-memory bank that condition the response generator. We further propose Persistent Edge Embeddings (PEE), lightweight per-category learnable priors fused into the HyperGCN message-passing step. On PersonaChat under greedy decoding, HyPE consistently outperforms sentence-level pooling baselines across GPT-2, LLaMA-3.2-3B, and Qwen2.5-3B backbones, demonstrating that structured hyperedge-level persona encoding provides a transferable advantage across model scales.

The Clustering Strikes Back: Building Cost-Effective and High-Performance ANNS at Scale with Helmsman

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13145v1 Announce Type: new Abstract: RedNote (a.k.a., Xiaohongshu, a global-scale social network platform) widely adopts approximate nearest neighbor search (ANNS) to power its search, recommendation, and advertising services. Due to the demanding Service Level Agreements (SLAs), we have to rely on in-memory graph-based ANNS (i.e., HNSW) to provide high throughput and low latency. However, the ever-growing user base and content volume have led to an explosive increase in memory footprint and consequently huge CapEx and OpEx. After exploring various alternatives, we find that building a clustering-based ANNS on top of all-flash servers can be promising. Yet, we still experience severe overheads from the kernel I/O stack, a fixed pruning strategy, and slow index construction. We present HELMSMAN, a high-performance and cost-effective clustering-based ANNS system, which combines an ANNS-oriented userspace storage stack, a leveling-learned pruning module, and GPU-accelerated pipelines of construction. HELMSMAN saves over 90% of hardware costs and enables billion-scale index (re)builds within hours. In the current production deployment, operating stably for several months, 40 machines now host ANNS workloads that previously required about 35,000 cores and 0.35 PB DRAM.

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13148v1 Announce Type: new Abstract: Climate and environmental decision-making increasingly requires reasoning across heterogeneous inputs, including gridded physical data, satellite imagery, geospatial context, and simulator outputs. Weather and climate foundation models can forecast well, but do not reason interactively in language, while large language models (LLMs) reason in language but cannot operate directly on high-dimensional Earth-system data. As a result, real scientific workflows in Earth-science remain underserved. We introduce TerraBench, a benchmark for grounded Earth-science reasoning, built on TerraAgent, a ReAct-style executable framework that interleaves reasoning, tool calls, and observations to couple LLM planning with scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation. TerraBench unifies analysis of Earth observation imagery, gridded data, GIS reasoning and simulation in a single executable interface, whereas prior benchmarks isolate these capabilities into narrow individual tasks. It is also the first in this space to pair process-level tool-use metrics with tolerance-aware numeric scoring. The benchmark comprises 403 extensive agentic tasks across three tracks (Fundamentals, Simulator-Grounded, and Document-Grounded Verification) and eight application domains with 24,500 verified execution steps. These results indicate that reliable Earth-science agents must go beyond tool access to coordinate heterogeneous workflows, parameterize tools precisely, and preserve artifact provenance.

Touchard-Riordan Polynomials and Schur-positivity of Set Partitions

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13149v1 Announce Type: new Abstract: A symmetric function is called Schur-positive if it admits an expansion in the Schur basis with nonnegative coefficients. In this paper, we study the Schur-positivity of symmetric functions naturally associated with set partitions, with respect to a descent set function that considers i a descent if i and i+1 share a block in the partition. The Schur expansion involves hook-shaped Young diagrams, and the corresponding coefficients are given by Touchard-Riordan polynomials, which enumerate matchings by their number of crossings.
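
The statement that Touchard-Riordan polynomials enumerate matchings by their number of crossings can be checked by brute force for small sizes. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, and the enumeration is exponential, so it is suitable only for small n.

```python
from collections import Counter


def perfect_matchings(points):
    """Yield every perfect matching of an even-length list of points
    as a list of arcs (a, b) with a < b."""
    if not points:
        yield []
        return
    first, rest = points[0], points[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def crossings(matching):
    """Count pairs of arcs (a, b), (c, d) that cross, i.e. a < c < b < d."""
    return sum(
        1
        for (a, b) in matching
        for (c, d) in matching
        if a < c < b < d
    )


def crossing_distribution(n):
    """Coefficient list [c_0, c_1, ...] where c_j is the number of perfect
    matchings of 2n points on a line with exactly j crossings."""
    counts = Counter(
        crossings(m) for m in perfect_matchings(list(range(2 * n)))
    )
    return [counts[j] for j in range(max(counts) + 1)]
```

For n = 3 the list is `[5, 6, 3, 1]`, i.e. the polynomial 5 + 6q + 3q^2 + q^3, whose constant term is the Catalan number counting noncrossing matchings.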

Random Generation of $k$-coloured Motzkin Paths

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13151v1 Announce Type: new Abstract: We study k-coloured Motzkin paths, namely Motzkin paths in which horizontal steps can be coloured in k different ways, and investigate their connection with the number of prefixes ending at odd height from both an analytical and a combinatorial point of view. Moreover, the combinatorial approach provides a random generation algorithm for k-coloured Motzkin paths in linear time.
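
To make the objects concrete, the following sketch counts k-coloured Motzkin paths of a given length by dynamic programming over the current height. It is a counting aid only, not the paper's linear-time random generation algorithm, and the function name is hypothetical.

```python
def coloured_motzkin_count(n, k):
    """Number of length-n paths made of up steps, down steps, and level steps
    in one of k colours, that never go below height 0 and end at height 0."""
    # heights[h] = number of valid prefixes that currently end at height h
    heights = [1] + [0] * n
    for _ in range(n):
        nxt = [0] * (n + 1)
        for h, ways in enumerate(heights):
            if ways == 0:
                continue
            nxt[h] += k * ways        # level step, k colour choices
            if h + 1 <= n:
                nxt[h + 1] += ways    # up step
            if h > 0:
                nxt[h - 1] += ways    # down step
        heights = nxt
    return heights[0]
```

With k = 1 this gives the Motzkin numbers 1, 1, 2, 4, 9, 21, and with k = 2 it gives the shifted Catalan numbers 1, 2, 5, 14, 42.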

Fibonacci and Catalan Numbers Meet in Staircase Polyominoes

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13152v1 Announce Type: new Abstract: We study Fibonacci (staircase) polyominoes, a class of column-convex polyominoes whose lower boundary is a staircase with unit vertical steps. We derive multivariate generating functions that refine Turban's Fibonacci-number enumeration by tracking additional perimeter and area parameters. The proofs use a catalytic functional equation and, in a perimeter specialization, the kernel method, leading to explicit closed forms and Catalan-number coefficient formulas.

Snake Polyominoes of Maximal Area in a Rectangle

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13155v1 Announce Type: new Abstract: Given a discrete rectangle R of dimensions h x w, let W be the set of snake-like polyominoes contained in R represented as binary matrices, i.e. polyominoes whose underlying simple graph is a chain with respect to the 4-adjacency relation. We present an algorithm that generates W for any h and w. Also, let a be the maximal area that can be realized by an element of W. We provide exact formulas for a for h <= 5 and any w.
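
The definition can be made concrete with a brute-force search for the maximal snake area in a small rectangle. This sketch is not the paper's generation algorithm: it assumes "chain" means that the 4-adjacency graph of the cells is a simple path (so no two non-consecutive cells may touch), the function name is hypothetical, and the search is exponential.

```python
def max_snake_area(h, w):
    """Largest number of cells of a snake polyomino inside an h x w rectangle,
    found by depth-first extension of a path from one endpoint."""

    def neighbours(cell):
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                yield (nr, nc)

    best = 0

    def extend(path, used):
        nonlocal best
        best = max(best, len(path))
        head = path[-1]
        for nxt in neighbours(head):
            if nxt in used:
                continue
            # The new cell may touch the current head only; touching any
            # other cell of the path would create a cycle or a branch.
            if any(nb in used and nb != head for nb in neighbours(nxt)):
                continue
            used.add(nxt)
            path.append(nxt)
            extend(path, used)
            path.pop()
            used.remove(nxt)

    for r in range(h):
        for c in range(w):
            extend([(r, c)], {(r, c)})
    return best
```

For example, a 3 x 3 rectangle admits a snake of 7 cells (the outer ring with one cell removed) but not 8.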

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

Animesh Tripathy, Aswanth Krishnan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13156v1 Announce Type: new Abstract: Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression comprehension collapses from 79.6% to 48.7% (a 31 percentage point drop), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines through visual feedback. A two-phase training recipe closes the self-correction gap: first, we exploit the base model's own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation; second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: Acc@0.5 rises to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.
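
The IoU reward and the Acc@t metrics mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: boxes are assumed to be `(x1, y1, x2, y2)` tuples with `x1 < x2` and `y1 < y2`, a prediction is assumed to count as correct when its IoU is at least the threshold, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def acc_at(predictions, targets, threshold):
    """Fraction of predictions whose IoU with the target reaches the threshold
    (assumed convention: IoU >= threshold counts as a hit)."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(predictions)
```

In a refinement loop, the same `iou` value could serve as a per-step scalar reward, which is the role the abstract assigns to its "simple IoU reward".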

Entropic Generation of Binary Words

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13157v1 Announce Type: new Abstract: The uniform generation of k Hamming weight binary words, equivalent to sampling k-subsets from n elements, relies on random bits, which can be expensive. We introduce a novel paradigm, random bit recycling, and use it to generate such binary words in linear time while consuming as few random bits as possible. The resulting algorithm is nearly optimal in terms of random bit consumption, meaning that it closely matches the Shannon entropic lower bound coming from information theory.
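
For orientation, the entropic lower bound mentioned above is log2 of the number of words of length n with Hamming weight k, that is, log2 C(n, k) bits. The sketch below computes that bound and, as a plain baseline rather than the paper's bit-recycling algorithm, maps an integer rank in [0, C(n, k)) to the corresponding word in lexicographic order. The names `entropy_lower_bound_bits` and `unrank_word` are hypothetical.

```python
from math import comb, log2


def entropy_lower_bound_bits(n, k):
    """Shannon lower bound (in bits) for sampling a uniform weight-k word of length n."""
    return log2(comb(n, k))


def unrank_word(n, k, rank):
    """Return the rank-th (0-based, lexicographic) binary word of length n with k ones."""
    word = []
    ones_left = k
    for i in range(n):
        remaining = n - i - 1
        # Number of valid completions if this position is set to 0.
        with_zero = comb(remaining, ones_left)
        if rank < with_zero:
            word.append(0)
        else:
            word.append(1)
            rank -= with_zero
            ones_left -= 1
    return word
```

Drawing a uniform rank and unranking it yields a uniform word, but generating that rank from fair bits typically needs rejection or big-integer arithmetic, which is the kind of overhead the abstract's approach aims to avoid.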

On the Counting Sequence of Z-convex Polyominoes

Luca Castelli (University of Insubria), Paolo Massazza (University of Insubria) — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13158v1 Announce Type: new Abstract: The degree of convexity of a convex polyomino P is the smallest integer k such that any two cells of P can be joined by a monotone path inside P with at most k changes of direction. In this paper, we present a set of formulas and equations that are the basis of a C++ program that computes the longest counting sequence known to date (with respect to the area) of convex polyominoes of degree of convexity at most 2.

The Curious Case of Reversible Elementary Second Order Cellular Automaton 115

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13159v1 Announce Type: new Abstract: We prove that the reversible elementary second order cellular automaton rule 115 is periodic when started on finite initial configurations. We also study some families of finite configurations that have interesting period functions.
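
A minimal sketch of the standard second-order construction, assuming the usual convention that the next state is the elementary rule applied to the current neighbourhood, XORed with the previous state of the cell. The sketch uses a cyclic ring of cells for simplicity, which is an assumption; the paper's setting of finite initial configurations may differ, and on a finite ring periodicity already follows from reversibility. The names `step` and `step_back` are hypothetical.

```python
def step(prev, curr, rule=115):
    """One forward step of a second-order elementary CA on a ring.

    Returns the new (prev, curr) pair, i.e. (curr, next).
    """
    n = len(curr)
    nxt = []
    for i in range(n):
        # Wolfram numbering: neighbourhood (left, centre, right) indexes a bit of the rule.
        idx = 4 * curr[(i - 1) % n] + 2 * curr[i] + curr[(i + 1) % n]
        nxt.append(((rule >> idx) & 1) ^ prev[i])
    return curr, nxt


def step_back(curr, nxt, rule=115):
    """Recover (prev, curr) from (curr, next); the XOR makes the dynamics invertible."""
    # Applying the same rule to curr and XORing with next cancels the rule term.
    _, prev = step(nxt, curr, rule)
    return prev, curr
```

Because the XOR with the previous state cancels, running the rule with the roles of past and future swapped recovers the earlier configuration exactly, whichever elementary rule is used.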

(Un)ranking Permutation Classes

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13160v1 Announce Type: new Abstract: Permutations avoiding a pattern of length three are enumerated by the Catalan numbers. In this work, we present methods for ranking and unranking such permutations in lexicographic or colexicographic order.
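
As a brute-force reference point (not the ranking method of the paper), the sketch below checks the Catalan count for permutations avoiding the pattern 123 and ranks such a permutation by enumerating all avoiders in lexicographic order. The choice of 123 as the pattern and the names `avoids_123`, `avoiders`, `rank_lex`, and `unrank_lex` are assumptions made for illustration.

```python
from itertools import combinations, permutations
from math import comb


def avoids_123(p):
    """True if no three positions i < j < k have p[i] < p[j] < p[k]."""
    return not any(p[i] < p[j] < p[k] for i, j, k in combinations(range(len(p)), 3))


def avoiders(n):
    """All 123-avoiding permutations of 1..n, in lexicographic order."""
    # itertools.permutations yields tuples in lexicographic order for sorted input.
    return [q for q in permutations(range(1, n + 1)) if avoids_123(q)]


def rank_lex(p):
    """0-based lexicographic rank of p among 123-avoiders of the same length."""
    return avoiders(len(p)).index(tuple(p))


def unrank_lex(n, r):
    """The 123-avoider of length n with 0-based lexicographic rank r."""
    return avoiders(n)[r]


def catalan(n):
    return comb(2 * n, n) // (n + 1)
```

This enumeration takes factorial time, so it is only useful as a correctness oracle for small n when testing a faster ranking or unranking routine.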

Exhaustive Generation of Genus-One Knot and Link Diagrams via Maps on the Torus

Alexander Omelchenko — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13161v1 Announce Type: new Abstract: We present an algorithmic framework for the exhaustive generation and tabulation of knot and link diagrams on the thickened torus T^2 x I, based on the theory of maps on surfaces. Cellular 4-regular torus projections are encoded by permutation pairs (alpha, sigma), and unsensed equivalence classes are enumerated completely and without duplication via canonical representatives. Crossing assignments, local diagram-level reductions, and the generalized Kauffman-type bracket are formulated entirely within the same permutation model. The pipeline is validated against published genus-one classifications for crossing numbers N <= 5 and then extended to N = 6, 7, 8, producing, to our knowledge, the first complete genus-one tabulation at these crossing numbers under the stated comparison conventions. The resulting dataset contains more than 33,000 knot and link types. Besides the tables, the computation yields proved structural facts, including a parity statement for the a-span of the bracket and a sharp upper bound N-1 for the number of bigon faces in a 4-regular torus map. It also suggests several conjectures, among them a formula for the maximum number of straight-ahead components, the absence of equi-quadrilateral knot projections, and a 4N upper bound for the genus-one bracket span.

A Class of Multiparameter Signless Stirling Numbers of the First Kind and their $q$-Analogues

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13163v1 Announce Type: new Abstract: In this work, we provide a probabilistic derivation of a class of multiparameter signless Stirling numbers of the first kind and their q-analogues, and study the associated multivariate discrete distributions.

When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals

Aydin Javadov — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13168v1 Announce Type: new Abstract: Block Attention Residuals (Block AttnRes) replace fixed additive residuals with a learned softmax over earlier depth-source representations, surfacing cross-layer routing as an inspectable tensor in the forward pass. This is a tempting interpretability target: information flow normally inferred indirectly is now directly observable. We ask whether such exposure suffices for mechanistic interpretation. We probe two same-scale ($0.6$B) Block AttnRes checkpoints under identical routing-ablation interventions: a vanilla Qwen3 inference-wrapped through a deterministic recency-bias schedule that the codebase admits as a routing-equivalent loading path, and a Block AttnRes Qwen3 trained from scratch with routing as part of optimisation. The wrapped baseline's routing weights are content-independent and reproduce the schedule's analytic prediction. The trained AttnRes checkpoint instead exhibits three localised routing motifs: an embedding-source pathway through early-layer MLP, a current-state pathway through early-layer attention and MLP, and an older-history pathway through late-layer attention. Beyond this stratification, we find a sharp dissociation between average routing mass and causal importance: in both sublayers, the largest mass slice is not the largest causal contribution, and one source family carries appreciable mass with no detectable causal role under intervention. Architectural exposure of routing is therefore necessary but not sufficient for mechanistic interpretation: structured depth routing emerges only when routing has been part of training, and even then, descriptive routing summaries should be treated as candidate hypotheses to be tested by causal interventions, not as evidence of mechanism in their own right.
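
To make the distinction between routing mass and causal importance concrete, here is a toy sketch under strong assumptions: each depth source is a plain vector, routing is a softmax over scalar logits, and "ablation" means dropping one source's weighted contribution without renormalising. None of this is taken from the Block AttnRes codebase; `route` and `ablation_effect` are hypothetical names.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(sources, logits):
    """Softmax-weighted mix of earlier depth-source vectors; returns (mixed, weights)."""
    weights = softmax(logits)
    dim = len(sources[0])
    mixed = [sum(w * src[d] for w, src in zip(weights, sources)) for d in range(dim)]
    return mixed, weights


def ablation_effect(sources, logits, k):
    """L2 norm of the output change when source k's contribution is removed."""
    _, weights = route(sources, logits)
    return math.sqrt(sum((weights[k] * x) ** 2 for x in sources[k]))
```

In this toy setting a source can receive half of the routing mass yet have zero ablation effect if its vector is zero, which is one simple way that mass and causal contribution can come apart.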

Redesigning Regularization for Effective Policy Smoothing

Taisuke Kobayashi, Naoto Yamanaka — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13169v1 Announce Type: new Abstract: This paper proposes a novel regularization design to effectively smooth policy functions in reinforcement learning. While regularization that enhances "global" Lipschitz continuity was initially considered, it has been limited to "local" Lipschitz continuity due to a tradeoff between smoothness and expressiveness. However, it has become apparent that the original implementation is cumbersome and does not provide sufficient smoothing, leading to a preference for simpler implementations. This stems from a discrepancy between theory and implementation, and a more appropriate implementation can be expected to facilitate smoothing. Therefore, this paper identifies three reasons why the original implementation does not function adequately and provides remedies for them. This modified regularization performs well across multiple tasks and algorithms, successfully achieving smooth motion while improving control performance. Furthermore, by applying it to sim-to-real reinforcement learning for a quadruped robot, it is demonstrated that smooth motion provides robustness against sudden changes in target velocity commands.

NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13171v1 Announce Type: new Abstract: The rapid updates of online news make tracking event developments challenging, highlighting the need for timeline summarization (TLS). Hallucinations, where LLM-generated content deviates from source news, remain a critical issue in LLM-based TLS and are not well studied in existing work. To bridge this gap, we identify two primary types of hallucinations: unfaithful content during news summarization and information omission in date-event summarization. Then, we propose NTS-CoT, a novel framework that leverages Chain-of-Thought (CoT) reasoning to mitigate hallucinations in TLS. The framework consists of three key modules: i) Element-CoT to capture essential news elements for faithful summarization, ii) Date Selection to combine temporal saliency and event prominence for timestamp selection, and iii) Causal-CoT to infer causal relationships and reduce omissions in date-event summarization. Extensive experiments, including quantitative analysis on three TLS benchmarks and human evaluation, demonstrate that NTS-CoT outperforms state-of-the-art baselines, effectively mitigating hallucinations and improving LLM-based TLS performance. Our source code is available at https://anonymous.4open.science/r/NTS-CoT .

Detecting Explanatory Insufficiency in Learned Representations: A Framework for Representational Vigilance

Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13172v1 Announce Type: new Abstract: Learned representations are central to modern machine learning and are commonly evaluated through predictive performance, robustness, uncertainty estimation, or generalization. However, a learned representation may remain operationally successful while progressively failing to organize persistent residual structures that are not fully captured by conventional evaluation metrics. This article introduces VER, the Vigilant Evaluator of Representations, a conceptual framework for monitoring representational adequacy in learned representations. VER does not propose a new learning algorithm, loss function, or model architecture. Instead, it formalizes a diagnostic process through which persistent residual structures may be identified, analyzed, and interpreted as potential indicators of explanatory insufficiency. The framework distinguishes representational inadequacy from ordinary prediction error, uncertainty, noise, and distribution shift. It introduces a monitoring sequence based on representation identification, explanatory-domain delimitation, residual-structure detection, explanatory-resistance evaluation, and vigilance signaling. VER is intended as a contribution to representation diagnostics in machine learning. Its objective is not to replace existing evaluation methods but to complement them by treating representational adequacy as an explicit object of inquiry. A path toward empirical evaluation through representational-vigilance benchmarks is also outlined.

Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13174v1 Announce Type: new Abstract: Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.

The End of Code Review: Coding Agents Supersede Human Inspection

Martin Monperrus — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13175v1 Announce Type: new Abstract: Code review has been the primary quality gate in software development since Fagan formalised code inspection in 1976. For five decades, having a human examine and comment on a colleague's changes before merge has been a cornerstone practice at organisations of every size. Coding agents are large language model (LLM)-based autonomous systems capable of reading, writing, testing, and repairing software. We argue that coding agents have crossed a threshold of capability at which traditional human code review is no longer a necessary component of a software quality pipeline. Our argument rests on two claims: every stated goal of code review can be served by agents at lower cost and higher throughput; the naive integration in which agents write code and humans remain the mandatory reviewers is a dead end because it neither provides meaningful assurance nor scales with AI-assisted throughput.

Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

Xin Wang, Boyan Gao, Yibo Yang, David A. Clifton — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13176v1 Announce Type: new Abstract: Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for mental health assessment. However, existing general-purpose post-training methods do not align with the cognitive processes of human assessment, which may lead to unreliable reasoning outcomes. To bridge this gap, we propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework tailored for the mental health domain. CRPO extends group relative policy optimization by integrating stage-dependent uncertainty modeling into the policy optimization process. Specifically, we introduce a stage-wise entropy regularization mechanism that encourages broad exploration in early reasoning phases and progressively enforces confident decision-making in later stages, mimicking the human cognitive shift from uncertainty to certainty. In addition, inspired by cognitive appraisal theory, we formalize cognitive reasoning stages, thereby guiding theory-grounded interpretable inference. Experiments on 8 mental health datasets show that CRPO achieves an average improvement of 10.4 percentage points in weighted F1-score over the best reinforcement learning baseline. Furthermore, the CRPO-trained model Mental-R1 demonstrates clear advantages compared with existing large language models on reasoning-intensive cases, suggesting that CRPO enhances reasoning capabilities for mental health assessment.

MemRefine: LLM-Guided Compression for Long-Term Agent Memory

Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13177v1 Announce Type: new Abstract: Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills with redundant entries that inflate storage cost and degrade retrieval by crowding out the most useful evidence. This is especially limiting on resource-constrained platforms with hard memory budgets, motivating us to formulate storage-budgeted memory management, the task of keeping an already constructed memory store within a fixed budget while preserving information useful for future interactions. To this end, we propose MemRefine, an LLM-guided framework that, since surface similarity poorly reflects factual value, uses similarity only to propose candidate pairs and defers delete, merge, and preserve decisions to an LLM judge based on factual content, iterating until the budget is met. Across multiple memory frameworks and long-term conversation benchmarks, MemRefine consistently meets target budgets while preserving downstream performance and outperforming rule-based baselines under tight budgets.

Loss-Shift Transfer via Bayes Quotients

Vasileios Sevetlidis — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13178v1 Announce Type: new Abstract: Transfer learning is usually studied as a consequence of distribution shift. This paper identifies an orthogonal failure mode in which the data distribution is fixed and the loss changes. This setting is called loss shift. A loss determines which information in $X$ is Bayes-relevant, and two losses may therefore require different representations even under the same joint law $P(X,Y)$. The idea is formalized using Bayes quotients, which allow losses to be ordered by refinement. In the Bayes-quotient formulation, strict refinement gives an immediate qualitative obstruction. A source-minimal representation for a coarser loss is insufficient for a strictly finer target loss. For finite-output log loss, this obstruction becomes an exact quantitative identity. The excess risk is the conditional information about $Y$ discarded by the representation. Experiments in controlled, learned, synthetic-image, and real-image settings show the predicted effect, i.e., classification-equivalent representations can have different optimal log-loss performance under a fixed data distribution.
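
The log-loss identity described above presumably takes the standard information-theoretic form sketched here. The notation is an assumption: $Z = g(X)$ denotes the representation, and $R^*_{\log}(\cdot)$ the Bayes-optimal log-loss risk given the indicated input. The second line uses $H(Y \mid X, Z) = H(Y \mid X)$, which holds because $Z$ is a function of $X$.

```latex
% Assumed notation: Z = g(X) is a deterministic representation of X.
\begin{align*}
R^*_{\log}(X) &= H(Y \mid X), \qquad R^*_{\log}(Z) = H(Y \mid Z), \\
R^*_{\log}(Z) - R^*_{\log}(X) &= H(Y \mid Z) - H(Y \mid X, Z) = I(X; Y \mid Z) \;\ge\; 0 .
\end{align*}
```

Read this way, two representations that induce the same optimal classifier can still differ in $I(X; Y \mid Z)$, which is consistent with the classification-equivalent but log-loss-inequivalent behaviour the abstract reports.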

Modern analog computing for solving differential and matrix equations

Zhong Sun, Piergiulio Mannocci, Manuel Le Gallo, Abu Sebastian — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13179v1 Announce Type: new Abstract: In recent years, driven by the computational demands of data-intensive applications such as artificial intelligence and scientific computing, analog computing has gained renewed interest. Given the diversity of computational tasks and recent advancements in analog CMOS circuits and resistive memory technologies, we refer to the evolving landscape as modern analog computing. In this context, we identify three core computational primitives: solving differential equations, solving matrix equations, and performing matrix-vector multiplications, and we explore the connections among them. We also examine various hardware implementations of these analog computing operators, including those built with discrete components, integrated circuits, and resistive memory devices. Among these, resistive memory arrays emerge as particularly promising due to their implementation efficiency. The paper then surveys recent progress in leveraging modern analog computing to solve differential and matrix equations using both advanced analog CMOS circuits and resistive memory arrays. Finally, we discuss the applications of these circuits, the precision and scalability issues and their potential solutions, the relationship with in-memory computing, and the unique computational complexity of analog computing. This paper provides a unified perspective on analog computing, highlighting its strengths, current developments, and challenges, and positioning it as a pivotal enabler of next-generation computational frontiers.

JiRAIYA: A Reputation-Based Hierarchical Federated Learning Framework on Web3

Venkata Raghava Kurada, Pallav Kumar Baruah — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13180v1 Announce Type: new Abstract: Federated Learning (FL) is predominantly deployed in enterprise environments, where limited transparency and restricted auditability hinder broader adoption. Existing FL systems often suffer from opaque aggregation processes, making it unclear which model updates are accepted or discarded. Current mitigation strategies typically rely on external validators, introducing additional computational and communication overhead. In this paper, we propose a novel FL framework that leverages existing Web3 technologies to enhance transparency, trust, and auditability throughout the training process. The framework adopts a hierarchical architecture in which delegated managers orchestrate the FL training process within their respective federations. To mitigate adversarial and poisoning attacks, a combination of novelty detection and consensus mechanisms is employed. Model updates are encoded and broadcast to all managers, who independently evaluate their validity, and the model updates approved by consensus are incorporated into the global model. Additionally, a reputation-score-based backup mechanism is employed to ensure model generation. Extensive experiments conducted under real-world scenarios demonstrate the effectiveness and resilience of the proposed framework, highlighting its potential to enable transparent FL beyond traditional enterprise settings.

Sketching Intersection Profiles: A Simple Proof and Three Applications

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13182v1 Announce Type: new Abstract: In this work we settle the complexity of three sketching problems. (i) We show that sketching vertex neighborhood sizes in graphs requires $\Omega(n^2)$ bits, standing in sharp contrast to the $\tilde{O}(n)$ complexity of sketching edge cuts. (ii) We obtain tight lower and upper bounds of $\tilde{\Theta}(n^2)$ for sketching coverage functions with additive and multiplicative errors. (iii) We prove an $\Omega(n^2)$ lower bound for sketching Random Utility Models under the $\ell_\infty$-norm, improving upon the previous $\Omega(n \log n)$ bound and matching a known upper bound to within logarithmic factors. These bounds are obtained through a connection with the problem of sketching the intersection profile of a distribution $D$ on $2^{[n]}$. Specifically, we seek a succinct data structure that, for any query set $S \subseteq [n]$, approximates the quantity $\Pr_{T \sim D}[T \cap S \neq \varnothing]$ to within a small constant additive error. One can obtain lower bounds for this latter problem directly from known results about the itemset frequency estimation problem in databases for which tight bounds are known. As an additional contribution, we also provide an alternative proof for the intersection profile sketching lower bound, in the setting in which the accuracy parameter is constant. This proof relies solely on elementary probability avoiding the heavier machinery used in previous proofs.
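
To fix ideas, the quantity being sketched can be computed exactly when the distribution $D$ is given explicitly. The snippet below is only that exact baseline, not a sketching data structure; it assumes $D$ is represented as a dictionary from `frozenset` to probability, and the name `intersection_profile` is hypothetical.

```python
def intersection_profile(dist, query):
    """Exact Pr_{T ~ D}[T intersects S] for an explicitly given distribution.

    dist: dict mapping frozenset (a subset of [n]) to its probability.
    query: any iterable of elements, the query set S.
    """
    s = frozenset(query)
    # Sum the probability mass of every set T that shares an element with S.
    return sum(p for t, p in dist.items() if not s.isdisjoint(t))
```

Storing $D$ explicitly can take space exponential in $n$, so the question addressed by the bounds above is how few bits suffice when answers need only be accurate to within a constant additive error.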

LAUKIN: A Multi-jurisdictional Common Law Contract Dataset

Amrita Singh, Aditya Joshi, Jiaojiao Jiang, Hye-young Paik, May Fong Cheong — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13184v1 Announce Type: new Abstract: Multinational companies increasingly require cross-jurisdictional contract review, yet existing legal NLP datasets are largely restricted to a single jurisdiction. We introduce LAUKIN (Legal equivalence dataset of Australia, UK, and INdia), a dataset of clause pairs (AU-UK, UK-IN, IN-AU) labelled for boolean legal equivalence. We develop a novel multi-stage retrieval and reranking pipeline to construct the initial clause pair mapping, with a subset of clause pairs subsequently annotated by legal experts as Equivalent or Not Equivalent. The dataset comprises 14,727 clause pairs from 204 contracts across 8 agreement types, of which 3,000 are manually labelled: 900 train, 600 dev, and 1,500 test. We evaluate 12 models across 4 techniques, achieving a best macro-F1 of 65.11%, establishing LAUKIN as a challenging benchmark. Results reveal that, despite shared legal heritage, drafting conventions diverge significantly across jurisdictions, making cross-jurisdictional equivalence classification non-trivial. LAUKIN also includes 11,727 unlabelled training pairs to support future semi-supervised learning research in legal NLP.

Lyapunov Stability and Optimal Error Estimates for an SIPG Method for Weakly Damped Semilinear Wave Equations

Ajeet Singh, Abhinav Jha — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13185v1 Announce Type: new Abstract: We develop and analyze a fully discrete scheme for the weakly damped semilinear wave equation that combines a Symmetric Interior Penalty Discontinuous Galerkin (SIPG) spatial discretization with a hybrid Crank-Nicolson/second-order Backward Differentiation Formula (CN-BDF2) time integrator. A chord-slope linearization of the nonlinear reaction term is employed, which preserves an exact discrete gradient structure and, crucially, requires no global Lipschitz continuity assumption on the nonlinearity. Stability of the fully discrete solution is established through a Lyapunov-based analysis, rather than spectral arguments, by constructing a discrete Lyapunov functional that yields existence, uniqueness, and uniform boundedness of the numerical solution. Under standard regularity assumptions, optimal a priori error estimates of order $\mathcal{O}(h^{k}+\tau^{2})$ in the DG energy norm and $\mathcal{O}(h^{k+1}+\tau^{2})$ in the $L^{2}$-norm are proved, where $h$ is the mesh size, $\tau$ the time step, and $k$ the polynomial degree. Numerical experiments on two-dimensional problems with linear, cubic, and trigonometric nonlinearities confirm the theoretical convergence rates and illustrate the long-time energy-dissipation properties guaranteed by the Lyapunov structure.

A Context-Aware Dataset for Stance Detection in Bioethical Controversies on Reddit

Hu Huang, Genan Dai, Fuqiang Niu, Yi Yang, Zhaoya Gong, Bowen Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13187v1 Announce Type: new Abstract: Bioethical debates increasingly unfold on social media, yet stance detection research lacks large-scale, domain-specific resources for modeling such context-dependent discourse. We present BioStance, a context-aware dataset of 39,600 annotated Post-Comment pairs from Reddit bioethical discussions. BioStance covers six controversial targets across three dimensions of bioethical controversy: fundamental value conflicts, individual liberty versus collective responsibility, and technological uncertainty. Each instance preserves hierarchical conversational context and is labeled by three independent annotators using a three-class stance scheme: Favor, Against, and None. The annotations achieve a mean Krippendorff's $\alpha$ of 0.82, indicating substantial reliability. By combining thematic diversity, conversational structure, and high-quality human annotation, BioStance supports research on context-aware stance detection, argument mining, and computational analysis of bioethical discourse.

Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13188v1 Announce Type: new Abstract: Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow -- segmenting the image, running Marching Cubes, and then manually cleaning up the result -- is time-consuming, inconsistent across operators, and demands specialist knowledge most clinical teams do not have. We take a fundamentally different approach. Instead of treating segmentation and mesh generation as two separate problems, we train a single end-to-end network that goes directly from a raw 3D medical image to a smooth, simulation-ready cardiac surface mesh. The core is a 3D Swin Transformer encoder-decoder that extracts volumetric features from CT or MRI volumes, paired with a Graph Attention Network (GAT) head that iteratively deforms a template mesh to fit the patient's cardiac boundary. We tested on the MM-WHS 2017 benchmark using both CT and MRI. Segmentation scores were competitive (Dice of 0.84 on CT, 0.83 on MRI), but the primary focus is mesh quality: mean Chamfer distance of 1.8 mm, with 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass -- no Marching Cubes, no smoothing filters, no manual cleanup. We argue that for cardiac digital twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores. By removing the post-processing bottleneck, this approach makes patient-specific cardiac simulation substantially more accessible for clinical use.
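
Chamfer distance has several variants in the literature (squared or unsquared distances, sum or mean, one-sided or symmetric), and the abstract does not say which is used. As a minimal sketch, the version below assumes the symmetric mean of unsquared nearest-neighbour distances between two point sets sampled from the meshes; the function name `chamfer_distance` is hypothetical.

```python
from math import dist


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets (assumed variant).

    Averages, in each direction, the distance from every point to its nearest
    neighbour in the other set, then takes the mean of the two directions.
    """
    a_to_b = sum(min(dist(p, q) for q in points_b) for p in points_a) / len(points_a)
    b_to_a = sum(min(dist(q, p) for p in points_a) for q in points_b) / len(points_b)
    return 0.5 * (a_to_b + b_to_a)
```

This brute-force form is quadratic in the number of points; for dense surface samples a spatial index such as a k-d tree would normally replace the inner `min`.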

SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection

Fuqiang Niu, Bowen Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13189v1 Announce Type: new Abstract: Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target-text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability ($\alpha=0.771$). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution-abstention axis rather than removing the high-complexity bottleneck.

Multi-Modal Multi-Agent Robotic Cognitive Alignment enabled by Non-Invasive Consumer Brain Computer Interfaces: A Proof of Concept Exploration

Nataliya Kosmyna, Liz Jenkins, Anoop K. Sinha — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13190v1 Announce Type: new Abstract: While non-verbal behaviors and expressive movements are essential for natural human-robot interaction, existing methods often overlook a crucial element: the human's internal cognitive state. Frequently, proactive multi-agent systems can interrupt humans at inopportune moments, leading to cognitive overload and decreased task performance. This paper introduces a framework for generating "cognitively aligned" multi-agent interactions, enhancing the ability of robotic systems to contextually defer communications to the user of an agent system during moments of high human mental workload and engagement. We present the design and implementation of a closed-loop architecture that explores the interplay between autonomous task execution and real-time neurophysiological focus. Using a consumer-grade Brain-Computer Interface (BCI), our approach continuously monitors Electroencephalography (EEG) spectral band powers while a human performs an engagement-inducing task. We propose an engagement-driven pipeline where an HTTP-based signaling mechanism places a primary agent's sensory inputs and audio outputs into a holding state upon detecting high engagement. This allows secondary agents to seamlessly process complex, delegated tasks in the background. Once the human's cognitive state returns to a lower cognitive load baseline, the primary agent releases the queued agent message. Our preliminary results demonstrate the feasibility of leveraging real-time signal processing, Large Language Models (LLMs), and physical robotic embodiments to create cognitively-aware, non-intrusive multi-agent systems.

The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics

Ryosuke Sakamoto, Kotaro Sakamoto — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13191v1 Announce Type: new Abstract: Continuous-state generative samplers, including diffusion and flow-matching models, evolve through continuous reverse-time dynamics, yet their samples often undergo abrupt qualitative changes: trajectories commit to modes, semantic alternatives collapse, and small perturbations in narrow time windows can produce large downstream effects. This paper develops a geometric account of such phase-transition-like behaviour. We view denoising as gradient descent on a free energy landscape and show that sharp transitions arise near projection caustics, where the nearest-point projection onto the data support ceases to be unique. Motivated by this perspective, we introduce the Critical Boundary Detector (CBD) as a practical diagnostic for score-direction instability. Across toy models, standard diffusion models, and latent text-to-image diffusion models, CBD localises mode commitment, predicts intervention-sensitive windows, and supports targeted control in geometrically sensitive regions. Our results connect the geometry of data with the dynamics of diffusion generation.

Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13192v1 Announce Type: new Abstract: User experience (UX) centered on usability, perceived consistency, and functional clarity is fundamental to real-world user interfaces (UI). The application of multimodal large language models (MLLMs) to user interfaces is evolving rapidly, with examples including visual element grounding, graphical user interface (GUI) agents, and design-to-code generation. However, research on evaluating UX from UI screenshots is still immature. To address this, we propose UXBench, a novel multimodal benchmark consisting of 2,000 VQA data samples designed to assess MLLMs' ability to perform UI-based reasoning. UXBench includes 8 tasks based on real-world UI screenshots that require fine-grained diagnosis of UX issues across layout relationships, visual hierarchy, and content consistency. Our extensive evaluation of mainstream MLLMs shows that they remain fundamentally limited in their capacity for UI-based reasoning. The results underscore the need for further advancements in this area. To bridge this gap, we propose UI-UX, an MLLM based on the Qwen3-VL-4B-Thinking foundation model and enhanced via reinforcement learning with two key innovations: a reward routing mechanism that dynamically balances perceptual understanding and logical reasoning during inference, and an asymmetric transition reward that suppresses redundant or insufficient reasoning steps. Experiments demonstrate that UI-UX achieves state-of-the-art (SOTA) performance on UXBench, attaining an accuracy of 0.7963 -- surpassing Claude-4.5-Sonnet's 0.6550 -- while exhibiting strong generalization across diverse UI tasks and maintaining low inference latency.

WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition

Maximilian Burzer, Tobias King, Till Riedel, Michael Beigl, Tobias Röddiger — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13194v1 Announce Type: new Abstract: Deep learning has become the dominant paradigm in Wearable Human Activity Recognition (WHAR), yet progress is obscured by a comparability crisis. Results are often reported using inconsistent datasets, custom data processing, and varying evaluation protocols, making state-of-the-art claims fragile. We address this with a large-scale, open-source benchmark that integrates 30 diverse datasets under standardized processing, unified model interfaces, and a shared cross-subject evaluation protocol. Evaluating 17 representative architectures across 4760 training runs, we jointly measure predictive performance alongside on-device latency, peak memory, and model size on an Android reference device. Our results reveal that the WHAR state of the art is distributed rather than dominated by a single architecture. While CNN-HAR achieves the highest mean macro-F1, top-performing models cluster tightly, indicating contemporary architectures have converged near a predictive performance ceiling. When accounting for deployment efficiency, compact neural models, such as TinierHAR, and classical Random Forests define the practically relevant Pareto frontier, whereas larger recurrent and hybrid models incur high hardware costs without corresponding performance gains. Consequently, while predictive performance has plateaued, substantial potential for future progress remains in optimizing deployment efficiency and improving adaptation to domain shifts. We release our full framework to support transparent reuse and extension.

Under What Conditions Can a Machine Become Genuinely Creative?

Yong Zeng — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13196v1 Announce Type: new Abstract: Recent AI systems can generate texts, software architectures, hypotheses, designs, and scientific workflows that appear creative. This paper asks under what conditions a machine can become genuinely creative, and how human agency can be preserved within shared cognitive and creative environments. It develops a requirement framework derived from Designics, the science of meaning-bearing intentional change. The paper argues that genuine machine creativity should not be defined by output novelty, current performance, or transient architecture alone. Instead, creativity is understood as the structural transformation of incomplete situations through recursive intervention dynamics. On this view, it depends on ten requirements: environment representation, scoped perception, conflict identification, intervention capability, consequence observation, knowledge and environment update, rescoping, local-to-global unfolding, value-based scoping, and human-AI co-living. These are organized through the three laws of Designics: perception, conflict, and capability. The paper illustrates the computational tractability of these requirements through selected cyber-physical and cyber-biological studies, including recursive element extraction, autonomous mesh generation, and neurophysiological and workload analysis. It then treats open-ended systems, automated discovery frameworks, self-modifying agents, foundation models, and agentic workflows as pressure cases: they demonstrate powerful generative means but do not by themselves establish genuine machine creativity. Finally, the paper argues that proactive AI ethics is internal to genuine machine creativity rather than an after-the-fact filter. Value-based scoping and human-AI co-living must shape how creative machines perceive environments, identify conflicts, select interventions, observe consequences, update knowledge, and rescope future action.

ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

Fuqiang Niu, Bowen Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13197v1 Announce Type: new Abstract: Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR-MAD, a training-free heterogeneous MAD framework that treats debate as conditional computation. ARMOR-MAD combines three components: Pre-debate Agreement Routing (PAR) decides whether independently generated Round-0 answers require debate; Early Agreement Stopping Evaluator (EASE) stops debate after convergence; and Semantic Outlier Detection (SOD) down-weights abnormal final answers during aggregation. Across MATH Level 5, GSM8K, MMLU, and MMLU-Pro, ARMOR-MAD consistently improves over fixed-round heterogeneous debate with the same model pool, reaching 65.5%, 96.5%, 90.0%, and 81.5% accuracy, respectively. The results suggest that genuine model heterogeneity and agreement-based control are both important for making MAD more accurate and efficient.
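As a rough illustration of the agreement-routing idea in this abstract, the Python sketch below skips debate when independently generated Round-0 answers already agree. The function name `needs_debate`, the exact-match comparison after normalization, and the 0.75 threshold are assumptions made for illustration; the abstract does not specify how PAR measures agreement.

```python
from collections import Counter


def needs_debate(round0_answers, agreement_threshold=0.75):
    """Return (debate_needed, majority_answer) for a list of Round-0 answers.

    Hypothetical rule (not taken from the paper): debate is skipped when the
    share of agents giving the most common answer reaches the threshold.
    """
    if not round0_answers:
        raise ValueError("at least one answer is required")
    # Normalize so that trivial formatting differences do not count as disagreement.
    normalized = [answer.strip().lower() for answer in round0_answers]
    majority, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return agreement < agreement_threshold, majority


if __name__ == "__main__":
    print(needs_debate(["42", "42", " 42", "41"]))
    print(needs_debate(["A", "B", "C"]))
```

Treating debate as conditional computation in this way means the cost of extra rounds is paid only for questions on which the agents initially disagree.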

A Minimal Model of Bounded Trade-Off Screening in Multi-Attribute Choice

Manisha Dubey, Anirban Sarkar, Subramanian Ramamoorthy — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13201v1 Announce Type: new Abstract: Human decision-making often involves choosing between multi-attribute alternatives, yet classical models assume fully compensatory utility aggregation despite evidence that people reject options with poor performance on critical attributes. We propose a bounded trade-off reasoning framework in which decisions are governed by a screening process that evaluates the balance between gains and losses across attributes. The model introduces a trade-off tolerance parameter that controls acceptable imbalance and can vary across contexts. Through simulation, we show that this mechanism produces preference patterns that differ from standard utility-based models and captures context-dependent variation in trade-off behavior. These results establish bounded trade-off screening as a plausible computational mechanism for multi-attribute choice and generate testable predictions for future behavioral studies.

Embedding ISO 10218 Safety Compliance in Robots via Control Barrier Functions for Human-Robot Collaboration

Federico Parma, Cesare Tonola, Nicola Pedrocchi, Manuel Beschi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13203v1 Announce Type: new Abstract: Human-Robot Collaboration (HRC) requires strict adherence to safety standards, such as ISO 10218, to prevent harmful interactions. Standard Speed and Separation Monitoring (SSM) filters calculate safe robotic speeds based on conservative assumptions, such as constant human velocity, which prevents accurate predictions of minimum separation distances and causes unnecessary operational halts. This paper proposes a Control Barrier Function (CBF) that explicitly incorporates human acceleration data to analytically forward-predict the minimum human-robot separation distance during a worst-case robotic stopping trajectory. To guarantee safety at the control level, this predictive CBF is integrated as an inequality constraint within a Sequential Quadratic Programming (SQP) framework. Specifically, two methods are proposed: Method I, a CBF-constrained PD safety filter; and Method II, a task-scaling SQP controller that enforces a spatial tube constraint. Simulated and real-world experiments on a UR10e robot evaluate the two proposed methods against a standard industrial SSM module baseline. Results demonstrate that Method II dynamically modulates execution speed and confines spatial deviations. Compared to Method I, Method II achieves a 63% reduction in mean trajectory error and avoids excessive evasive manoeuvres, ensuring high task throughput while complying with ISO 10218 SSM guidelines.

CoDeR: Local Constraint-Compatible Retrieval Beyond Semantic Similarity

Xingkun Yin, Xuebin Tang, Hongyang Du — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13204v1 Announce Type: new Abstract: Information retrieval systems have long treated semantic similarity as a proxy for relevance. For constraint-sensitive queries, this proxy can fail when a document is topically close to the query but supports the opposite constraint direction, such as satisfying an attribute that should be excluded or affirming a relation that should be negated. We study this failure as constraint-violating evidence exposure and propose CoDeR, a local constraint-compatible dense retrieval method that separates topical relevance from constraint compatibility. CoDeR keeps a standard topical encoder for candidate coverage and adds a compatibility scorer, implemented as a bi-encoder, trained with lexical-polarity supervision over contrastive satisfying and violating evidence. The compatibility signal can be used to rescore topical candidates or to retrieve an auxiliary compatibility-oriented candidate set, producing a ranked document list without external Large Language Model (LLM) calls at inference time. We evaluate CoDeR on controlled diagnostics and public negative-constraint retrieval benchmarks. Across three controlled diagnostic sets targeting antonymy, negation, and exclusion, CoDeR reduces V@2 by 20.59, 23.53, and 5.77 points relative to the strongest non-CoDeR baselines, and improves FVR by pushing the first violating document deeper in the ranking.

Visual Place Recognition in Forests with Depth-Aware Distillation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13206v1 Announce Type: new Abstract: Visual place recognition in natural forest environments remains challenging due to repetitive vegetation, weak structural cues, and significant appearance variation across traversals. To address this limitation, this paper proposes a lightweight depth-aware distillation framework that injects geometric cues into a DINOv2-based place recognition model, while maintaining its pre-trained descriptor space. Evaluated on the recent WildCross benchmark, the proposed approach yields gains over an appearance-only counterpart, providing robustness to appearance variations. These results demonstrate the importance of depth as a strong complementary modality for place recognition in natural environments and identify depth-aware distillation as a promising direction for more robust forest perception.

A Polynomial-Decay and Pinhole-Imaging Whale Optimization Algorithm for UAV Relay Communication Deployment

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13208v1 Announce Type: new Abstract: Unmanned aerial vehicle (UAV) relays deliver flexible, on-demand wireless coverage, but jointly tuning the position, altitude, transmit power and bandwidth of the relay is a non-convex, heavily constrained optimization task that easily traps swarm-based optimizers in poor local optima. We propose PWOA, a Polynomial-decay and Pinhole-imaging Whale Optimization Algorithm with three complementary improvements: (i) a Good Nodes Set (GNS) initialization that spreads the initial population uniformly across the search space; (ii) a polynomial nonlinear schedule for the convergence factor that prolongs early exploration and sharpens late exploitation; and (iii) a stagnation-triggered pinhole-imaging opposition-based learning (POBL) operator paired with an elite Gaussian local search, which together escape local optima while refining the leader. On a five-dimensional UAV relay deployment problem with five inequality constraints ($N{=}30$, $T{=}500$, 30 independent runs), PWOA simultaneously attains the lowest Best, Worst, Mean and standard deviation among PWOA, WOA, SCA and IPSO, cutting the mean by $1.4$--$18.5\%$ and the standard deviation by $15$--$87\%$ over the three baselines, and exhibits the fastest average convergence.
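To make the idea of a nonlinear convergence-factor schedule concrete, the Python sketch below compares the linear decay of the convergence factor `a` from 2 to 0 used in the standard Whale Optimization Algorithm with one possible polynomial schedule. The specific form `2 * (1 - (t / T) ** power)` and the default `power=2.0` are assumptions chosen for illustration; the abstract does not state the exact polynomial PWOA uses.

```python
def linear_a(t, T):
    """Standard WOA convergence factor: decreases linearly from 2 to 0."""
    return 2.0 * (1.0 - t / T)


def polynomial_a(t, T, power=2.0):
    """Hypothetical polynomial schedule (an assumption, not the paper's formula).

    For power > 1 the factor stays close to 2 for longer (more exploration
    early on) and then falls quickly toward 0 (sharper exploitation late).
    """
    return 2.0 * (1.0 - (t / T) ** power)


if __name__ == "__main__":
    T = 500  # iteration budget, matching T=500 in the abstract
    for t in (0, 125, 250, 375, 500):
        print(t, round(linear_a(t, T), 3), round(polynomial_a(t, T), 3))
```

Under this assumed form, the polynomial schedule lies above the linear one at every interior iteration, which is one simple way to obtain the "longer exploration, sharper exploitation" behaviour the abstract describes.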

Understanding helpfulness and harmless tension in reward models

Eshaan Tanwar, Pepa Atanasova — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13209v1 Announce Type: new Abstract: Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives and their conflicts remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. We find that mixed-objective models often underperform single-objective models, indicating interference between objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. These neurons causally support their corresponding objectives while often negatively affecting the opposing one. Moreover, a substantial proportion of neurons are shared between helpfulness and harmlessness, and these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Our results provide mechanistic insight into how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

Omar Alshahrani, Muzammil Behzad — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13211v1 Announce Type: new Abstract: AI systems are being deployed across medical imaging faster than their failure modes are understood. At present, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences, for example, for biopsy decisions, staging, and treatment planning. This structured narrative synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions in this study: (1) how can existing taxonomies be unified across modalities; (2) do medical-specialized foundation models hallucinate less than general-purpose ones; and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight? We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

Polar Decoding Tree Pruning Based on Soft Output Extraction

Li Shen, Yongpeng Wu, Wenjun Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13214v1 Announce Type: new Abstract: Although the successive cancellation list (SCL) decoding of polar codes exhibits excellent performance, it retains many decoding paths in the list with negligible contribution to the final output, resulting in high sorting and computational complexity. In this letter, we propose a novel pruning strategy to mitigate the decoding complexity. By leveraging the blockwise soft output extraction process of soft-output SCL and soft-output fast SCL decoding, we provide an accurate approximation of the probability that a decoding path is correct, and thus accordingly prune the paths failing to meet a pre-defined reliability threshold. The complexity reduction achieved by the proposed soft-output-based pruned SCL (SOP-SCL) decoder and its fast version, SOP-FSCL decoder, is significant, without any compromise in error-correction performance. Meanwhile, they also prove to be more efficient than state-of-the-art pruned polar decoders.

Mitigating business risks from renewable PPA power sourcing uncertainties for European green hydrogen production: Robust system design, regulatory adjustments and offtake flexibility

Jonathan Brandt, Astrid Bensmann, Richard Hanke-Rauschenbach — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13215v1 Announce Type: new Abstract: As energy prices surge for the second time in recent years, driven by the ongoing crisis in the Middle East, the European Union's continuing reliance on fossil energy imports is becoming increasingly apparent. However, despite offering an intriguing prospect of improved energy resilience, the ramp-up of local green hydrogen production lags far behind the officially stated ambitions set after the 2022 energy crisis. A prominent reason for the widening implementation gap between announced and realised production projects is overly strict rules on renewable power sourcing, prompting Member States' ministries and the European Commission to propose advancing a planned rules review from 2028 to 2026. To contribute to a successful review and rule adjustments, we address an important gap in understanding the effects of power purchase rules on green hydrogen production. By taking the perspective of European electrolyser operators, we show how the criterion of additionality and its interaction with required temporal correlation can jeopardise the fulfilment of green hydrogen offtake agreements and affect green hydrogen production costs across different European bidding zones. Applying different design paradigms to a green hydrogen production system reveals that electrolyser operator measures, such as PPA and storage upsizing, can help to mitigate the business risks posed by the additionality criterion but come with increased costs. Alternatively, relaxed temporal correlation and increased offtake flexibility both increase production system robustness and reduce production costs simultaneously. Moreover, relaxing temporal correlation rules does not cause emission intensity thresholds to be exceeded, underlining the potential of extended transitional rules to support the ramp-up of European green hydrogen production.

Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization

Mariia Onyshchuk, Maksym-Vasyl Tarnavskyi, Marta Sumyk — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13216v1 Announce Type: new Abstract: Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision. We extend this analysis to all six decoder layers of the Fairseq DE-EN model ($N=3{,}414$), showing that Wass-to-Unif and Wass-to-Data are complementary detectors specialised across hallucination types, that detection is concentrated in layers L1--L4 with L5 anti-predictive for subtler types, and that hallucinated translations lack the exploratory attention phase present in correct translations from the first decoding step. We further evaluate whether the geometric signal transfers to abstractive summarization faithfulness detection: our unsupervised OT detector on AggreFact ($N=1{,}116$) achieves $57.2\%$/$57.6\%$ balanced accuracy on CNN/XSum -- above chance but substantially below supervised MiniCheck-Flan-T5-L ($69.9\%$/$74.3\%$). This gap is principled: unlike NMT hallucinations, unfaithful summaries can attend correctly to source tokens while misrepresenting their content, a failure mode invisible to concentration-based OT metrics by construction. Structural experiments on T5-base confirm consistent decoder organisation across depth, with Layer 3 showing peak concentration and Layer 12 being most critical for generation quality. Together, the results establish OT on cross-attention as a reliable detector when the failure mode is source disengagement, a principled interpretability tool regardless of task, and fundamentally limited when faithfulness failures occur downstream of attention.
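A distance between an attention distribution and the uniform distribution can be sketched in a few lines of Python. The version below assumes a 0/1 ground cost between source positions, under which the Wasserstein-1 distance reduces to total variation; the abstract does not state which ground cost the authors use, and the function name `wass_to_unif` and its input format (total attention mass per source token) are likewise illustrative assumptions.

```python
def wass_to_unif(attention_mass):
    """Distance between a source-attention distribution and the uniform one.

    Assumption: 0/1 ground cost, so Wasserstein-1 equals total variation,
    i.e. half the L1 distance between the two probability vectors.
    """
    n = len(attention_mass)
    total = sum(attention_mass)
    if n == 0 or total <= 0:
        raise ValueError("attention_mass must be non-empty with positive total")
    # Normalize so the input is treated as a probability distribution.
    probs = [mass / total for mass in attention_mass]
    uniform = 1.0 / n
    return 0.5 * sum(abs(p - uniform) for p in probs)


if __name__ == "__main__":
    print(wass_to_unif([0.25, 0.25, 0.25, 0.25]))  # attention spread evenly
    print(wass_to_unif([1.0, 0.0, 0.0, 0.0]))      # attention on one token
```

Because such a score depends only on how attention mass is spread over source tokens, it cannot register a summary that attends to the right tokens but misstates their content, which is the limitation the abstract points out.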

When Similar Means Different: Evaluating LLMs on Arabic--Hebrew Cognates

Junhong Liang, Noor Abo Mokh, Bashar Alhafni — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13218v1 Announce Type: new Abstract: Arabic and Hebrew, as closely related Semitic languages, share a substantial lexicon of true cognates, misleading false friends, and modern loanwords. This overlap poses a challenge for cross-lingual semantic understanding in large language models (LLMs). To evaluate this capability, we introduce SemCog Bench, a curated benchmark of 1,858 Arabic--Hebrew word pairs with sentence-level annotations for cognate identification and semantic disambiguation. We evaluate open-source and commercial LLMs across multiple input representations (raw, diacritized, Romanized, and phonetic) and reveal a critical gap in cross-lingual reasoning. While models achieve high accuracy on true cognates, performance drops sharply on false friends and loanwords, reflecting a strong reliance on surface-form similarity. Furthermore, sentence-level context yields only modest improvements, suggesting that contextual cues alone are insufficient to overcome misleading form-based signals. These findings reveal a fundamental limitation of current LLMs in resolving cross-lingual form--meaning conflicts and establish SemCog Bench as a rigorous benchmark for multilingual semantic reasoning. Our code and data are publicly available.

Embedded Trefftz DG method for steady Navier-Stokes flow. Part II: Nonlinear problem

Paul Stocker, Igor Voulis, Christoph Lehrenfeld, Philip L. Lederer — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13219v1 Announce Type: new Abstract: We develop and analyze an embedded Trefftz-DG method for the steady incompressible Navier-Stokes equations, based on the reduced Oseen discretization from Part I. The main difficulty is that the reduced Trefftz space depends on the convection field, so successive Picard iterates live in different discrete spaces. We address this by constructing projections between convection-dependent Trefftz spaces and using them to control the reduced Oseen solution map. Under suitable resolution and small-data assumptions, we prove existence of discrete solutions, uniqueness, and convergence of the Picard iteration. We also derive an a priori error analysis by relating the method to the underlying DG discretization, thereby inheriting convergence properties from compatible DG Navier-Stokes analyses. Numerical experiments on standard incompressible-flow benchmarks illustrate the theory.

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

Fabrizio Marozzo, Pietro Liò — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13220v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before collecting sufficient evidence. We refer to this behavior as user-driven sycophancy: the tendency of an LLM to reinforce a user-provided hypothesis instead of testing alternative explanations. This paper introduces LLM-as-an-Investigator, an evidence-first agentic AI methodology for robust problem diagnosis. The approach is implemented through a Solution Investigator Agent, which estimates the ambiguity of an initial problem description, generates candidate hypotheses, asks targeted clarification questions, and updates hypothesis probabilities after each answer. Rather than producing an immediate response, the agent continues the investigation until the evidence makes one candidate explanation stronger than the alternatives. To evaluate the approach, we build a benchmark from solved technical forum threads in mechanical, electrical, and hydraulic domains. We use a three-agent evaluation pipeline in which a Problem-Solution Extractor Agent converts solved threads into structured cases, a Ground-Truth Evaluator Agent simulates the user while hiding the known solution, and the tested assistant attempts to recover the solution through dialogue. The experiments compare standard assistants, reasoning-oriented LLMs, and the proposed investigator-based model across LLM backbones. In addition to diagnostic accuracy, we analyze how standard assistants follow misleading user hypotheses in diagnostic cases. The results show that the proposed approach identifies the problem more accurately than direct prompting and reasoning-only baselines, while its evidence-first protocol helps reduce user-induced conversational bias.

From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

Bora Kargi, David Salinas — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13221v1 Announce Type: new Abstract: Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors - such as position bias, self-preference, or intransitivity - that can strongly miscalibrate the resulting rankings. We quantify the resulting judge-human disagreement at two complementary levels. At the local level, we estimate per-battle uncertainty from the judge's own score differences by propagating calibrated win probabilities rather than hard labels into the Bradley-Terry procedure. This alone provides a drastic improvement to Elo estimation accuracy, bringing LLM-derived ratings within 17.9 Elo MAE of human-derived ones when averaged over 55 held-out models on LMArena. At the global level, we apply split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings across held-out models, producing prediction intervals with distribution-free marginal coverage guarantees that account for irreducible LLM-human disagreement. Together, these two layers yield a low-cost evaluation tool that provides developers with calibrated Elo estimates and honest uncertainty bounds, without access to large-scale human annotations. To facilitate reproducibility, we release our code at https://github.com/kargibora/SoftElo .
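The global-level step described above can be illustrated with a generic split conformal computation. The Python sketch below is not taken from the linked repository: the function names, the use of absolute residuals as the nonconformity score, and the symmetric interval are assumptions. It shows how residuals between LLM-derived and human-derived Elo on calibration models could be turned into a prediction interval for a new model.

```python
import math


def conformal_half_width(calibration_residuals, alpha=0.1):
    """Split-conformal quantile of absolute calibration residuals.

    Each residual is (LLM-derived Elo - human-derived Elo) for one held-out
    calibration model. The returned half-width q gives intervals
    [pred - q, pred + q] with marginal coverage at least 1 - alpha, assuming
    calibration and test models are exchangeable.
    """
    n = len(calibration_residuals)
    if n == 0:
        raise ValueError("need at least one calibration residual")
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(abs(r) for r in calibration_residuals)[rank - 1]


def elo_interval(llm_elo, calibration_residuals, alpha=0.1):
    """Symmetric prediction interval around an LLM-derived Elo estimate."""
    q = conformal_half_width(calibration_residuals, alpha)
    return llm_elo - q, llm_elo + q


if __name__ == "__main__":
    residuals = [5, -10, 15, -20, 25, 30, -35, 40, 45]  # made-up numbers
    print(elo_interval(1200, residuals, alpha=0.5))
```

The guarantee is marginal: it holds on average over calibration and test models rather than for each individual model, which matches the "marginal coverage" wording in the abstract.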

Proprioceptive-visual correspondence enables self-other distinction in humanoid robots

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13222v1 Announce Type: new Abstract: Distinguishing self from others is a prerequisite for social intelligence, yet humanoid robots that increasingly share workspaces with humans still lack this ability. Here we show that a humanoid robot can learn self-other distinction from proprioceptive-visual correspondence, without any identity labels or kinematic models. Once established, this distinction bootstraps a predictive self-model that maps joint configurations to three-dimensional body occupancy, capturing how the robot's body changes with action. In multi-agent scenes involving humans or morphologically identical robots, the system reliably identifies itself, learns a 3D self-model, and supports downstream tasks including target reaching, collision-aware motion planning, and human-to-robot motion retargeting. Together, these results outline a route toward bodily self-representation in robots that act and coordinate alongside others in shared physical environments. Project page: https://euron-zc.github.io/humanoid-self-model/.

Distributional Loss for Robust Classification

Kathleen Anderson, Thomas Martinetz — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13223v1 Announce Type: new Abstract: This paper proposes a novel loss concept for supervised classification tasks. Rather than enforcing a direct mapping from each input sample to a single assigned label, we define an optimization objective over all classifier outputs as a bimodal Gaussian distribution. This softer target formulation implicitly captures class ambiguity, mitigates overfitting, and encourages the learning of more robust decision boundaries, all without requiring additional label information. Experimental results demonstrate consistent improvements in robustness, with particularly pronounced gains in low-data regimes, while requiring only minimal modifications to standard training pipelines.

The QR factorization of a Banded-Plus-Semiseparable Matrix is Computable with Linear Complexity

Tao Chen, Sheehan Olver — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13225v1 Announce Type: new Abstract: We show that the QR factorization of a banded-plus-semiseparable (BPS) matrix is computable in optimal linear complexity with respect to the discretization size by showing that the intermediate stages of a QR factorization as computed using Householder reflection maintain a specific structure which has optimal storage. This theoretical result enables the design of stable, linear-complexity algorithms for solving the associated linear systems. For symmetric BPS matrices, we further show that the $RQ$ product -- central to eigenvalue computations via the QR algorithm -- also preserves the BPS structure, leading to a linear-complexity algorithm for each iteration. Numerical experiments validate the optimal linear complexity, confirm high numerical accuracy, and demonstrate substantial speedups compared with existing hierarchical approaches. The algorithms have been implemented in an open-source Julia package, providing an efficient and accessible platform for practical use.

Multi-Phase Optimization of Shared Charging Infrastructure for Freight Electrification

Joas Kahlert, Jiali Fu, Chengxi Liu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13226v1 Announce Type: new Abstract: The transition to heavy-duty battery electric vehicles requires an efficient and cost-effective deployment of the charging infrastructure, particularly when multiple operators share resources. This paper presents a multi-phase optimization framework for the joint planning of charging stations in a shared network, using high-resolution empirical truck trajectory data from two freight companies with distinct operational characteristics. The model is formulated to minimize the total number of charging stations while ensuring that the predefined electrification targets are met over successive expansion stages. The analysis captures heterogeneity in fleet usage, with one company operating a spatially concentrated network with shorter and more consistent routes, and the other exhibiting more dispersed operations with longer and more variable driving patterns. The results show that early-stage infrastructure deployment primarily supports fleets with concentrated operations, while later expansion phases are essential to accommodate long-haul and geographically dispersed transport demand. Furthermore, shared infrastructure not only enables reductions in redundant investments, but also introduces dependencies where certain fleets rely heavily on the full network to sustain electrified operations. In general, the findings highlight the importance of coordinated and data-driven infrastructure planning, and demonstrate that fleet-specific characteristics strongly influence both infrastructure requirements and electrification outcomes. The proposed framework provides practical insights on how collaborative and phased deployment strategies can enhance the scalability and efficiency of freight transport electrification.

PolyAlign: Conditional Human-Distribution Alignment

L. D. M. S. Sai Teja, Ufaq Khan, Sathira Silva, Xiao Wu, Muhammad Haris Khan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13227v1 Announce Type: new Abstract: Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior. While effective for improving average helpfulness, this can suppress the natural variation of human responses across languages, tasks, and dialogue settings. We study this problem as conditional human-distribution alignment: models should match the human response distribution appropriate to the current interaction context, rather than a universal response style. We introduce PolyAlign, a distribution-aware alignment framework that organizes bilingual interaction data into bucket-specific human reference distributions defined by language, interaction track, response family, and length. PolyAlign combines Bucket-Aware SFT, which balances optimization across heterogeneous buckets, with Human-Distribution Preference Optimization (HDPO), which regularizes preference learning using critic-estimated distance to bucket-specific human support. Across a bilingual evaluation suite covering English and Chinese single- and multi-turn settings, PolyAlign improves conditional naturalness and distributional faithfulness while preserving competitive task utility. The results suggest that post-training should move beyond global alignment objectives toward interaction-aware alignment with human response distributions.

Embedded Trefftz DG method for steady Navier-Stokes flow. Part I: Oseen linearization

Paul Stocker, Igor Voulis, Christoph Lehrenfeld, Philip L. Lederer — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13229v1 Announce Type: new Abstract: We develop an embedded Trefftz-DG method for the Oseen problem and prove a complete stability and quasi-optimality theory in standard DG norms. The key ingredient is a construction of a suitable local complement space to the Trefftz space, on which the Oseen operator is stably invertible. We also derive a reduced formulation of the method in which the resulting system is posed in terms of the velocity unknown only, a crucial step in the analysis, especially for the nonlinear Navier-Stokes problem in Part II.

WT-UMI: Tactile-based Whole-Body Manipulation via Force-Supervised Contact-Aware Planning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13232v1 Announce Type: new Abstract: Whole-body humanoid manipulation of bulky, deformable, and shared-load objects requires distributed contact sensing and explicit force regulation, yet most imitation policies treat contact force only implicitly. On the other hand, different demonstration sources provide complementary modalities with inherent trade-offs: human demonstrations capture natural contact forces but not robot-executable actions, while teleoperation directly records robot actions but with less natural force regulation. This paper presents WT-UMI, a wearable whole-body tactile interface worn by human operators or mounted on humanoids, providing accurate observations of tactile images, contact forces, and end-effector poses across both human demonstration and humanoid teleoperation modes. We introduce a force-conditioned target-pose correction module that converts measured human poses into contact-aware robot targets by learning corrections from teleoperation data. To leverage the natural force interaction in human data, we propose a force-supervised planner that predicts end-effector pose chunks and contact-force trajectories. The predicted contact force serves as the reference for a tactile-based admittance controller. Across five contact-rich tasks spanning deformable objects, bulky rigid objects, and human-humanoid collaboration, WT-UMI improves success rate and reduces contact-position tracking error over four policy baselines. Our project page is available at https://wt-umi.github.io/WTUMI/.

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13233v1 Announce Type: new Abstract: Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose ReSET, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-$M$ NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to about 2 points over the NVFP4 baseline. Our CUDA-core small-$M$ kernel further improves latency-critical decoding, delivering up to $2.5\times$ kernel-level speedup over NVFP4 vLLM and approximately $2\times$ end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.

Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier

Olga Isupova, Danil Kuzin, Ella Browning, Tom Mills, Steven Reece — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13236v1 Announce Type: new Abstract: Passive acoustic monitoring holds great promise for ecological inference, yet existing automated tools are typically narrowly trained and non-transferable. We address these limitations with PULSE, a semi-supervised, multi-task framework for Orthoptera bioacoustics, combining weakly-supervised species classification, self-supervised learning on unlabelled field audio, and knowledge distillation from a general-purpose bioacoustic model. Our domain-adapted specialist model outperforms a state-of-the-art general model across all metrics (macro F1: 0.21 vs. 0.07; AUC: 0.74 vs. 0.45; AP: 0.32 vs. 0.19), with active learning further raising F1 to 0.34 and AUC to 0.84. Beyond classification, the learned embeddings encode ecologically meaningful structure, exposed through an interactive visualisation tool for ecological discovery.

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13239v1 Announce Type: new Abstract: Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. To validate this paradigm in the most demanding environments, we introduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Our experiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero success under GUI-based interaction, whereas COM-based execution yields substantial immediate gains. To bridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, a self-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to an external CAD benchmark.

Towards More General Control of Diffusion Models Using Jeffrey Guidance

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13240v1 Announce Type: new Abstract: A key strength of diffusion models lies in their flexibility, since their outputs can be controlled at sampling time through guidance. However, beyond simple cases such as conditional sampling, the target distribution is often left implicit, defined only through a sampling rule or a heuristic energy function. To address this, we propose Jeffrey guidance, a principled framework that extends diffusion-model control to applications beyond what standard guidance can express. It leverages Jeffrey's rule of conditioning to update marginal distributions towards a prescribed target, preserving the conditional structure and minimally perturbing the joint distribution. We first demonstrate Jeffrey guidance by targeting a prescribed embedding distribution. With Inception embeddings as the target, this leads to substantial reductions in FID on both CIFAR-10 and FFHQ. We further apply Jeffrey guidance to fairness on CelebA-HQ, updating an unconditional diffusion model to enforce independence between attributes.
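As a reading aid for the abstract above, Jeffrey's rule of conditioning can be illustrated on a small discrete joint distribution: the marginal over one variable is replaced by a prescribed target while every conditional P(x | y) is kept unchanged. The sketch below is a minimal illustration of that rule only, not the paper's diffusion guidance method; the function name `jeffrey_update` and the toy distribution are hypothetical.

```python
def jeffrey_update(joint, target_marginal):
    """Apply Jeffrey's rule: P'(x, y) = P(x | y) * Q(y).

    joint: dict mapping (x, y) -> probability under the current model P.
    target_marginal: dict mapping y -> prescribed probability Q(y).
    (Both names are illustrative assumptions, not taken from the paper.)
    """
    # Current marginal P(y), obtained by summing the joint over x.
    marginal = {}
    for (_, y), p in joint.items():
        marginal[y] = marginal.get(y, 0.0) + p

    updated = {}
    for (x, y), p in joint.items():
        q = target_marginal.get(y, 0.0)
        if marginal[y] == 0.0:
            if q > 0.0:
                raise ValueError(f"target puts mass on y={y!r}, which P excludes")
            updated[(x, y)] = 0.0
        else:
            # P(x | y) is preserved; only the weight of y changes.
            updated[(x, y)] = p / marginal[y] * q
    return updated


# Toy joint over x in {"a", "b"} and y in {0, 1}; P(y=0) = 0.4, P(y=1) = 0.6.
joint = {("a", 0): 0.1, ("b", 0): 0.3, ("a", 1): 0.4, ("b", 1): 0.2}
# Prescribed target marginal over y.
target = {0: 0.5, 1: 0.5}
new_joint = jeffrey_update(joint, target)
```

In this toy case the updated joint has the target marginal over y, while the conditional of x given y is the same as before the update, which is the "minimal perturbation" property the abstract refers to.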

Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm

Francesco Massa, Marco Cristofanilli — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13241v1 Announce Type: new Abstract: Defining query difficulty is one of the hardest problems in deployment engineering. Existing LLM routers rely on surface features such as domain labels, keywords, and token count, ignoring the within-domain variance that actually determines model success. Frontier models cost ten to one hundred times more than local open-weight models, so at production scale even small per-request savings become a direct cloud-bill lever. We present Brick, a multimodal router that scores each model on six capability dimensions, combines this with a per-query difficulty estimate, and dispatches via a cost-penalized geometric rule. A continuous preference knob lets operators slide between max-quality and max-saving profiles at deploy time. On a benchmark of 5,504 queries, Brick at max-quality reaches 76.98% accuracy, beating the best single model (75.02%) and all tested routers. At a neutral cost-quality profile, Brick achieves 74.11% accuracy at 4.71x lower cost than always using the strongest model. At min-cost, it cuts cost 22.15x with 11.85 points accuracy loss. Median latency drops from 51.2s to 22.8s.

A $q$-analogue of the rational normal curve and linearized Reed-Solomon codes

Valentina Astore, Martino Borello, Alain Couvreur, Flavio Salizzoni — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13246v1 Announce Type: new Abstract: The relationship between linear codes in the Hamming metric and projective algebraic varieties has led to deep interactions between coding theory and algebraic geometry, with classical examples such as Reed-Solomon codes and the rational normal curve. On the other hand, the sum-rank metric has recently gained attention due to applications in network coding, distributed storage, and post-quantum cryptography, with linearized Reed-Solomon codes emerging as optimal constructions. Despite recent advances, their structural and geometric properties are still not fully understood, and existing distinguishers remain limited. In this paper, we develop a geometric framework for linearized Reed-Solomon codes by considering a $q$-analogue of the rational normal curve. This yields a geometric characterization for certain parameter choices and reveals that the corresponding sets of points satisfy unexpectedly many $(q+1)$-degree hypersurface conditions. Our approach extends Schur-product-based techniques from the Hamming and rank-metric settings to the sum-rank metric case. Finally, we study the Hilbert function of the associated coordinate ring, providing a detailed description of its behavior and identifying its regularity, which also sheds new light on Gabidulin codes.

EPIG: Emotion-Based Prompting for Personalised Image Generation

Emna Othmen, Mohamed Yassine Landolsi, Lotfi Ben Romdhane — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13247v1 Announce Type: new Abstract: Text-to-image diffusion models have achieved impressive results in synthesizing high-quality images from natural language prompts. However, commonly used prompting strategies remain relatively generic, limiting the model's ability to accurately express emotional intent and nuanced affective attributes. This work proposes EPIG, a method that enhances emotional expressiveness at the prompt level prior to image generation. Grounded in psychologically informed emotion representations (valence-arousal) and leveraging structured, role-aware prompt enrichment, EPIG enriches emotion-related components of prompts without modifying or retraining the image generation backbone. The resulting emotion-aware prompts guide the generative process toward more emotionally coherent visual outputs, with particular effectiveness in controlling arousal. EPIG is lightweight, training-free, and well suited for resource-constrained and personalized image generation scenarios. Experimental results on a benchmark of 10 diverse prompts show that EPIG reduces mean arousal error compared to strong baselines, including naive insertion and LLM-based prompt expansion, with reductions of 14% and 12%, respectively. These improvements are statistically significant. EPIG also preserves valence alignment and semantic consistency, as measured by CLIPScore and supported by ablation studies. The effect is more pronounced on prompts containing explicit subjects such as humans, children, or animals, where the reduction reaches 17%, highlighting the subject-sensitive behavior of the proposed method.

Q-Backbone: A Quantum-Enhanced Control Plane for Future Communication Networks

Mahdi Chehimi, Nour Dehaini, Nikos A. Mitsiou, Ioannis Krikidis, Gan Zheng — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13248v1 Announce Type: new Abstract: Future networks will need to make network-wide decisions, including traffic engineering, network slicing, and wireless optimization, under strict latency, energy, and reliability constraints. The computational complexity of these problems increasingly challenges classical optimization methods. This article proposes Q-Backbone (QB), a quantum-enhanced control plane for communication networks in which quantum processing units (QPUs) operate alongside classical computing resources as accelerators for network intelligence. QB is designed as a four-layer architecture that combines heterogeneous infrastructure, hybrid quantum-classical runtime services, policy-driven task orchestration, and communication-network applications. A central component of QB is the Quantum Invocation Policy (QIP), which dynamically determines when quantum acceleration is beneficial and when classical execution should be preferred. A case study on deadline-aware orchestration of distributed quantum jobs over heterogeneous QPUs shows that QB can improve workload execution under tight deadline constraints, serving up to 25% more jobs than existing quantum-cloud scheduling baselines. Finally, open challenges and opportunities towards the deployment of QB are highlighted and discussed.

Multi-Field Hybrid Retrieval-Augmented Generation for Maritime Accident Root Cause Analysis

Seongjin Kim, Sungil Kim — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13249v1 Announce Type: new Abstract: Maritime accident adjudication reports contain critical tribunal findings for root cause analysis (RCA), yet retrieving relevant precedents and drafting consistent reports from decades of records remains labor-intensive. This paper proposes a multi-field hybrid retrieval-augmented generation (RAG) framework for automated maritime RCA, utilizing a comprehensive dataset of 13,329 Korea Maritime Safety Tribunal (KMST) reports (1971-2025). We transform raw adjudications into a structured knowledge base of "incident cards", indexing three distinct fields-Summary, Causes, and Disposition-alongside a hierarchical L1/L2 cause taxonomy. Our retrieval strategy employs a field-aware hybrid approach, fusing sparse and dense rankings via Reciprocal Rank Fusion (RRF). Given the lack of large-scale expert relevance labels, we evaluate retrieval performance using ceiling-normalized recall and nDCG based on a metadata-derived proxy relevance score. Experimental results demonstrate that our proposed retrieval significantly outperforms baseline methods, improving NormRecall@100 from 0.18 to 0.55. Furthermore, grounding the generator on the retrieved precedents enhances RCA generation quality over an LLM-only baseline, increasing the LLM-as-a-judge score from 3.34 to 3.72. These findings suggest that field-aware RAG can substantially streamline maritime safety investigation workflows by enabling faster precedent search and more consistent, evidence-based RCA drafting.
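For readers unfamiliar with Reciprocal Rank Fusion, the fusion step named in the abstract above can be sketched as follows: each document receives a score of 1 / (k + rank) from every ranked list it appears in, and the scores are summed. This is a minimal sketch of generic RRF, assuming the commonly used constant k = 60; the abstract does not state the paper's constant, its field weighting, or its retrievers, and the record identifiers below are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default, assumed here).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical sparse (keyword) and dense (embedding) rankings.
sparse_ranking = ["r3", "r1", "r2"]
dense_ranking = ["r1", "r4", "r3"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Because only ranks are used, the sparse and dense scores never need to be calibrated against each other, which is one reason RRF is a common choice for hybrid retrieval.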

To GAN or Not To GAN: Segmentation Analysis on Mars DEM

Douglas Dziedzorm Agbeve, Aditya V. Handrale, Salim Fares, Seif E. Idani — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13252v1 Announce Type: new Abstract: To better understand the Martian surface, which is needed to enable rovers to navigate Mars with ease, it is necessary to determine the location of mounds. Detecting and studying these morphologies can also help us find evidence of extraterrestrial life, in this case, more specifically, water or signs of life-conducive environments. Mound detection has been done by manually mapping morphological parameters onto Digital Elevation Models. This paper addresses the problem by automatically detecting and/or predicting mounds on Mars using neural-network-based semantic segmentation. This is done using a supervised semantic segmentation model and a generative adversarial approach. A comparison of the approaches shows that adding extra artificially generated data did not improve the result.

Towards Personalized Federated Learning for Dysarthric Speech Recognition

Tao Zhong, Mengzhe Geng, Jiajun Deng, Shujie Hu, Xunying Liu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13253v1 Announce Type: new Abstract: Speech recognition is challenging for dysarthric speakers. While federated learning (FL)-based ASR can be an effective tool for protecting privacy, it suffers from heterogeneity issues caused by speaker variability. Forcing all speakers to share the same model components can be suboptimal under such heterogeneity, making personalization a promising direction; however, related research on dysarthric speech remains limited. To this end, this paper explores two aggregation strategies to achieve personalization, including the parameter-based averaging strategy and the embedding-based averaging strategy. Experiments on UASpeech and TORGO show that the proposed methods outperform the baseline regularized FedAvg by statistically significant WER reductions of up to 0.99% absolute (3.15% relative) on UASpeech and 0.56% absolute (4.73% relative) on TORGO, respectively.

Evaluating Pluralism in LLMs through Latent Perspectives

Laura Majer, Jan Šnajder, Martin Tutek — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13254v1 Announce Type: new Abstract: The growing need to represent diverse perspectives has increased interest in pluralistic LLM generation. Although difficult to operationalize, identifying perspectives expressed in text would provide clear guidance on pluralistic alignment and more clearly articulate the pluralistic gap in LLM generation. While models have been shown to reduce the diversity of training data and generate homogeneously, this has been demonstrated primarily on multiple-choice questionnaires or using high-level characteristics of free-form text. In this paper, we introduce and implement a domain-agnostic multi-layered framework for unsupervised extraction of perspectives suitable for identifying the pluralistic gap in LLM-generated text. We evaluate our framework on book reviews, a highly opinionated dataset representing diverse perspectives, and compare various prompts and models. Our results show that while some models and prompting techniques come close to covering a broad spectrum of perspectives, rarer perspectives remain disproportionately underrepresented, resulting in distributions that diverge from human text.

Embedding-based Methods for Linear Solver Performance Prediction

Hayden Liu Weng, Hans-Joachim Bungartz, Felix Dietrich — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13255v1 Announce Type: new Abstract: The solution of large, sparse linear systems often dominates the computational effort of scientific applications and is a frequent optimization target. Modern libraries provide numerous solver and preconditioner configurations, but their performance varies significantly across problem instances. Previous works have addressed the selection of an optimal solver, but are typically limited by the problem set addressed (e.g., only symmetric positive definite matrices), the use of expensive matrix features, or the complexity of the approach. This work proposes a modular, low-cost embedding-based framework for solver selection that decouples performance modeling from feature representation and downstream prediction. Solver-problem relationships are learned directly from observed performance data, while inexpensive numerical features are used to project unseen problems into the learned embedding space. The framework focuses on multilabel prediction and evaluation using user-centric metrics, such as MAPE and nDCG, which better reflect relative performance. Experiments on 621 matrices from the SuiteSparse matrix collection across 101 PETSc solver configurations demonstrate a 17% increase in top-prediction accuracy over classical feature-based models when expensive numerical features are included, along with reductions of 37% in mean average percentage error (MAPE) and 46% in top-prediction error (1-error). When restricted to a reduced feature set, the embedding approach remains competitive, while still consistently achieving ca. 24% lower MAPE and 1-error across a broad range of problems.
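The two metrics named in the abstract above can be made concrete with a short sketch. MAPE is taken here in its usual sense of mean absolute percentage error, and nDCG uses the standard logarithmic discount; how the paper maps solver performance to relevance scores is not stated in the abstract, so the inputs below are hypothetical.

```python
import math


def mape(true_values, predicted_values):
    """Mean absolute percentage error, in percent (true values must be nonzero)."""
    errors = [abs(p - t) / abs(t) for t, p in zip(true_values, predicted_values)]
    return 100.0 * sum(errors) / len(errors)


def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))


def ndcg(relevances_in_predicted_order):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    return dcg(relevances_in_predicted_order) / ideal if ideal > 0 else 0.0


# Hypothetical solver runtimes in seconds: measured vs. predicted.
measured = [2.0, 4.0, 8.0]
predicted = [2.2, 3.6, 8.8]
runtime_mape = mape(measured, predicted)

# Hypothetical relevance of solvers, listed in the order the model ranked them.
perfect_ranking = ndcg([3, 2, 1])
reversed_ranking = ndcg([1, 2, 3])
```

Both metrics are relative: MAPE normalizes each error by the true value, and nDCG normalizes by the best achievable ordering, so neither is dominated by the problems with the longest runtimes.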

Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes

Anna-Maria Velentza, Anne-Gwenn Bosser — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13256v1 Announce Type: new Abstract: Humor plays a central role in human social relationships, and recent advances in computational humor create new opportunities for integrating humor into human-robot interaction (HRI). While large language models (LLMs) can generate diverse forms of humor, it remains unclear how humor style, joke content, and language preference shape perceptions of robot-delivered humor in group settings. In this exploratory study, we employed a mixed factorial design in which participants evaluated AI-generated jokes delivered by a robot in a university classroom. We examined the effects of humor type (Affiliative, Self-Enhancing, Aggressive, Self-Defeating) and joke content (person-related vs. political) on perceived funniness and appropriateness, as well as preferred language. Results show that humor type significantly influences funniness, with Aggressive and Affiliative humor rated higher, while joke content primarily affects appropriateness, with person-related jokes preferred over political ones. Language preference was shaped by both joke content and participants' self-reported fluency and humor practices.

MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment

Minlin Zeng, Zhipeng Zhou, Yang Qiu, Martin J. McKeown, Zhiqi Shen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13258v1 Announce Type: new Abstract: Gait-based Parkinson's disease assessment increasingly relies on heterogeneous sensors, but clinical systems rarely collect all modalities simultaneously. New sensors may arrive through device upgrades, protocol changes, or multi-center deployment, while historical patient data are often unavailable because of privacy and storage constraints. This modality-incremental setting faces three challenges: unreliable cross-modal distillation, modality-specific statistical shifts, and reduced plasticity after preservation. We propose MOSAIC, a compact continual learning framework. First, we identify the Toxic Teacher phenomenon and introduce Modality-Specific Warm-Up to stabilize newly learned modality representations before distillation. Second, we propose a statistics-decoupled MSBN architecture that isolates sensor statistics while maintaining a shared semantic backbone. Third, we design a curriculum-guided repulsive objective for Plasticity Recovery, preserving legacy knowledge while recovering modality-specific capacity. Experiments on three multimodal Parkinson's gait datasets show that MOSAIC improves final performance and mitigates forgetting. Project code is available at: https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git

Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning

Paolo Muratore, Mackenzie Weygandt Mathis — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13260v1 Announce Type: new Abstract: Identifying latent dynamical systems from noisy, high-dimensional measurements is a central problem at the intersection of representation learning, system identification, and scientific discovery. We present DYSCO, a multi-view temporal contrastive learning algorithm that jointly recovers latent trajectories and the governing dynamics from such observations, by leveraging multiple independent noisy views of the same underlying process to disentangle signal from noise. By parameterizing the dynamics in a structured functional basis, our framework further enables symbolic recovery of the governing equations within an affine gauge. We offer theoretical guarantees for strong identification up to an affine indeterminacy, extending prior identifiability results to the realistic setting of noisy nonlinear observations. Empirically, we demonstrate accurate recovery of both latent trajectories and flow fields across a diverse set of dynamical regimes (e.g., chaotic, oscillatory, and metastable) under both Gaussian and Poisson observation noise, the latter being particularly relevant for neural recordings.

From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification

Rongxin Yang, Shenghong He, Siyuan Zhu, Chao Yu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13262v1 Announce Type: new Abstract: Recent approaches combining Large Language Models (LLMs) with retrieval-augmented reasoning have shown promise for automated fact verification. To process complex claims, these verification pipelines typically execute multi-stage workflows that coordinate tightly coupled modules, including claim decomposition, evidence gathering, and verdict prediction. However, existing methods optimize individual stages in isolation or rely on fixed heuristics, which limits adaptive coordination among stages and can lead to suboptimal outcomes. In this work, we propose ProFact, an agentic reinforcement learning framework for end-to-end optimization of multi-stage fact verification trajectories. ProFact trains a unified policy to coordinate claim decomposition, evidence seeking, answer generation, and verdict prediction. To address the sparse and delayed supervision provided by final veracity labels, ProFact introduces process-aware rewards that provide stage-level learning signals throughout the verification process. Empirical evaluation shows that ProFact consistently outperforms strong baselines in both verification performance and inference efficiency. These results highlight the effectiveness of process-aware trajectory optimization for multi-stage fact verification.

Dynamic Resource Management in Production HPC Clusters

Petter Sandås, Sergio Iserte, Guillaume Houzeaux, Antonio J. Peña — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13266v1 Announce Type: new Abstract: Many large-scale scientific applications exhibit time-varying behavior, yet production HPC clusters still rely on rigid, fixed-size allocations, and most dynamic techniques remain confined to laboratory prototypes. This work presents a practical MPI malleability methodology that integrates with state-of-the-art high-performance computing (HPC) software stacks and operational practices. The methodology is implemented in the Dynamic Management of Resources (DMR) framework and is designed to ease adoption by existing applications without requiring intrusive code changes or scheduler modifications. We evaluate our approach by integrating the DMR API into two large-scale scientific applications and deploying them on three TOP500 supercomputers under realistic production configurations. Our non-invasive malleability solution achieves performance comparable to static baselines in controlled environments while substantially reducing node-hour consumption for identical workloads. These results show that malleability can be effectively exploited on production systems using vanilla resource managers, lowering the barrier to adoption of dynamic resource management in HPC.

TimeLens: On-Device Artifact Recognition with Retrieval-Augmented Question Answering for the Grand Egyptian Museum

Rawan Hesham, Ali Ashraf, Amr Ahmed, Malak Alaa, Omar Ahmed, Omar Wagih — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13267v1 Announce Type: new Abstract: TimeLens is an AI-powered bilingual mobile guide for the Grand Egyptian Museum (GEM). Pointing a phone at an exhibit, a visitor sees the artifact recognized in real time and can ask follow-up questions answered in English or Arabic. The work addresses three problems specific to in-gallery deployment: fine-grained visual similarity among 51 catalogued artifacts (many near-identical Ramesside statues), the gap between curated training data and handheld camera conditions, and the risk of an AI guide stating unsupported historical facts. Two engineering contributions are reported. First, an on-device artifact detector was developed through a data-quality-driven iteration study -- from foundation-model auto-annotation (YOLO-World), through spatial label-cleaning rules, to a fully hand-annotated dataset -- isolating label quality as the decisive factor: the final YOLOv8n model resolves every previously failing class while remaining a 5.97 MB TensorFlow Lite asset that runs in real time on a mid-range phone (mAP@0.5 = 0.995, mAP@0.5:0.95 = 0.924). Second, a bilingual Retrieval-Augmented Generation (RAG) guide, grounded in a 108-record ChromaDB knowledge base, was benchmarked across seven candidate language models, with Gemma 4 E2B (Q4_K_M) selected; ten targeted optimizations reduce end-to-end latency from over 30 s to approximately 10 s. Both subsystems are integrated in a production Flutter application with bilingual interface, museum location gating, and text-to-speech support.

Split Tallies: A Discrete Certificate Calculus for Auditing Dynamic Ordered Sets in Constant Memory

Faruk Alpay, Levent Sarioglu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13272v1 Announce Type: new Abstract: We study retrospective auditing for dynamic ordered sets maintained by an untrusted party. A passive auditor watches insert, delete, membership, predecessor, successor, min, and max operations, stores five machine words and a flag, and receives a constant-size public tally record per operation. At audit time the maintainer discloses the claimed live vacant intervals. The method represents order semantics by maximal gaps: gaps are born, cited, consumed, and timestamped, while two hidden field accumulators test equality of the birth and consumption ledgers. Honest executions are accepted with probability one. If any answer in a T-operation session is wrong, acceptance occurs with probability at most (4T+1)/p over one secret field element, against computationally unbounded maintainers. We prove that deterministic and visible-coin auditors require linear state, and that removing the timestamp rule permits an exact replay forgery. A leaf-oriented (2,4)-tree implements the maintainer in O(log n) worst-case time per operation with one extra word per element, and its rebalancing events admit an auditable O(m) envelope over m updates. Checkpoint audits compose with additive error.
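The idea of testing equality of two ledgers with a small hidden accumulator can be illustrated with the standard polynomial-identity fingerprint for multisets: each ledger is summarized by the product of (r - item) modulo a prime p at a secret point r, so equal multisets always agree and unequal ones collide only if r happens to be a root of a low-degree polynomial. This is a sketch of that generic technique under stated assumptions, not the paper's certificate calculus: the prime, the class name `LedgerFingerprint`, and the encoding of gaps as integers below p are all hypothetical, and the paper's (4T+1)/p bound is its own result.

```python
import secrets

# Field modulus: the Mersenne prime 2^61 - 1 (an illustrative choice).
P = (1 << 61) - 1


class LedgerFingerprint:
    """Constant-size, order-independent fingerprint of a multiset of integers."""

    def __init__(self, secret_point):
        self.r = secret_point % P
        self.acc = 1  # running product of (r - item) mod P

    def record(self, item):
        # Items are assumed to be already encoded as integers in [0, P).
        self.acc = (self.acc * ((self.r - item) % P)) % P


def ledgers_match(births, consumptions, secret_point):
    """Compare two ledgers using one field element of state per ledger."""
    born = LedgerFingerprint(secret_point)
    consumed = LedgerFingerprint(secret_point)
    for item in births:
        born.record(item)
    for item in consumptions:
        consumed.record(item)
    return born.acc == consumed.acc


# In real use the auditor draws the evaluation point secretly and uniformly.
secret = secrets.randbelow(P)
honest = ledgers_match([5, 9, 12], [12, 5, 9], secret)
```

Keeping the evaluation point hidden matters: a maintainer who knew r could choose a forged ledger that collides, which is consistent with the abstract's statement that visible-coin auditors need linear state.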

Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing

Anugrah Aidin Yotolembah, Novanto Yudistira, Gembong Edhi Setyawan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13275v1 Announce Type: new Abstract: This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The dataset contains 3,800 expert-annotated images from all 38 Indonesian provinces. Using a province-level inductive zero-shot protocol, the model is trained on 24 seen provinces, validated on 6 seen provinces, and evaluated on 8 unseen provinces. The framework combines a frozen CLIP ViT-B/32 image encoder, a CLIP text encoder, a BERT text encoder, and an LSTM caption decoder. During inference, unseen-province labels and captions are unavailable, and retrieval uses only captions from training provinces. No unseen-province image, label, or caption is used during training, validation, or retrieval-bank construction. Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, outperforming existing baselines. Ablation results show that retrieval improves cultural vocabulary recovery with a 19.3% METEOR gain, while human evaluation confirms stronger cultural accuracy and fluency. The results demonstrate the effectiveness of retrieval-augmented domain adaptation for culturally grounded caption generation in low-resource heritage settings. The dataset is publicly available at https://github.com/AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset.

Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

Kirato Yoshihara — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13276v1 Announce Type: new Abstract: Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.
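
The saturation mechanism mentioned above can be seen in a toy calculation. The sketch below multiplies a fixed set of attention logits by a growing factor and reports the largest softmax probability. The `scale` factor is an assumed stand-in for growth in the singular values of the query/key weights; the logit values are arbitrary and nothing here reproduces the paper's experiments.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_prob_after_scaling(logits, scale):
    """Largest attention probability after multiplying every logit by `scale`."""
    return max(softmax([scale * z for z in logits]))

logits = [1.0, 0.5, -0.5, 0.0]  # arbitrary illustrative logits
for scale in (1.0, 4.0, 16.0):
    # As scale grows, the distribution approaches one-hot and gradients through softmax shrink.
    print(scale, round(max_prob_after_scaling(logits, scale), 4))
```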

See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation

Yoon-Ji Choi, Young-Chae Son, Soo-Chul Lim — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13279v1 Announce Type: new Abstract: In bimanual robotic manipulation, task-relevant visual information varies with the task stage and context, while the interaction of the two arms shifts between independent and coordinated modes, making policy learning challenging. However, existing monolithic Vision-Language-Action (VLA) policies process diverse visual inputs and interaction patterns through a single shared representation and action generation pathway, often failing to separately account for visual relevance and bimanual interaction structure. To address this issue, we propose a bimanual manipulation VLA framework based on Dual-Level Structural Decomposition. The View-Selective Visual Router dynamically adjusts wrist-view contributions to emphasize relevant visual cues, while the Interaction-Aware Action Mixture-of-Experts (MoE) decomposes action generation into coordinated and arm-wise pathways to adapt to varying bimanual interaction modes. We evaluate the proposed method on six simulated bimanual manipulation tasks in RoboTwin 2.0 and three long-horizon real-world tasks. Our model improves the overall average success rate over a monolithic baseline by 27.7% in simulation and 43.3% in real-world evaluation, while consistently outperforming single-module variants across both settings. These results demonstrate that jointly considering selective visual processing and explicit decomposition of bimanual interaction structures provides an effective inductive bias for robust bimanual manipulation.

ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space

Pratyush Chaudhari — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13282v1 Announce Type: new Abstract: As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results demonstrate that only 33% of models achieve assessment clearance, with the local Llama-3.2 model proving particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.

Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation

Beinan Xu, Andy Song, Jiti Gao, Feng Liu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13285v1 Announce Type: new Abstract: We introduce Equilibrium State Estimation (ESE), a novel paradigm for simultaneous prediction, where multiple interacting systems require separate yet coordinated forecasts. Such scenarios often arise in real-world settings such as economics and healthcare modeling. Unlike existing approaches that predict one system at a time, ESE forecasts all systems in a single pass. It first estimates the equilibrium state across systems, then generates holistic forecasts based on the difference between the current state and the estimated equilibrium. Extensive experiments on synthetic and real-world datasets, including currency exchange and COVID-19 spread modeling, demonstrate that ESE is at least as accurate as state-of-the-art (SOTA) methods while being significantly faster. In addition, ESE integrates seamlessly with conventional predictors, combining their accuracy with its exceptional efficiency and delivering a 10-70x speedup. With linear-time complexity, ESE scales far better than SOTA methods as the number of systems increases. Moreover, it remains accurate under diverse perturbations, establishing ESE as a fast, generalizable, robust, and scalable multi-prediction method.

Error Probability Analysis of Quantum Communication with Phase-squeezed M-PSK

Nikos A. Mitsiou, Ioannis Krikidis — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13286v1 Announce Type: new Abstract: In this paper, we investigate the symbol error probability (SEP) of phase-squeezed M-ary phase-shift keying (M-PSK). Since the relevant observable for M-PSK detection is the optical phase, we adopt the adaptive Mark-II receiver, which is a physically realizable phase measurement. First, we develop a theoretical analysis based on the phase probability operator measure (POM) of the Mark-II scheme in the Fock basis. Then, we develop two SEP methods based on the statistics of the received PSK symbol and the error introduced by the Mark-II measurement. The first method derives the phase probability density induced by the squeezed-state noise and incorporates the additional Mark-II phase uncertainty through an angular convolution. Since this convolution does not admit a simple closed form, we also introduce an effective tangential-variance model, which yields a closed-form SEP expression in terms of Owen's T-function. Numerical results show that phase squeezing substantially reduces the SEP of M-PSK compared to coherent-state transmission, with greater gains for higher constellation orders. Notably, for the investigated scenario, squeezing can almost double the photon efficiency of M-PSK as the mean number of transmitted photons increases. Finally, the proposed approximations closely follow the Mark-II POM analysis, typically within an accuracy of 2-4 photons, and therefore provide accurate and computationally efficient tools for analyzing phase-squeezed quantum M-PSK communication.

Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

Samuel Erickson, Mikael Johansson — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13287v1 Announce Type: new Abstract: In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of ASGD is nonetheless affected negatively by slow workers due to large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping "stabilizes" training. In this work, we provide a theoretical justification for this behavior by showing that clipping removes the dependence of the oracle complexity on the maximum delay. We employ a sub-Weibull model of gradient noise, which generalizes sub-Gaussian and sub-exponential distributions to more heavy-tailed distributions, motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
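
To make the setting concrete, here is a minimal sketch of asynchronous SGD with stale gradients and optional clipping on a toy scalar objective. It is not the authors' algorithm or analysis: the objective f(x) = 0.5 * x**2, the delay schedule, and the function names are assumptions for illustration, and gradient noise is omitted.

```python
def clip(g, c):
    """Scalar gradient clipping: limit the magnitude of g to at most c."""
    return max(-c, min(c, g))

def async_sgd(x0, step, delays, clip_at=None):
    """SGD on f(x) = 0.5 * x**2 where update t uses the gradient evaluated
    at the iterate from delays[t] steps earlier (a straggler's stale view)."""
    history = [x0]
    for t, d in enumerate(delays):
        stale_x = history[max(0, t - d)]
        g = stale_x  # gradient of 0.5 * x**2 at the stale iterate
        if clip_at is not None:
            g = clip(g, clip_at)
        history.append(history[-1] - step * g)
    return history

delays = [0, 1, 2, 3, 4, 5]  # hypothetical schedule with growing staleness
print(async_sgd(4.0, 0.5, delays))
print(async_sgd(4.0, 0.5, delays, clip_at=1.0))
```

With clipping, each update moves the iterate by at most `step * clip_at`, however stale the gradient is, which is the intuition behind bounding the damage a slow worker can do.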

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

Wei Li, Zhen Huang, Xinmei Tian — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13288v1 Announce Type: new Abstract: Contrastively trained vision-language models like CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit a "bag-of-words" behavior, struggling to capture object relations, attribute-object bindings, and word-order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from the insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. Additionally, the improved compositionality benefits text-to-image generation and multimodal large language models. Code is available at https://github.com/hiker-lw/MACCO.

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13289v1 Announce Type: new Abstract: Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated at the 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.

Feasibility Assessment of Remote Driving via Latency Analysis of ITS-G5 and Cellular Networks in the MASA Living Lab

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13292v1 Announce Type: new Abstract: Remote driving has gained increasing attention as a key enabler for connected and automated vehicles. Yet its practical deployment hinges on wireless networks' ability to guarantee low, predictable latency. In this paper, we present an extensive latency analysis of ITS-G5 and cellular (5G) technologies within the Modena Automotive Smart Area (MASA), a real-world, city-scale testbed equipped with a distributed intelligent transportation infrastructure. By conducting controlled experiments under varying network loads and traffic conditions, we measure network and end-to-end latency components relevant to remote driving, in which the uplink consists of a continuous video stream transmitted from the vehicle to the remote operator, and the downlink conveys control commands back to the car. Measurements conducted under diverse conditions reveal how latency and variability differ across the two technologies and how infrastructure coverage impacts video-stream transmission performance. Based on the observed latency distributions and reliability metrics, we assess the practical feasibility and safety margins of remote driving in mixed network environments. The results provide actionable insights for future teleoperation deployments and motivate hybrid communication strategies that combine the strengths of ITS-G5 and cellular networks.

Mining Architectural Quality Under Agentic AI Adoption: A Causal Study of Java Repositories

Oliver Aleksander Larsen, Mahyar T. Moghaddam — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13298v1 Announce Type: new Abstract: AI coding tools are now used by a majority of developers, and agentic use of these tools has popularized the practice colloquially called "vibe coding". Yet causal evidence on their effect on software architecture is scarce. Prior causal work has measured code-level outcomes (complexity, static analysis warnings); whether such degradation propagates to architecture-level outcomes remains unknown. We mine 151 open-source Java repositories, 74 with detectable agentic AI adoption (identified via configuration files and Co-Authored-By commit trailers) and 77 propensity-matched controls, across a 13-month per-repository window yielding 1,811 monthly Arcan snapshots. We estimate the causal effect of adoption on architectural smell density (ASD) with a staggered difference-in-differences design and the Borusyak imputation estimator, applying a causal design recently used for code-level metrics to the architecture level. Total smell counts are essentially unchanged (+1.1%, p = 0.82) while lines of code grow +12.8% (p = 0.003); the resulting 6.7% ASD decline (p = 0.004) is therefore a denominator effect rather than an architectural improvement. Per-type estimates and robustness checks (wild cluster bootstrap, Lee bounds, stale-observation sensitivity) corroborate the pattern; pre-trends are flat (Wald p = 0.90), consistent with parallel trends. Density-normalized outcomes can mislead when treatment affects system size: raw counts and explicit decomposition are required for causal mining studies of AI tool adoption. The complete replication package, including the curated 151-repository monthly panel, is publicly available.

Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score

Mariya Pavlova, Harrison Bo Hua Zhu, Elizsveta Semenova, Yingzhen Li — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13300v1 Announce Type: new Abstract: We introduce the Trajectory-based Quantization Sensitivity Score (TQS), a metric that reframes post-training quantization (PTQ) through the lens of dynamical-systems stability. By modeling the network's rollout as a discrete-time dynamical system, TQS characterizes how quantization-induced errors propagate and amplify over the rollout horizon. Unlike conventional PTQ methods, where sensitivity analysis is often coupled to the quantization procedure, TQS enables a priori sensitivity estimation decoupled from quantizer selection and bit-width assignment. This separation allows for quantization budget planning even for black-box or compiled networks with fused operators. Building on this, we present TQS-PTQ, a flexible mixed-precision framework that requires no calibration data or costly second-order approximations. Our experiments show that a dynamical-systems perspective provides a robust, high-performing pathway for low-precision deployment in resource-constrained settings.
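
The idea that quantization error can amplify over a rollout can be illustrated with a toy system. The sketch below compares a scalar linear map with its weight-quantized copy over a rollout horizon. It does not implement TQS, whose definition is not given in the abstract; the uniform quantizer, the scalar map, and all names are assumptions for illustration.

```python
def rollout(step, x0, horizon):
    """Iterate x_{t+1} = step(x_t) and return the whole trajectory."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(step(xs[-1]))
    return xs

def quantize(w, bits, w_max=2.0):
    """Hypothetical uniform symmetric quantizer over [-w_max, w_max]."""
    levels = 2 ** (bits - 1) - 1
    scale = w_max / levels
    return round(w / scale) * scale

def trajectory_gap(w, bits, x0=1.0, horizon=20):
    """Per-step distance between the full-precision and quantized rollouts."""
    wq = quantize(w, bits)
    full = rollout(lambda x: w * x, x0, horizon)
    quant = rollout(lambda x: wq * x, x0, horizon)
    return [abs(a - b) for a, b in zip(full, quant)]

# An expanding map (|w| > 1) amplifies the quantization error over the horizon;
# a contracting map (|w| < 1) damps it.
print(trajectory_gap(1.03, bits=4)[-1])
print(trajectory_gap(0.5, bits=4)[-1])
```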

Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13302v1 Announce Type: new Abstract: Wave parameters in the nearshore are crucial for coastal engineering, shoreline protection, marine hazard assessment, and coastal management for climate resilience. Traditional monitoring systems like buoys and radar platforms offer accurate monitoring but can have high installation and maintenance expenses and limited spatial coverage. Passive ocean monitoring using video has been achieved by leveraging deep learning; however, many methods are not physically interpretable, feasible, or validated for oceanography. In this work, a Physics-Guided Deep Spatiotemporal Learning Framework for direct estimation of nearshore wave peak periods from passive coastal video streams is proposed. The framework combines automated temporal-variance-based region-of-interest detection, multi-stage Sim-to-Real transfer learning, and physics-informed regularization to enhance predictive accuracy and physical consistency. A variety of spatiotemporal architectures were assessed, such as transformer-based and recurrent-convolutional ones, alongside synthetic pretraining, silver-label adaptation, and expert fine-tuning. The results show that transformer-based architectures achieved the best instantaneous prediction accuracy, while lightweight recurrent-convolutional architectures achieved higher temporal stability and operational oceanographic skill. Ablation studies also demonstrated the benefits of physics-guided regularization for trend-following consistency and for limiting physically implausible predictions. Explainability auditing indicated that attention concentrates on hydrodynamically active surf-zone regions and showed good agreement with the physically derived wave propagation behavior. In general, the proposed framework shows the promise of physics-guided video-based deep learning systems for long-term coastal wave monitoring that are cost-efficient and operationally feasible.

DuET: Dual Expert Trajectories for Diffusion Image Editing

Lidia Troeshestova, Alexander Ustyuzhanin, Sergey Kastryulin — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13303v1 Announce Type: new Abstract: Recent diffusion editors perform diverse instruction-based edits while conditioning on the source image at every denoising step. Yet persistent source-image conditioning can limit how fully an edit is executed and how natural the result appears, especially when the target scene diverges substantially from the input. We introduce DuET (Dual Expert Trajectories), a training-free inference method that temporarily relaxes source-image conditioning by transitioning through a text-to-image phase before returning to edit mode, allowing the denoising trajectory to move toward the target distribution while retaining the structural benefits of image-conditioned editing. Without modifying model weights or increasing sampling cost, DuET consistently improves instruction relevance, semantic fidelity, and perceptual quality across diverse models and benchmarks. In some cases, these gains come with a modest reduction in source-image preservation, revealing a predictable trade-off between source preservation and edit fidelity.

ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance

Salaheldin Mohamed, M. Hamza Mughal, Rishabh Dabral, Christian Theobalt — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13304v1 Announce Type: new Abstract: Speech-driven talking character animation seeks to generate life-like portrait videos that convey natural conversation behavior, aligning facial motion with spoken audio. Although recent advances in video generation have substantially improved realism in video-based animation, achieving both accurate lip articulation and expressive behavior remains challenging. Existing approaches typically trade off precise phoneme-to-lip synchronization against dynamic facial expressions and head motion, yielding animations that are either accurate yet rigid, or expressive but poorly synchronized. We address this challenge by proposing ReFree-S2V, a flow-matching speech-to-portrait animation framework that builds upon a pretrained video generation model to achieve fine-grained speech articulation and high-level expressive cues in speech-driven portrait animation. This model introduces a multi-level speech representation capturing phonetic and prosodic information at both local and global granularities. These representations are selectively injected into transformer blocks via learnable level selectors, enabling both accurate lip synchronization and natural expressive motion. To achieve natural head movements, we further introduce a novel reward-free reinforcement learning scheme into flow-matching training to discourage perceptually implausible motion without relying on handcrafted synchronization metrics or reward models, or the high cost of human preference annotation. Extensive experiments demonstrate that ReFree-S2V achieves state-of-the-art performance, significantly outperforming existing methods in both quantitative lip-sync accuracy and qualitative human evaluations of naturalness and expressivity.

EconCSLib: AI-Assisted Lean Formalization for Economics & Computation research

Nikhil Garg — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13306v1 Announce Type: new Abstract: This paper presents EconCSLib, a Lean 4 library and workflow for formalizing research papers in Economics and Computation with language-model assistance. The central design principle is a human-AI-Lean workflow: an LLM writes Lean code, Lean checks formal statements and proofs, and humans (assisted by an LLM) verify the translation boundary from paper claims to formal statements. EconCSLib is organized around research papers, preserving their formal statements and following their proof structure to the extent possible; reusable mathematical statements are elevated into shared EconCS infrastructure. The workflow is designed to be author-facing: researchers can formalize their own papers, inspect the Lean code's translations of paper-facing statements, and contribute reusable components back to the library; this is supported by post-formalization validation reports, paper result dependency graphs, and a review dashboard. The current public repository contains 11 formalized papers and 3 partially formalized papers, along with initial libraries for probability, auctions, matching markets, and graph tools. The library and workflow are available at https://github.com/nikhgarg/EconCSLib, with corresponding project webpage at https://gargnikhil.com/EconCSLib/. To our knowledge, we are also among the first applied math researchers to systematically pursue Lean formalization of one's own publications in the process of building such a community library. We welcome users and contributors to the project.

Subdivision-based isogeometric analysis for axisymmetric electromagnetic problems

Devin Balian, Sebastian Schöps, Melina Merkel — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13308v1 Announce Type: new Abstract: This paper applies a subdivision-based isogeometric method to solve the axisymmetric Maxwell eigenvalue problem. The reduction to an $H^1$-formulation allows the use of a Catmull-Clark construction for both geometry and field discretization. The approach yields a numerical solution for the electric field, which is $C^1$-continuous everywhere except at extraordinary vertices. This is demonstrated by computing the eigenmodes of a TESLA 9-cell cavity, showing smoother fields with less numerical noise than conventional methods. The convergence rate of the method is numerically analyzed and is in agreement with rates observed in the literature.

RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13310v1 Announce Type: new Abstract: The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings; the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive webapp that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight via debate. A three-day pilot deployment (467 initiated sessions, 415 completed, 1876 interaction turns in Italian) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally-present linguistic signature - differential helpfulness, brevity, hedging - that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, consistent with ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact's use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.

Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

Yongmin Kim, ByeongHoon Jeon, Sungil Kim — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13311v1 Announce Type: new Abstract: Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions where rare regimes can be critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven rarity score. The rarity score is estimated from the empirical distribution of context variables and regulates how strongly context modulates intermediate representations: the gate becomes more decisive under rare contexts while remaining conservative under frequent contexts. We evaluate RGFiLM on maritime trajectory anomaly detection using AIS motion sequences with ERA5 environmental context in an environment-sensitive detour scenario. When instantiated in a sequential anomaly scoring pipeline, RGFiLM achieves the best mean trade-off between F1 and False Positive Rate (FPR) among the compared context-agnostic and context-conditioned methods. These results suggest that explicitly accounting for context rarity is an effective approach for reducing false alarms in context-sensitive anomaly detection.
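
One possible reading of the rarity-gated modulation described above is sketched below: a FiLM-style scale and shift is blended with the unmodulated features, and the blend weight grows with an empirical rarity score. The rarity definition (one minus empirical frequency), the sigmoid gate, its parameters, and all names are assumptions for illustration. The abstract does not specify RGFiLM's actual equations.

```python
import math

def film(h, gamma, beta):
    """Feature-wise linear modulation: per-feature scale and shift."""
    return [g * x + b for x, g, b in zip(h, gamma, beta)]

def rarity_score(context, counts):
    """Assumed rarity: one minus the empirical frequency of the context."""
    total = sum(counts.values())
    return 1.0 - counts.get(context, 0) / total

def rarity_gate(rarity, sharpness=10.0, midpoint=0.5):
    """Assumed sigmoid gate in (0, 1): near 0 for frequent contexts, near 1 for rare ones."""
    return 1.0 / (1.0 + math.exp(-sharpness * (rarity - midpoint)))

def rg_film(h, gamma, beta, rarity):
    """Blend unmodulated and FiLM-modulated features by the rarity gate."""
    g = rarity_gate(rarity)
    modulated = film(h, gamma, beta)
    return [(1.0 - g) * x + g * m for x, m in zip(h, modulated)]

counts = {"calm": 90, "storm": 10}       # hypothetical context histogram
h = [1.0, -2.0]                          # hidden features
gamma, beta = [2.0, 0.5], [0.1, 0.0]     # context-conditioned scale and shift
print(rg_film(h, gamma, beta, rarity_score("calm", counts)))
print(rg_film(h, gamma, beta, rarity_score("storm", counts)))
```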

MagPlus: Bridging Micro-to-Regular Facial Expressions through Learnable Magnification

Sliman Jammal, Andrei Sharf — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13312v1 Announce Type: new Abstract: Facial micro-expressions are subtle and short-lived facial movements that provide important cues about genuine human emotions. However, modeling and generating them remains difficult because annotated micro-expression data is limited and the underlying facial motions are extremely weak. Existing micro-expression generation methods therefore often suffer from limited quality, weak robustness, and poor generalization. We propose MagPlus, a transferable micro-expression processing pipeline that connects micro-expression analysis with standard facial animation models. Instead of training a dedicated generator from scratch, MagPlus learns to magnify subtle facial motions into the range of regular facial expressions, transforming micro-expressions into signals that are compatible with existing facial expression processing models. The magnified sequence is then used by a standard facial expression model for tasks such as transfer and synthesis. A complementary DeMagPlus module then restores the generated motion back to realistic micro-expression intensity levels while preserving the synthesized dynamics. We evaluate the framework using four facial animation models: FOMM, FSRT, MetaPortrait, and EmoPortraits. None of these models are trained on micro-expression data. Experiments show that MagPlus-DeMagPlus enables pretrained macro-expression models to generate more realistic micro-expression motion without retraining the backbones.

Masked and Predictive Self-Supervised Foundation Models for 3D Brain MRI

Esra Ergün, Hersh Chandarana, Dan Sodickson, Gözde Ünal — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13315v1 Announce Type: new Abstract: Self-supervised foundation models have shown strong promise in medical imaging. However, existing MRI foundation-model studies have primarily emphasized segmentation and dense prediction tasks, while systematic investigation of self-supervised foundation models for MRI-based disease detection remains limited. In this work, we investigate two major self-supervised pretraining paradigms for MRI-based disease detection: reconstruction-based learning via Masked Autoencoders (MAE) and predictive representation learning via Joint Embedding Predictive Architectures (JEPA). We study the role of auxiliary objectives by introducing a novel spectral-domain reconstruction loss for MAE to enhance sensitivity to fine-grained anatomical structure, and by integrating variance-covariance regularization (VCR) within our JEPA framework to encourage decorrelated latent representations. Our models are pretrained on heterogeneous single-contrast MRI volumes in a contrast-agnostic setting, without modality concatenation. Across five downstream disease detection tasks, our results highlight the importance of self-supervised objective design for medical foundation model pretraining, demonstrating that the downstream benefit of each objective is determined by its relevance to the task's structure. Specifically, spectral regularization yields the largest improvements when the downstream discriminative signal is characterized by strong high-frequency anatomical structures, while covariance regularization is most beneficial when discriminative information spans multiple decorrelated feature dimensions. MAE with spectral-domain supervision consistently achieves superior downstream performance for MRI-based disease detection. These findings suggest that self-supervised objectives in medical imaging encode specific biases, and their downstream benefit is fundamentally conditioned on the task's structure.

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13316v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs). However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget. Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory. To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organize their reasoning trajectories through self-summarization. Our pilot studies show that self-summarization stabilizes generation by lowering token-level entropy, and that introducing a "summarization" phrase can substantially mitigate errors propagated from an incorrect rollout prefix. Motivated by these findings, ReSum adopts a summarization-aware adaptive rollout mechanism that contrastively evaluates whether self-summarization benefits the ongoing reasoning process. Specifically, when the model spontaneously triggers self-summarization, ReSum masks the summarization phrase to create a contrastive branch; for non-summarization positions, it instead randomly injects the phrase to create a matched branch. We further design a summarization-aware advantage to enable finer-grained comparison between contrastive rollout trajectories. Extensive experiments show that ReSum improves performance by an average of 4% while reducing rollout length by 18.6%.

SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13317v1 Announce Type: new Abstract: Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross-model and out-of-distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.

Skiplists with Foresight: Skipping Cache Misses

Tomer Cory, Niv Sulimany, Erez Petrank — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13321v1 Announce Type: new Abstract: A skiplist is a fundamental data structure widely used in systems and applications for indexing data stores. In this work, we introduce Foresight, a cache-friendly skiplist optimization. Extending Foresight to concurrent settings introduces significant synchronization challenges that we identify and address. Foresight is a surgical optimization, easy to integrate into a wide variety of skiplist designs. We apply it to one sequential and three concurrent skiplist designs and observe throughput improvements of up to 45% in microbenchmarks. When applied to a skiplist-based index in the DBx1000 in-memory database, Foresight yields end-to-end performance gains of up to 15%.

Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13322v1 Announce Type: new Abstract: We present a low-latency real-time audio game commentary system that generates spoken commentary directly from live gameplay video. In this end-to-end setting, a key bottleneck is accumulated waiting time; conventional pipelines capture frames, generate text, and synthesize speech sequentially for each utterance, and do not request the next generation until speech playback has completed. This strict sequentiality causes long and unnatural silence between utterances. To address this latency bottleneck, our system runs text generation in parallel with speech playback and buffers multiple candidate utterances ahead of time, enabling immediate synthesis at playback boundaries. Experiments on fast-paced game videos show that our parallel design reduces the mean inter-utterance silence from 9.6 seconds to 0.3 seconds compared to sequential baselines. It also improves similarity to professional speaking-silence timing patterns by over 40%, and a user study with 120 experienced game players confirms significantly improved perceived speaking rhythm. Our demo video is available at: https://youtu.be/pmrRUlvav8M.
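The latency argument can be illustrated with a small timeline model. This is a sketch under simplifying assumptions, not the system described above: a single background generator is assumed to produce candidate texts back to back, synthesis time is a constant, and all timings below are hypothetical values rather than measurements from the paper.

```python
def sequential_silence(gen_s, tts_s):
    """Silence between utterances when generation starts only after playback ends."""
    return gen_s + tts_s

def parallel_silences(gen_times, tts_s, play_times):
    """Gap before each utterance when text generation overlaps with playback.

    The next text is synthesized at the playback boundary, or as soon as the
    generator has finished it if that happens later.
    """
    silences = []
    gen_done = 0.0      # time at which the generator finishes the current candidate
    playback_end = 0.0  # time at which the previous utterance stops playing
    for gen_s, play_s in zip(gen_times, play_times):
        gen_done += gen_s
        start = max(playback_end, gen_done) + tts_s
        silences.append(start - playback_end)
        playback_end = start + play_s
    return silences

# Hypothetical timings: 4.0 s generation, 0.5 s synthesis, 5.0 s playback.
gaps = parallel_silences([4.0, 4.0, 4.0], 0.5, [5.0, 5.0, 5.0])
```

In this toy model the sequential gap is always generation plus synthesis time, while the parallel gap collapses to the synthesis time alone once generation finishes within the previous utterance's playback window. The model ignores the staleness of buffered candidates, which a real system has to manage.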

Runtime Analysis of the $(\mu + 1)$-ES in a Homogenous Progress Model

Johannes Lengler, Raghu Raman Ravi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13323v1 Announce Type: new Abstract: We introduce a new simple model to study the fitness progress of Evolution Strategies (ES) in generic problems. In this model, we bypass the underlying fitness landscape and assume that the mutation of any individual produces an offspring whose fitness relative to the parent is given by an invariant distribution $Z$, such as a mean-shifted Gaussian. This serves as a prototypical model for the optimisation landscape when an evolutionary algorithm operates far from the global optimum. This simple model can be used to approximate the optimisation process for problems where it is intractable to model the exact fitness function, including tasks such as hyperparameter tuning in machine learning models. We rigorously analyse the expected growth rate $\mathcal{R}_{\mu}$ of the continuous steady-state $(\mu+1)$-ES in this model. Unlike comma-selection strategies, the steady-state $(\mu+1)$-ES maintains overlapping generations, introducing complex mathematical dependencies among surviving parents that make it harder to analyse. We give a general technique to analyse the $(\mu + 1)$-ES by constructing modified processes whose growth rates provably sandwich that of the original process. These modified processes are then easier to analyse but still close enough to the true process to give a tight bound on the expected growth rate. When $Z = \mathcal{N}(-\delta, 1)$ and $\mu \le e^{\delta}$, we show that $\mathcal{R}_{\mu} = \frac{\log^{1 + o(1)} \mu}{\mu} \mathcal{R}_1$.
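A rough simulation can make the progress model concrete. The sketch below is an illustration only: the uniform parent choice, the discrete step loop, and measuring growth as the gain in best fitness per step are assumptions made here, and the function name `simulate_mu_plus_one` is hypothetical. Only the offspring rule (parent fitness plus a draw from $Z = \mathcal{N}(-\delta, 1)$) and the plus-selection come from the abstract.

```python
import random

def simulate_mu_plus_one(mu, delta, steps, seed=0):
    """Empirical gain in best fitness per step for a (mu+1) process.

    Each step picks a parent uniformly at random (an assumption), creates an
    offspring with fitness parent + Z where Z ~ N(-delta, 1), and then removes
    the worst of the mu+1 individuals.
    """
    rng = random.Random(seed)
    population = [0.0] * mu
    for _ in range(steps):
        parent = rng.choice(population)
        child = parent + rng.gauss(-delta, 1.0)
        population.append(child)
        population.remove(min(population))  # plus-selection: drop the worst
    return max(population) / steps

# Hypothetical parameters, chosen so that mu <= e^delta.
rate_1 = simulate_mu_plus_one(mu=1, delta=2.0, steps=20000)
rate_5 = simulate_mu_plus_one(mu=5, delta=2.0, steps=20000)
```

Because the best individual is never removed, the measured rate is non-negative. Comparing rates for different values of `mu` gives only a noisy empirical impression and is no substitute for the rigorous bounds in the paper.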

Non-Parametric Dual-Manifold Mapping via 8-Bit Bounded Transformation Matrices: Challenging FP-centric Hardware Paradigms in Low-Energy AI

Lars Kopp — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13328v1 Announce Type: new Abstract: Modern deep learning hardware paradigms rely heavily on computationally expensive floating-point arithmetic (FP32, FP16, and FP8), requiring massive thermal and energetic overheads to maintain gradient-based optimization. This paper introduces a non-parametric, training-free computational framework for dual-manifold mapping that operates strictly within an 8-bit signed integer boundary and leverages simple bitwise and accumulation logic. By mapping a Spatial Manifold (N_spatial = 8192 neurons) and a Gabor-pooled Structural Manifold (N_structural = 4096 neurons) through an integer-based transformation matrix (Z-matrix), we eliminate the need for floating-point multipliers. Inference is achieved via cache-friendly pointer offsets and bitwise masks, accumulating directional sign-charges using fixed thresholds (theta_reject = 8.0, theta_cut = 2.0). Learning is executed through a localized, bounded update mechanism restricted strictly within [-127, 127], modulated by stochastic noise injection. Both architectures demonstrate extreme holographic resilience, preserving near-perfect reconstruction via a global scaling factor under 90% truncation sparsity and 20% random node destruction. By reducing core AI inference to 8-bit boundaries and boolean-like execution, this framework outlines a paradigm shift toward neuromorphic edge-computing, directly questioning the long-term necessity of dense, floating-point-centric GPU accelerators.

Work Stealing for the 2D-Mesh Topology of Satellite Constellations in Low Earth Orbit

Mia Reitz, Dorian Chenet, Jonas Posner — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13329v1 Announce Type: new Abstract: Asynchronous Many-Task (AMT) is a parallel programming model used in High Performance Computing (HPC). An AMT runtime can distribute fine-grained tasks across processing units called workers, through work stealing: when a worker has no tasks left to process, it tries to steal tasks from other workers. Workers are not restricted to a single compute node but can also be distributed across multiple nodes of an HPC cluster. Existing AMT runtimes assume a fully connected network with low, uniform latency and perform global work stealing, selecting another worker at random from all workers in the system. Space Edge Computing (SEC) uses constellations of satellites in Low Earth Orbit (LEO) as distributed compute clusters. Unlike HPC clusters, LEO satellites communicate through inter-satellite links that form a sparse mesh topology. Reaching a distant satellite requires multiple hops, each adding latency. As a step toward adapting AMT to SEC, this paper proposes a neighbor-only work stealing strategy in which workers steal exclusively from directly connected neighbors, avoiding multi-hop communication. An analytical model shows that restricting stealing this way yields a per-attempt latency advantage that grows with constellation size. Preliminary experiments on an HPC cluster with an emulated mesh over uniform low-latency links isolate the effect of victim selection: the neighbor-only strategy performs within ~2.2% of global stealing on both balanced and irregular workloads, indicating that restricting the victim set does not harm load balancing in this setting. Taken together, the experiments suggest that neighbor-only stealing can be on a par with global stealing, and the model suggests that neighbor-only stealing becomes preferable at scale.
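The difference between the two victim-selection strategies can be sketched as follows. This is an illustrative model only: a rectangular mesh without wrap-around links is assumed (real constellations may wrap around), workers are numbered row by row, and the function names are hypothetical rather than taken from any AMT runtime.

```python
import random

def neighbors(rank, rows, cols):
    """Directly connected workers of `rank` in a rows x cols mesh without wrap-around."""
    r, c = divmod(rank, cols)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [nr * cols + nc for nr, nc in candidates
            if 0 <= nr < rows and 0 <= nc < cols]

def hops(a, b, cols):
    """Number of mesh links on a shortest path between workers a and b."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    return abs(ra - rb) + abs(ca - cb)

def pick_victim(rank, rows, cols, strategy, rng):
    """Choose a steal victim: a direct neighbor, or any other worker (global)."""
    if strategy == "neighbor":
        return rng.choice(neighbors(rank, rows, cols))
    others = [w for w in range(rows * cols) if w != rank]
    return rng.choice(others)

def mean_global_hops(rows, cols):
    """Average hop count of a steal attempt under uniform global victim selection."""
    n = rows * cols
    total = sum(hops(a, b, cols) for a in range(n) for b in range(n) if a != b)
    return total / (n * (n - 1))
```

A neighbor-only steal always costs one hop in this model, whereas `mean_global_hops` grows with the mesh dimensions, which mirrors the per-attempt latency advantage the analytical model in the abstract refers to.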

OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13332v1 Announce Type: new Abstract: Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to modeling this environment uses scene graphs as an interpretable representation of OR interactions. Converting their frame-wise relational predictions into temporally extended, fine-grained actions, however, is challenging without explicit temporal modeling. To enable a principled temporal evaluation of current OR understanding methods, we introduce the first action-centric benchmark built on a publicly available ego-exocentric OR dataset by defining a fine-grained, multi-role action taxonomy and generating dense action segments via distillation from ground-truth scene graph state changes. Experiments on this benchmark show that current scene graph prediction methods struggle to model temporal structure, even when adding explicit modeling through Graph Neural Networks. We therefore introduce a vision-only temporal model that outperforms graph-based methods significantly when using all available egocentric video as input. Building on this model, we also introduce a novel multi-view to single-view feature alignment strategy that improves single-view performance on multi-role action recognition, mitigating the need for extensive egocentric video capture. Benchmark and code will be released upon acceptance.

Measurement-Based Performance Evaluation of SmartRSUs with Heterogeneous Antenna Architectures for V2X Communications

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13334v1 Announce Type: new Abstract: This paper presents a measurement-based performance evaluation of two custom Smart Roadside Units (SmartRSUs) featuring different V2X antenna architectures. The first configuration integrates GNSS and communication antennas into an all-in-one rooftop module, whereas the second uses external dual ITS-G5 (IEEE 802.11p) antennas operating at 5.9 GHz and a dedicated GNSS antenna. Both systems are built upon a proprietary On-Board Unit (OBU) platform adapted for infrastructure deployment. The experimental campaign evaluates key V2X communication metrics, including coverage, received signal strength indicator (RSSI), packet loss, and end-to-end latency in both transmission (OBU-to-infrastructure) and reception (infrastructure-to-OBU) directions. To ensure objective validation, a commercial off-the-shelf V2X Roadside Unit is co-located on the same infrastructure and used as a performance benchmark, providing ground-truth reference measurements under identical environmental conditions. Results highlight the impact of antenna design and placement on communication reliability and latency, revealing trade-offs between integrated and external antenna configurations in real-world deployment scenarios. The findings provide practical insights for the design and optimization of next-generation SmartRSUs in cooperative intelligent transportation systems (C-ITS).

Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

Kaijie Xu, Anqi Wang, Xilin Dai — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13338v1 Announce Type: new Abstract: Probabilistic forecasting models are increasingly deployed on multivariate systems with distinct channel physics and operational constraints, but existing benchmarks evaluate neither property at scale. Public canonical multivariate benchmarks cap out at 2,000 channels, while power-system benchmarks either lack temporal structure or probabilistic evaluation. We introduce PowerPhase, a probabilistic forecasting benchmark built on six transmission grids ranging from 2,000 to 36,964 jointly forecasted channels, more than an order of magnitude beyond popular canonical multivariate benchmarks. Each target trajectory is the output of an AC power-flow solve, and PowerPhase ships with constraint-aware metrics, including Safety_mBrier, NECV, and CVaR-alpha, that complement CRPS and Distortion. Across eight baselines and three seeds, distributional accuracy and constraint satisfaction rank models differently, a trade-off we term safety-fidelity. We further propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and a causal bridge between variable groups, which achieves the best average rank on every grid.

A Note About Algebraic $(s, t)$-Weak Tractability Of Linear Tensor Product Problems In The Worst-Case Setting

Zirong Liu, Heping Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13339v1 Announce Type: new Abstract: This paper is devoted to discussing the linear tensor product problems in the worst case setting. We consider algorithms that use finitely many evaluations of arbitrary continuous linear functionals. We investigate algebraic $(s, t)$-weak tractability (ALG-$(s, t)$-WT) under the absolute error criterion in the case ${\lambda}_1 > 1$, where ${\lambda}_1$ is the square of the univariate maximal singular value. We solve the problem by giving the necessary and sufficient conditions for ALG-$(s, t)$-WT on univariate singular values and fill the gap left open.

EMG-Based Adaptation of Anisotropic Virtual Fixtures for Robot-Assisted Surgical Resection and Dissection

Dario Onfiani, Michael Dyck, Luigi Biagiotti, Julian Klodmann — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13340v1 Announce Type: new Abstract: In this paper, we address the development of an adaptive assistance system for robot-assisted laparoscopic surgery, specifically for delicate tasks such as Resection and Dissection. Even if Virtual Fixtures offer significant advantages for guiding a surgeon's movements, conventional Virtual Fixtures are often defined by fixed geometries, lacking the flexibility to adapt to the surgical workflow or the surgeon's immediate intent. To address these limitations, we propose a novel framework for an adaptive and anisotropic virtual fixture. In addition, we introduce an intuitive control interface that modulates the fixture's geometry in real-time based on the surgeon's intent, inferred from EMG signals. This approach allows the surgeon to dynamically expand or disengage the constraint by contracting their forearm muscles, enabling seamless transitions between precise guided motion and free repositioning of the tool. Experimental results from a pilot user study, based on a standardized surgical training task, demonstrate the effectiveness of the proposed method. The system showed significant improvements in task accuracy and movement consistency, alongside a reduction in perceived cognitive load, effort, and frustration.

Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis

Gabriel Steele, Alzahra Altalib, Alessandro Perelli — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13341v1 Announce Type: new Abstract: We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based approaches often operate solely in the spatial domain and ignore geometric consistency, resulting in limited structural fidelity. DDE-GAN addresses these challenges by jointly learning from both spatial and frequency (Fourier) domains, capturing complementary anatomical and spectral information. Furthermore, the rotational equivariance embedded in the physics of the CT and PET measurements is integrated into the losses of both the generator and the discriminator to ensure consistent responses under rotations, improving anatomical accuracy. A hierarchical dual-domain training strategy enforces intra- and inter-domain consistency through multi-stage loss functions. Evaluated on the HECKTOR 2022 CT-PET dataset, DDE-GAN achieves superior synthesis quality over baseline models for CT-PET image synthesis. The results demonstrate that combining dual-domain learning with geometric equivariance substantially enhances multimodal image synthesis accuracy and robustness, enabling practical applications in PET completion and data augmentation.

Improved Runtime Bound for the $(\mu + 1)$ EA on BinVal

Joris Belder, Johannes Lengler, Raghu Raman Ravi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13344v1 Announce Type: new Abstract: We study the $(\mu+1)$ EA on the Binary Value function BinVal. We show that it needs at most $O(\mu \log \mu \cdot n \log n)$ function evaluations to find the optimum when $\mu = o(n/\log n)$. This substantially improves upon the recent upper bound of $O(\mu^5 n \log(n/\mu^4))$ by Krejca, Neumann and Witt. Our results hold for several mutation operators including standard bit mutation. In particular, our bound implies that the $(\mu+1)$ EA is at most a factor $O(\log \mu \cdot \log n)$ slower on BinVal than on OneMax.
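For readers unfamiliar with the setting, the algorithm and benchmark function can be sketched as below. This is a minimal illustration under common textbook conventions, which may differ in detail from the variant analysed in the paper: uniform parent selection, standard bit mutation with rate $1/n$, and removal of the first worst individual on ties are assumptions here.

```python
import random

def binval(bits):
    """BinVal: read the bit string as a binary number, first bit most significant."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value

def mu_plus_one_ea(n, mu, max_evals, seed=0):
    """Run a (mu+1) EA on BinVal; return evaluations used, or None if the budget runs out."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    optimum = 2 ** n - 1
    evals = mu
    best = max(binval(x) for x in population)
    while best < optimum and evals < max_evals:
        parent = rng.choice(population)
        # Standard bit mutation: flip each bit independently with probability 1/n.
        child = [1 - b if rng.random() < 1.0 / n else b for b in parent]
        evals += 1
        best = max(best, binval(child))
        population.append(child)
        population.remove(min(population, key=binval))  # plus-selection
    return evals if best == optimum else None

# Hypothetical small instance.
evals_used = mu_plus_one_ea(n=10, mu=3, max_evals=100000)
```

Since plus-selection never discards the best individual, tracking `best` incrementally is equivalent to rescanning the population each iteration.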

JointEdit3D: Feed-Forward 3D Scene Editing in a Unified Latent Space

Xinnan Zhu, Ruijie Xu, Jiayu Ying, Daoguo Dong, Jiachen Xu, Yuan Xie, Xin Tan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13345v1 Announce Type: new Abstract: Existing 3D scene editing methods typically rely on per-scene optimization over explicit 3D representations or cascaded edit-and-reconstruct pipelines, resulting in high test-time cost, limited 3D awareness, and structural inconsistencies. To couple appearance synthesis and geometry prediction during editing, we build on a unified RGB-geometry reconstruction-generation latent space and adapt it to feed-forward 3D scene editing. The resulting framework, JointEdit3D, performs asymmetric latent inpainting by observing only a single edited RGB reference latent and generating the remaining RGB views and edited geometry latent under source-scene anchoring. JointEdit3D introduces a dedicated SceneAnchor Branch to inject source-scene structure without forcing direct copying, and adopts edit/background-aware losses to balance edited-region fidelity with unedited-content preservation. To address the lack of paired resources for standardized 3D scene editing evaluation, we introduce SceneEdit3D-15K, a dataset with 15K paired editing samples and renderer-provided 3D annotations, together with SceneEdit3D-Bench, a curated 100-sample benchmark. Experiments show that JointEdit3D improves edited-region quality and 3D structural completeness over prior baselines while maintaining competitive background preservation.

Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling

Jagriti Singh, Shekhar Verma, Muneendra Ojha — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13347v1 Announce Type: new Abstract: Diffusion models have emerged as state-of-the-art generative models for high-fidelity image synthesis, particularly in their classifier-free guided and classifier-guided forms. However, standard classifier guidance concentrates probability mass around high-density class means, leading to poor coverage of rare samples in the tails of the class-conditional distributions. Recent work on diffusion-based tail sampling mitigates this by training an additional low-density-seeking classifier with a synthetic-vs-real discriminator, at the cost of additional networks and training. In parallel, a number of samplers and distillation techniques accelerate or refine diffusion sampling, but do not explicitly address long-tail coverage. We propose a purely sampling-time, density-aware extension of classifier-guided conditional diffusion models that targets low-density regions without any additional training. Unlike most diffusion models, we apply guidance to the noisy images rather than to the predicted noise. Starting from a pretrained conditional diffusion model and classifier on ImageNet, we modify the guided reverse dynamics by steering trajectories toward low-confidence regions via the modified classifier gradient, and at each time step, we also guide the sampling process toward the predicted real image. The first guidance term helps explore low-probability samples, and the second keeps generated samples close to the real data manifold. The proposed sampler consistently improves ADM model recall at 64x64 resolution while maintaining a comparable FID; for a 256x256 ADM model, we show results visually for different combinations of the two guidance terms. We also show that standard ADM classifier guidance, combined with predicted-real-image guidance, helps generate samples of high perceptual quality with a 256x256 ADM model on ImageNet.

IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds

Micaela Vaucher, Santiago Silveira, Santiago Góngora, Luis Chiruzzo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13348v1 Announce Type: new Abstract: Computational creativity in Interactive Fiction faces a fundamental tension: Large Language Models (LLM) may produce creative narratives but struggle with world coherence, while symbolic systems ensure consistency but lack creative flexibility. We present IVIE (Incremental & Validated Interactive Experiences), a neuro-symbolic approach to generating complete and playable interactive fiction worlds from scratch. Building upon PAYADOR's neuro-symbolic framework, IVIE implements a four-stage incremental generation pipeline that delegates creative decisions--setting and character creation, puzzle design--to LLMs while grounding the world state through symbolic validation. The system generates worlds with interconnected locations, functional items, non-player characters, and coherent puzzles, all structured around a central goal-oriented architecture. Human evaluation shows the approach generates immersive, thematically coherent worlds with high player engagement. Results seem to indicate that the neuro-symbolic approach successfully balances flexibility with narrative coherence: symbolic validation grounds LLM generation without eliminating generative freedom. However, challenges remain: LLM inconsistencies occasionally bypass puzzle constraints, and objective validation gaps allow some structurally impossible goals. We identify key design considerations for future neurosymbolic interactive storytelling systems, particularly regarding LLM capabilities and their limitations.

From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

Haishuo Fang, Yue Feng, Iryna Gurevych — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13349v1 Announce Type: new Abstract: Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods with much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16% relatively. It also attains the highest win rates against baselines in human evaluation.

Low cost, easily manufactured, highly flexible strain and touch sensitive fiber for robotics applications

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13352v1 Announce Type: new Abstract: Existing stretch and touch sensors for robots are generally expensive with respect to at least one of material costs, required manufacturing equipment, or manufacturing time. We present and experimentally characterize a conductive fiber made using only inexpensive commercial off-the-shelf parts (conductive thread at $0.07/ft, silicone tubing at $0.94/ft) and tools (loop-style needle threader at $2), which can be manufactured quickly (a 20 cm length in 2 minutes). We demonstrate its use as a resistive strain sensor with three applications: triggering a grasp in a pneumatically actuated assistive finger, sensing the pose of a pneumatically actuated robotic strap, and estimating the pose of a flexible solid. We also demonstrate that it can be used as a capacitive sensor with two applications: first, as a touch sensor that triggers a commercial robot arm to move, and second, as a near-field sensor enabling the robot arm to follow a moving hand. The capacitive sensors are knitted, showcasing the high flexibility of the fiber. We discuss methods for improving manufacturing scalability and their cost trade-offs. Finally, we demonstrate a method for repairing a cut fiber.

SupraSNN: Exploiting Synapse-Level Parallelism in Spiking Neural Network Accelerators through Co-Optimized Mapping and Scheduling

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13354v1 Announce Type: new Abstract: Spiking Neural Networks (SNNs) offer a brain-inspired path toward highly efficient computation, but their practical deployment is constrained by the challenge of managing and executing their massive parallelism on physical hardware. This problem mirrors the historical challenge in processor design of moving beyond serial execution, a barrier broken by superscalar architectures that dispatch multiple instructions to parallel functional units. Drawing inspiration from this paradigm, we introduce a hardware-software co-design framework that treats synaptic events as parallelizable micro-operations. We present SupraSNN, a superscalar-inspired architecture that achieves high synapse-level parallelism by physically decoupling synaptic and neuronal computations. Within this architecture, a Multi-Cast Tree routes spike data to multiple parallel Synapse Processing Units that serve as the computational pipelines, while a Merge Tree consolidates distributed results for processing by a unified Neuron Unit, deliberately centralizing complex neuron state dynamics to mitigate hardware overhead and resource duplication. The efficacy of this architecture is enabled by a partitioning and scheduling framework that first maps the SNN onto hardware while respecting memory constraints and then applies heuristic scheduling to determine the synaptic execution order, maximizing throughput and resource utilization. Implementing a feedforward SNN trained on MNIST (93.44% accuracy), SupraSNN achieves 149 $\mu$s inference latency and 0.025 mJ per image (0.276 nJ per synapse) on the Xilinx Zynq XC7Z020 FPGA, delivering 47.6% lower latency and 5.6$\times$ better energy efficiency than prior FPGA-based SNN accelerators. Beyond vision tasks, a recurrent SNN on the Spiking Heidelberg Dataset (71.82% accuracy) achieves 1.41 ms latency and 0.77 mJ per sample on XC7Z030.

Real-Time Execution with Autoregressive Policies

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13355v1 Announce Type: new Abstract: Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution primarily focuses on variants of diffusion policies, even though it is more critical for autoregressive policies given their slower rollout speed in synchronous inference. In contrast, we demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while achieving significantly faster task completion than with synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can remain a competitive policy type supporting real-time execution.

Linear convergence of iterative contour integral-based eigensolvers for nonlinear eigenvalue problems

Daniel Kressner, Yuqi Liu, Jose E. Roman, Meiyue Shao, Nian Shao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13357v1 Announce Type: new Abstract: Solving nonlinear eigenvalue problems is an important and challenging task in scientific computing. Contour integral-based approaches are attractive for such eigenvalue problems because they reliably target all eigenvalues in a prescribed domain. However, unlike in the linear case, many traditional methods of this type, such as Beyn's method, lack an inherent iterative refinement mechanism. Consequently, achieving high accuracy requires high-quality quadrature rules for approximating the contour integral, which often leads to prohibitive computational costs. A notable exception is the so-called NLFEAST algorithm, which combines contour integral techniques with a nonlinear Rayleigh--Ritz extraction step. In this work, we propose a general framework of iterative contour integral-based methods for nonlinear eigenvalue problems that includes NLFEAST. This allows us to prove linear convergence of NLFEAST under mild assumptions and also explains why certain nonlinear eigensolvers do not combine well with iterative methods. Numerical experiments confirm our theoretical findings; in particular that NLFEAST can achieve high accuracy even with a limited number of quadrature nodes, significantly outperforming Beyn's method on challenging problems.

Sizing of a grid-forming power converter to improve the small-signal stability of an LCC-HVDC system connected to a weak grid

Anup Joshi, Javier Renedo, Xavier Guillaud — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13358v1 Announce Type: new Abstract: Line-commutated converter high-voltage direct current (LCC-HVDC) has proven to be a reliable technology for bulk power transmission over long distances. However, the growing penetration of converter-interfaced generation (CIG) is resulting in weaker AC grids, rendering the operation of LCC-HVDC systems vulnerable and posing a serious challenge to their stability. Grid-forming (GFM) controlled voltage source converters (VSCs) have been shown to have a stabilizing effect in weak grid conditions. However, the impact of GFM-controlled VSCs (GFM-VSCs) on the stability of LCC-HVDC in weak grid conditions has not been studied in depth in the literature. In this paper, a simplified model of LCC-HVDC is proposed and validated. Then a small-signal state-space model of a system consisting of the aforementioned LCC-HVDC, a GFM-VSC, and an infinite grid is developed to study the interactions between the different components. The small-signal stability analysis shows the stabilizing effect of the GFM-VSC on the LCC-HVDC link in weak grid conditions. Furthermore, the study on the sizing of the GFM power converter reveals that, in the test system analyzed in this study, even a modest share of GFM converter capacity relative to the total nominal apparent power (the sum of the nominal power of the LCC-HVDC and the nominal apparent power of the GFM-VSC) is sufficient to ensure the stability of the system. This work focuses only on small-signal stability; other stability phenomena should also be taken into account when selecting the final size of the GFM-VSC.

The $(1 + 1)$-EA in Dynamic Environments

Georg Hasebe, Johannes Lengler, Raghu Raman Ravi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13360v1 Announce Type: new Abstract: We study the $(1 + 1)$-EA in dynamic linear environments, where in every generation selection is performed with respect to a freshly sampled linear function with positive weights. We consider the Dynamic Binary Value problem, where each generation uses a uniformly random permutation of $1,2,4,\dots,2^{n-1}$, and a Uniform weight variant, where the weights are drawn independently from $\mathrm{Unif}(0,1)$. Both of them have recently been integrated into the IOHprofiler platform and empirically studied. For both models we prove a sharp threshold in the mutation parameter $\chi$ for mutation rate $\chi/n$. Below the threshold, the expected optimisation time is $\mathcal{O}(n\log n)$, whereas above it the runtime becomes $2^{\Omega(n)}$. For the Dynamic Binary Value problem in the exponential regime, we also quantify at what distance from the optimum the optimisation process stagnates. We show that there is a second threshold: a distance that is efficiently reached, but reaching any smaller distance takes exponential time. This quantifies and proves previous empirical findings.
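
The Dynamic Binary Value setting described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `dynamic_binval_ea`, the random initial bit string, the acceptance of ties, and the generation budget are all assumptions made for the example. Only the weight model (a fresh uniformly random permutation of $1,2,4,\dots,2^{n-1}$ per generation) and the mutation rate $\chi/n$ come from the abstract.

```python
import random


def dynamic_binval_ea(n, chi, max_generations, seed=0):
    """Run a (1 + 1)-EA on Dynamic Binary Value (illustrative sketch).

    Returns (generations_used, final_bit_string).
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]  # assumed: uniform random start
    weights = [2 ** i for i in range(n)]

    for generation in range(max_generations):
        if sum(x) == n:  # the all-ones string is optimal for every positive weighting
            return generation, x

        # Standard bit-wise mutation: flip each bit independently with rate chi / n.
        y = [bit ^ 1 if rng.random() < chi / n else bit for bit in x]

        # Fresh uniformly random permutation of the weights for this generation.
        rng.shuffle(weights)
        fx = sum(w for w, bit in zip(weights, x) if bit)
        fy = sum(w for w, bit in zip(weights, y) if bit)

        # Selection uses only the current generation's function (ties accepted).
        if fy >= fx:
            x = y

    return max_generations, x


generations, solution = dynamic_binval_ea(n=20, chi=1.0, max_generations=100_000, seed=1)
print(generations, sum(solution))
```

Running the sketch for several values of `chi` and growing `n` is one way to observe the qualitative difference between the two regimes, although small instances cannot by themselves confirm the asymptotic threshold.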

Can I Buy Your KV Cache?

Luoyuan Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13361v1 Announce Type: new Abstract: Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13364v1 Announce Type: new Abstract: We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers - velocity consistency and over-parameterized representation alignment - to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); on the real video datasets Fit3D and NBA, the method learns to generate motions consistently preferred by humans, with strong quantitative results.

Dual-Constrained Diffusion Image Compression for Operational Rate-Distortion-Perception Optimization

Sanxin Jiang, Jiro Katto, Heming Sun — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13366v1 Announce Type: new Abstract: The rate-distortion-perception (RDP) trade-off extends classical rate--distortion theory by imposing a distributional constraint on reconstructions, providing a unified framework for neural image compression that jointly governs fidelity and perceptual realism. While prior work achieves near-optimal rate--perception trade-offs, practical frameworks explicitly realizing the full RDP surface remain scarce, primarily due to the difficulty of introducing common randomness at the decoder. We propose DCIC (Dual-Constrained Diffusion Image Compression), which integrates a learned codec with a diffusion-based decoder governed by joint distortion and idempotence constraints. The distortion constraint bounds reconstruction fidelity relative to the base codec output; the idempotence constraint -- requiring that re-encoding the restored image recovers the base codec reconstruction -- serves as a tractable surrogate for the distributional perception requirement. Together, they steer the reverse denoising process via iterative optimization with consistent noise injection, realizing common randomness without additional rate overhead. At fixed rate, dual attenuation factors $(K_D, K_P)$ jointly navigate the Pareto frontier of the distortion-perception plane, enabling continuously adjustable fidelity-realism trade-offs from a single bitstream. DCIC$_{RD}$ ($K_P{=}0$) and DCIC$_{RP}$ ($K_D{=}0$) arise as boundary curves, with DCIC$_{RDP}$ ($K_D = K_P=1$) realizing the optimal interior operating point. Experiments on CelebA-HQ, CLIC2020, and ImageNet-1K across CNN, Transformer, and hybrid architectures confirm that DCIC$_{RDP}$ achieves superior BD-PSNR over all perceptual codecs, while DCIC$_{RP}$ matches dedicated perception-oriented methods in BD-FID, validating the practical value of full RDP surface navigation.

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13368v1 Announce Type: new Abstract: Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.

A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget

Joe Dwyer — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13370v1 Announce Type: new Abstract: This study examines training dynamics in a small Llama-style language model trained under a fixed, compute-constrained token budget. Rather than evaluating efficiency solely through endpoint performance, the study uses a quantitative experimental repeated measures design to analyze how validation loss, validation perplexity, rolling volatility, backslide behavior, spike behavior, and between-seed variability change across token-based training intervals. Six independent training runs were conducted on a 4.26-million-parameter model using the TinyStories corpus, CPU-based full-precision training, and a target budget of approximately 20 million cumulative training tokens. Metrics were collected across 21 intervals, producing 126 seed-by-interval observations. Repeated measures ANOVA showed statistically significant interval effects for validation loss, validation perplexity, and rolling volatility. Descriptive trajectories revealed rapid early improvement followed by non-monotonic degradation during later training intervals. Mean validation loss decreased from 8.3552 at initialization to 2.7996 near 4 million tokens, but increased to 3.9010 by the final checkpoint. Validation perplexity followed the same pattern, falling sharply early in training before rising later. Derived telemetry further showed recurrent validation-loss backslides and no interval-summary evidence of a stable phase under the predefined criteria. These findings suggest that compute-aware language model evaluation should examine training trajectories rather than endpoint metrics alone. In constrained compute settings, additional token exposure may increase computational cost without producing proportional generalization gains, and interval-level telemetry can reveal instability, regression, and diminishing returns that final metrics may obscure.
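
As a quick sanity check on the reported figures, the snippet below converts the quoted mean validation losses into perplexities and recomputes the observation count. It assumes the loss is a mean token-level cross-entropy in nats, so that perplexity is `exp(loss)`; the abstract does not state this, and the exponential of a mean loss across seeds generally differs from the mean of per-seed perplexities, so the values are indicative only.

```python
import math

# Mean validation losses quoted in the abstract.
reported_losses = {
    "initialization": 8.3552,
    "near 4M tokens": 2.7996,
    "final checkpoint": 3.9010,
}


def perplexity(loss_nats):
    """Perplexity implied by a cross-entropy loss measured in nats (assumed unit)."""
    return math.exp(loss_nats)


implied_perplexity = {name: perplexity(loss) for name, loss in reported_losses.items()}

# Six seeds measured at 21 intervals give the stated number of observations.
observations = 6 * 21

for name, value in implied_perplexity.items():
    print(f"{name}: {value:.2f}")
print("seed-by-interval observations:", observations)
```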

Temporal Conductance and Bounds on the Voter Model for Dynamic Networks

Tatiana Rocha Avila, Holger Dell, John Lapinskas — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13374v1 Announce Type: new Abstract: The voter model is a classical stochastic process that models how opinions might spread through a network: at each step, every node lazily adopts the opinion of a random neighbour; eventually all nodes share the same opinion (consensus). Stronger connectivity should yield faster consensus. Berenbrink, Giakkoupis, Kermarrec, and Mallmann-Trenn (ICALP 2016) make this precise via the network's conductance: if the network has $m$ edges, minimum degree $d_{\min}$, and conductance at least $\phi$, then the voter model reaches consensus in expected $O(m/(d_{\min}\phi))$ steps. Their results extend to dynamic networks with fixed vertex degrees by considering the network's conductance at each time step. We introduce temporal conductance $\Phi$, a more general connectivity measure for dynamic networks. Unlike static conductance, which collapses to $0$ whenever some snapshot is disconnected, $\Phi$ captures connectivity through edges that appear at different times. We generalise the results of Berenbrink et al. from static conductance to temporal conductance, showing that the expected consensus time of the standard voter model is at most $O(m/(d_{\min}\Phi))$. Moreover, we prove that this bound is tight up to constant factors. We expect temporal conductance to be a useful primitive for analysing other dynamics on temporal networks, and potentially time-inhomogeneous Markov chains more generally.
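
The process itself is easy to simulate. The sketch below runs a synchronous lazy voter model on a sequence of graph snapshots; it is an illustration under stated assumptions rather than the paper's construction. The laziness probability of 1/2, the cyclic repetition of snapshots, the distinct initial opinions, and the function name `lazy_voter_consensus` are all choices made for the example.

```python
import random


def lazy_voter_consensus(snapshots, max_steps, seed=0):
    """Simulate the lazy voter model on a dynamic network (illustrative sketch).

    snapshots: list of adjacency lists, one per time step, cycled if shorter than the run.
    Returns (steps_until_consensus, winning_opinion), or (None, None) if the budget runs out.
    """
    rng = random.Random(seed)
    n = len(snapshots[0])
    opinions = list(range(n))  # assumed: every node starts with its own opinion

    for step in range(max_steps + 1):
        if len(set(opinions)) == 1:
            return step, opinions[0]

        adjacency = snapshots[step % len(snapshots)]
        updated = opinions[:]
        for node, neighbours in enumerate(adjacency):
            # Lazy step (assumed probability 1/2): otherwise keep the current opinion.
            if neighbours and rng.random() < 0.5:
                updated[node] = opinions[rng.choice(neighbours)]
        opinions = updated

    return None, None


# Two snapshots on 4 nodes; each is disconnected on its own, but their union is a cycle.
snapshot_a = [[1], [0], [3], [2]]
snapshot_b = [[3], [2], [1], [0]]
print(lazy_voter_consensus([snapshot_a, snapshot_b], max_steps=100_000, seed=3))
```

The two-snapshot example mirrors the motivation in the abstract: every individual snapshot is disconnected, so its static conductance is 0, yet opinions can still spread through edges that appear at different times.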

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang, Shengfeng He, Jing Li — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13376v1 Announce Type: new Abstract: We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360$^\circ$ panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition

Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13379v1 Announce Type: new Abstract: Memristors offer a new opportunity for resource-efficient computation of neural models for natural language processing by enabling analog execution of vector-matrix multiplication. Yet, computations on these devices are currently subject to larger distortion, both in weight programming and execution. In this work, we identify large output values of transformed positional encodings as a cause of major degradation within analog-to-digital conversion (ADC) as part of memristor-based computation. By adjusting the proportion of weight and precision bits of the ADC of specific memristor layers, we reduce the degradation of the execution by ~50% relative, while keeping the estimated energy consumption stable. Additionally, we investigate scenarios where the ADC cannot be modified. In that case, the degradation can be reduced by ~30% relative after removing encoding-related linear transformations.

Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs

Huyen Vo, María Martínez-García, Isabel Valera — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13381v1 Announce Type: new Abstract: Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence, i.e., they struggle to generate realistic and diverse samples that are, at the same time, semantically consistent across modalities. Recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the SOTA MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first implementation of Hölder pooling without any approximation for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves the generative quality-coherence trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.

SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation

Zian Yang, Zixin Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13382v1 Announce Type: new Abstract: Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usually either rely on global content-style modeling, which is robust but imperfectly disentangled, or emphasize component/local modeling, which captures fine details but relies heavily on local priors and reference coverage. We argue that the key challenge is not merely to learn purer conditions, but to organize complementary yet biased global and local conditions through multi-level allocation during generation. To this end, we propose SmartFont, a diffusion-based few-shot font generation framework that combines global content-style generation with weakly supervised local corrective experts. The local branch performs semantic-spatial allocation by learning expert-wise local concepts and semantically meaningful spatial maps under weak component supervision, enabling fine-grained correction without requiring explicit component-conditioned inference. On top of this, a denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Extensive experiments show that SmartFont achieves a better global-local balance and improves glyph quality and local detail fidelity.

Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13385v1 Announce Type: new Abstract: Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the nuanced distribution of resulting harms. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce \sysname, a stakeholder-centric benchmark to systematically categorize and attribute harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes the attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user's delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. The benchmark is available at https://github.com/StakeBench/SBC.

Structuring Transparency: Developing Domain-Specific Generative AI Declaration Frameworks in Higher Education

Nicholas Micallef, Olga Petrovska — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13389v1 Announce Type: new Abstract: As Generative AI (GenAI) disrupts higher education, institutions increasingly require students to declare AI use. However, generic, binary declarations (e.g., "I used GenAI") fail to capture the nuanced application of these tools in different academic tasks. Establishing transparency is key to protecting academic integrity, promoting AI literacy, and shifting the focus from policing to professional practice. In response, this paper contributes a design artefact and an accompanying position: a framework of two task-specific declaration structures, one for writing-focused activities and one for coding assessments, developed for a Computer Science department on the basis of an existing taxonomy of GenAI usage, together with an argument that task-specific disclosure is needed to move beyond binary declarations. By categorising AI usage across specific cognitive and developmental stages, such as structural planning vs. textual content generation, or code improvement vs. code generation, the framework encourages students to reflect on their own learning process and clarifies the boundary between acceptable assistance and academic misconduct. We propose this domain-specific approach as a foundation for fostering more honest assessment in Computer Science and other disciplines, aiming to better prepare students for professional environments where documenting GenAI workflows might be an essential job requirement.

Experimental Insights into UDP-Based Video and Control Traffic over IEEE 802.11p ITS-G5

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13390v1 Announce Type: new Abstract: Vehicular applications such as cooperative driving, teleoperation, and real-time perception increasingly rely on low-latency wireless communication. In this context, ITS-G5, based on IEEE 802.11p, represents a key technology for enabling direct vehicle-to-vehicle and vehicle-to-infrastructure communication. Despite its relevance, experimental studies focusing on the performance of UDP-based traffic over IEEE 802.11p under realistic conditions remain limited. This paper presents an experimental evaluation of UDP transmission over an IEEE 802.11p ITS-G5 testbed composed of Raspberry Pi-based onboard units and commercial roadside units. The analysis investigates the impact of different modulation and coding schemes (MCS). It also evaluates two network-layer configurations (IPv4 unicast and IPv6 multicast) and the use of CAKE for active queue management. In addition to synthetic traffic generated with iPerf, the evaluation includes real-time video streaming using MPEG-TS over UDP to emulate latency-sensitive vehicular applications. Results show that the modulation scheme is the dominant factor influencing latency at low traffic loads, while the choice of transmission mode and IP version becomes increasingly significant under congested conditions. Higher-order modulations significantly reduce latency and variability, whereas IPv6 multicast exhibits greater delay dispersion than IPv4 unicast. Furthermore, active queue management does not seem to improve delay predictability. These findings provide practical insights for configuring ITS-G5 networks supporting latency-sensitive vehicular services.

MiniMax Sparse Attention

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13392v1 Announce Type: new Abstract: Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
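
The select-then-attend pattern described above can be illustrated with a small pure-Python sketch for a single query vector. This is not the MSA kernel: the block score used here (dot product between the query and the block's mean key) is a hypothetical stand-in for the Index Branch, the function names are invented for the example, and the per-GQA-group selection and GPU co-design are omitted.

```python
import math


def dense_attention(query, keys, values):
    """Exact softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Score key-value blocks, keep the top_k blocks, then attend exactly over them."""
    n_blocks = (len(keys) + block_size - 1) // block_size
    blocks = [
        range(b * block_size, min((b + 1) * block_size, len(keys)))
        for b in range(n_blocks)
    ]

    def block_score(indices):
        # Hypothetical index score: query dotted with the block's mean key.
        mean_key = [sum(keys[i][j] for i in indices) / len(indices) for j in range(len(query))]
        return sum(q * k for q, k in zip(query, mean_key))

    ranked = sorted(range(n_blocks), key=lambda b: block_score(blocks[b]), reverse=True)
    chosen = sorted(ranked[:top_k])  # restore positional order of the kept blocks
    selected = [i for b in chosen for i in blocks[b]]
    return dense_attention(query, [keys[i] for i in selected], [values[i] for i in selected])


keys = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [5.0, 0.0]]
values = [[1.0], [1.0], [2.0], [2.0]]
print(block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=1))
```

When `top_k` equals the number of blocks the sketch reduces to dense attention, which gives a simple correctness check; the compute saving comes from keeping `top_k` small relative to the number of blocks at long context.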

GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

Xiangyu Zhu, Renjun Wu, Luzhou Ge, Jinyan Liu, Xuesong Li — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13394v1 Announce Type: new Abstract: Whole-body mobile manipulation requires coordinating mobile base and manipulator under shifting viewpoints, posing challenges in geometric perception and action generation. Current policies either rely on 2D features or sparse 3D representations that lack dense spatial structure, and typically encode arm and base within one action vector that ignores their distinct control demands. Moreover, existing dense fusion strategies risk corrupting pretrained representations under noisy depth while incurring heavy computational overhead. We present GeoHAT, an end-to-end diffusion-based framework built on a simple principle: geometry should be injected only where reliable and attended to only where needed. GeoHAT employs a lightweight Fourier spatial encoder that maps dense per-pixel 3D coordinates into geometric tokens without an additional 3D vision backbone. These tokens are then selectively injected into vision foundation model features through per-token gated fusion modulated by depth validity, preserving the semantic prior while enriching spatial understanding. For action generation, a Hybrid Whole-Body Action Decoder decomposes arm and base into distinct subspaces and lets each action modality attend to its task-relevant visual context through sparse cross-attention, while causal temporal modeling captures intra-timestep coordination and inter-timestep dependencies. Experiments on the ManiSkill-HAB simulation benchmark demonstrate that GeoHAT achieves a 79.3% mean success rate, surpassing the strongest baseline by 23.7%. Furthermore, real-world experiments on diverse tasks also confirm consistent improvements over all baselines.

Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13397v1 Announce Type: new Abstract: Language operates as a mechanism of both marginalization and resistance, especially for minority communities navigating insensitive and harmful speech online. As content moderation increasingly depends on large language models (LLMs), concerns arise about whether these systems can recognize culturally insensitive speech: language that disregards or marginalizes the cultural and religious perspectives of historically underrepresented communities, often through implicit erasure, misrepresentation, or normative framing, rather than overt hostility. Focusing on Bangladesh's Hindu and Chakma communities -- the country's largest religious and Indigenous ethnic minorities, respectively -- this paper investigates the epistemic limits of LLM-based moderation systems and explores methods for incorporating minority perspectives. We co-created a culturally grounded corpus of insensitive speech with community members and integrated their narratives into moderation pipelines using retrieval-augmented generation (RAG). Our tool, Mod-Guide, improves LLM sensitivity to minority viewpoints by leveraging contextual cues derived from lived experience. Through mixed-method evaluations involving both minority and majority participants, we demonstrate that RAG-enhanced moderation responses are more contextually accurate and perceived differently across ethnic lines. This work advances research in human-computer interaction, AI ethics, and social computing by foregrounding restorative justice and hermeneutical inclusion in the design of content moderation systems.

PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13400v1 Announce Type: new Abstract: While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available at https://github.com/MJianM/PolyFlow.

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda

Alexander Rombach, Chantale Lauer, Nijat Mehdiyev — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13405v1 Announce Type: new Abstract: LLM-based agents are entering regulated industries where they automate judgment-intensive quality management processes. We argue that symbolic structures already embedded in these domains, including regulations, typed process models, and compliance constraints, should be treated not merely as external monitoring mechanisms but as core architectural components that shape the agent's decision-making and behavior. We propose compliance-by-construction as a complementary paradigm to guardrail-based monitoring: a structural foundation that prevents control-flow violations, while guardrails remain essential for catching semantic errors. We identify a structured set of neuro-symbolic research challenges at the foundational and capability levels and show that addressing them jointly enables compliance-by-construction. We call on the neuro-symbolic community to engage with regulated process automation as a high-impact research domain.

Optimizing Appliance Scheduling for Solar Energy Management Using Metaheuristic Algorithms

Hiba Ahmed, Alexander E. I. Brownlee, Jason Adair, Simon T. Powers — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13407v1 Announce Type: new Abstract: Renewable energy is essential for meeting future energy demands; however, solar energy generation, which occurs only during daylight hours, often does not align with household consumption patterns. Appliances such as cookers, washing machines, and dryers are typically operated according to user-preferred schedules rather than solar energy availability, creating a scheduling optimization problem. The objective is to determine optimal appliance start times to maximize renewable energy utilization while minimizing user inconvenience and adhering to system constraints. This paper presents a metaheuristic approach using Iterated Local Search (ILS) and Simulated Annealing (SA) to optimize appliance start times, while considering appliance operating durations, power consumption, inverter limit, battery state of charge constraints, and solar generation forecasts. Unlike most existing work, the scheduling is extended beyond a single day to accommodate unfinished tasks from previous days (spillover), ensuring operational continuity and enabling sequential operation across multiple days. Experimental results show that the sequential multi-day scheduling framework effectively manages system constraints while ensuring user convenience under exclusive solar generation. These findings also open opportunities for future research on multi-objective trade-offs between investment in equipment of various sizes, return on that investment, and user satisfaction.
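
To make the scheduling problem concrete, here is a minimal simulated-annealing sketch for choosing appliance start times on a single day. It is an illustration under stated assumptions, not the authors' implementation: the solar forecast, appliance data, inverter limit, cost weights, and all function names are hypothetical, and battery state of charge and multi-day spillover are left out.

```python
import math
import random

# Hypothetical hourly solar forecast (kW) for one day.
SOLAR = [0, 0, 0, 0, 0, 0, 0.2, 0.8, 1.5, 2.2, 2.8, 3.0,
         3.0, 2.8, 2.2, 1.5, 0.8, 0.2, 0, 0, 0, 0, 0, 0]
# Hypothetical appliances: name -> (duration_h, power_kW, preferred_start_h).
APPLIANCES = {
    "cooker": (1, 2.0, 18),
    "washer": (2, 1.0, 7),
    "dryer": (2, 1.5, 20),
}
INVERTER_LIMIT = 3.0  # kW, hypothetical


def load_profile(starts):
    """Hourly load (kW) implied by a dict of appliance start hours."""
    load = [0.0] * 24
    for name, start in starts.items():
        duration, power, _ = APPLIANCES[name]
        for hour in range(start, start + duration):
            load[hour] += power
    return load


def cost(starts, inconvenience_weight=0.1):
    """Energy not covered by solar, plus penalties (weights are assumptions)."""
    load = load_profile(starts)
    deficit = sum(max(0.0, l - s) for l, s in zip(load, SOLAR))
    overload = sum(max(0.0, l - INVERTER_LIMIT) for l in load)
    shift = sum(abs(starts[n] - APPLIANCES[n][2]) for n in starts)
    return deficit + 100.0 * overload + inconvenience_weight * shift


def simulated_annealing(seed=0, steps=5000, t0=2.0, cooling=0.999):
    rng = random.Random(seed)
    current = {n: spec[2] for n, spec in APPLIANCES.items()}  # preferred times
    current_cost = cost(current)
    best, best_cost = dict(current), current_cost
    temperature = t0
    for _ in range(steps):
        # Neighbour move: shift one appliance by one hour, clamped to the day.
        name = rng.choice(list(APPLIANCES))
        duration = APPLIANCES[name][0]
        candidate = dict(current)
        candidate[name] = min(24 - duration,
                              max(0, current[name] + rng.choice([-1, 1])))
        candidate_cost = cost(candidate)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if (candidate_cost <= current_cost or
                rng.random() < math.exp((current_cost - candidate_cost) / temperature)):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
        temperature *= cooling
    return best, best_cost
```

Calling `simulated_annealing()` returns the best start-hour assignment found and its cost; because the search starts from the preferred schedule, the result is never worse than that baseline under this cost function.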

A catalog of fast matrix multiplication algorithms with frontier-closure search

Benoit Chatain Lacelle — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13408v1 Announce Type: new Abstract: The 2022--2026 burst of activity in small-format matrix multiplication (AlphaTensor 2022, AlphaEvolve 2025, Schwartz--Zwecher 2025) has produced striking individual results but scattered them across different fields, attribution conventions, and serialisation formats. A complementary line of work -- Perminov's open-source flip-graph framework~\cite{perminov2026fast,perminov2025fast} -- instead drives existing construction methods, notably flip-graph and \emph{meta-flip-graph} search, at scale across large format spaces, discovering many new low-rank schemes (including ternary-integer ones) that further enrich the landscape this catalog must unify. We present a unified, machine-checkable catalog covering shapes up to $32 \times 32 \times 32$ over $\mathbb{Q}$, $\mathbb{Z}$, $\mathbb{R}$, $\mathbb{C}$, and $\mathbb{F}_2$, with a separate axis for commutative algorithms (Waksman 1970, Makarov 1986, Rosowski 2019). Derivation over this catalog is performed by a \emph{frontier-closure search} that recombines catalog entries by axis-flip, Kronecker, axis concatenation, serendipitous products, recombination-with-allocation (with optional output peeling and pair fusion), and downward projection. A central methodological point is the \emph{non-overlap property}: our recombination does not, and cannot, rediscover the shared bilinear products that hand-crafted constructions (Strassen, Laderman, Smirnov, AlphaTensor) are built around. This draws a clean line between the "find a cleverer bilinear core" and "compose known cores" axes of progress, and resolves several attribution puzzles in the literature. We refresh the DIS09 comparison tables, split per field and with a commutative column, and provide the tooling to regenerate them automatically as the catalog evolves.

Person Identification from Contextual Motion

Igor Kviatkovsky, Ehud Rivlin, Ilan Shimshoni — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13410v1 Announce Type: new Abstract: We consider the problem of identifying people based on their motion styles. We present a generative model describing the action instance creation process and derive a probabilistic identity inference scheme for two common person identification scenarios motivated by surveillance and authentication applications. We introduce a novel, \emph{interactive}, scenario for person identification from motion patterns. To this end, we formalize the identification process in the context of a sequential message exchange session between the subject and the system. The subject's behavior is modeled using a probabilistic generative model inspired by the Human Information Processing (HIP) paradigm. At each stage, the system presents a visual stimulus (a cue) to the subject and records their motion response. The cue is selected so as to maximize the mutual information of the expected response and the subject's identity. Once recorded, the response is used to update the a posteriori probability over possible subjects' identities. The process terminates once a sufficient classification confidence level is reached. To the best of our knowledge, this is the first time person identification is addressed in such an interactive setting. We report high recognition rates on five publicly available datasets and our own novel dataset consisting of 4,476 recordings of 22 test subjects responding to 15 cues.
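
The interactive loop described above (pick the cue with maximal mutual information, observe a response, update the posterior, stop at a confidence threshold) can be sketched for a toy discrete case. This is a simplified illustration, not the paper's model: responses are assumed to be discrete symbols with a known likelihood table, whereas the paper works with motion recordings and an HIP-inspired generative model. All names and the `likelihoods[cue][identity][response]` layout are assumptions.

```python
import math


def mutual_information(prior, likelihood):
    """I(identity; response) for one cue, with likelihood[i][r] = P(r | identity i)."""
    n_resp = len(likelihood[0])
    marginal = [sum(prior[i] * likelihood[i][r] for i in range(len(prior)))
                for r in range(n_resp)]
    mi = 0.0
    for i, p_i in enumerate(prior):
        for r in range(n_resp):
            joint = p_i * likelihood[i][r]
            if joint > 0.0:
                mi += joint * math.log(likelihood[i][r] / marginal[r])
    return mi


def identify(likelihoods, respond, threshold=0.95, max_rounds=50):
    """Sequential identification.

    likelihoods[c][i][r] = P(response r | identity i, cue c)  (assumed known).
    respond(cue) returns the observed response index for the presented cue.
    """
    n_ids = len(likelihoods[0])
    posterior = [1.0 / n_ids] * n_ids
    for _ in range(max_rounds):
        if max(posterior) >= threshold:
            break
        # Choose the cue whose expected response is most informative about identity.
        cue = max(range(len(likelihoods)),
                  key=lambda c: mutual_information(posterior, likelihoods[c]))
        response = respond(cue)
        # Bayes update with the observed response.
        posterior = [p * likelihoods[cue][i][response]
                     for i, p in enumerate(posterior)]
        total = sum(posterior)
        posterior = [p / total for p in posterior]
    best = max(range(n_ids), key=lambda i: posterior[i])
    return best, posterior
```

In this toy setting an uninformative cue (identical response distributions for every identity) has zero mutual information and is never selected while a discriminative cue is available.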

An End-to-End Hybrid Framework for Rumour Detection in Low-Resources Algerian Dialect

Dihia Lanasri, Fatima Benbarek — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13411v1 Announce Type: new Abstract: The rapid growth of social media has intensified the spread of rumours. This issue is more challenging in the Algerian context due to the informal and code-switched nature of dialectal content, the scarcity of annotated resources, and the limited effectiveness of standard Arabic NLP tools on dialect text. This paper presents an end-to-end rumour detection hybrid framework for Algerian dialect social media content. We build a domain-specific annotated dataset by combining real social media posts, synthetic data, and the FASSILA corpus, with automatic labeling based on a similarity-based annotation process. A transliteration pipeline is also introduced to generate parallel datasets in Arabic script and Arabizi. We evaluate multiple approaches, including classical machine learning, deep learning, transformers, and hybrid models. Experimental results show that a hybrid approach combining transformer embeddings with a classical classifier achieves the best performance, reaching an F1-score of 0.84. We also find that domain-specific pre-training is more important than model size, with social media-trained models outperforming larger models trained on formal Arabic corpora. These results demonstrate the feasibility of rumour detection in low-resource Algerian dialect settings.

When expectation fails: stochastic MPC of linear systems with random input losses

Paul Trodden, Xinda Li — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13421v1 Announce Type: new Abstract: We consider stochastic model predictive control (MPC) for constrained linear systems subject to multiplicative binary input uncertainty, motivated by applications such as networked control with packet losses and intermittent actuation. A common approach in this setting replaces the stochastic dynamics with their expectation, yielding tractable formulations that admit standard terminal ingredients and stability guarantees in expectation. We show that such formulations can exhibit structural properties that differ fundamentally from those of deterministic MPC and may be misleading as indicators of realized closed-loop behaviour. In particular, the expected value function is not necessarily monotonic in the prediction horizon, and value function-based inner approximations of the region of attraction may deteriorate as the horizon increases. Furthermore, we establish a probabilistic comparison with certainty-equivalent (optimistic) MPC, showing that the latter can ensure a strictly positive probability of recursive feasibility in situations where stochastic MPC certifies feasibility but fails with probability one. These results highlight inherent limitations of expectation-based stochastic MPC for systems with multiplicative binary uncertainty and motivate a re-examination of how stochasticity is incorporated into constrained predictive control design for such systems.
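
As a worked illustration of the expectation-based replacement discussed above, assume (this notation is not taken from the paper) i.i.d. Bernoulli input losses with delivery probability $p$:

```latex
\begin{align}
  x_{k+1} &= A x_k + \theta_k B u_k,
  \qquad \theta_k \in \{0, 1\}, \quad \Pr(\theta_k = 1) = p, \\
  \mathbb{E}\left[ x_{k+1} \mid x_k, u_k \right] &= A x_k + p\, B u_k .
\end{align}
```

Under this assumption the realized successor state is either $A x_k + B u_k$ or $A x_k$, and for $0 < p < 1$ and $B u_k \neq 0$ neither coincides with the expected state $A x_k + p\,B u_k$. A constraint satisfied by the expected trajectory therefore says nothing by itself about either realization, which is consistent with the abstract's warning that expectation-based formulations may mislead as indicators of realized closed-loop behaviour.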

An Assessment Framework for Application-Level Cryptographic Agility

Navaneeth Rameshan, Gregoire Messmer — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13425v1 Announce Type: new Abstract: The impending post-quantum transition to new cryptography will require complete replacement of algorithms within all software. The cryptographic APIs used today make this transition challenging because they were not designed with agility as a concern. There is no method for systematically assessing cryptographic agility as an overall ability. In addition to this, the term itself refers to multiple independent capabilities. Specifically, it includes replacing algorithms, selecting by policy, and substituting implementations. This lack of structured decomposition limits both the evaluation of systems and the development of cryptographically agile APIs. We introduce a component-based assessment framework that characterizes application-level cryptographic agility along seven orthogonal dimensions: three coupling dimensions that measure what the application code knows about algorithms and providers, a cross-cutting decoupling mechanism, a governance authority dimension, and two agility enablers that measure actual migration capability. The framework is non-linear and captures non-hierarchical profiles: a system may achieve high operation decoupling yet low creation decoupling, or strong versioning without externalized configuration. We evaluate six representative APIs (PKCS#11, OpenSSL~3.0, JCA, Google Tink, AWS KMS, and HashiCorp Vault Transit) against the framework, revealing three pervasive and independent gaps: no system supports intent-based key creation, none provides policy-driven algorithm selection (as distinct from access control), and none offers dedicated/first-class operations for algorithm transformation of existing keys. These gaps are individually sufficient to prevent agile migration, explaining why the post-quantum transition remains a software engineering problem despite decades of API progress.

Accelerating Speculative Diffusions via Block Verification

Alexander Soen, Hisham Husain, Valentin De Bortoli, Arnaud Doucet — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13426v1 Announce Type: new Abstract: Speculative decoding speeds up LLM inference by using a draft model to generate tokens, with an acceptance-rejection scheme that ensures that the output matches the target distribution. Adapting this to continuous diffusions is difficult because speculative sampling requires drawing from a residual distribution. While straightforward in discrete spaces, efficiently sampling this residual in continuous space is non-trivial. Consequently, existing diffusion adaptations either use computationally inefficient sampling techniques or rely on an alternative scheme. In this work, we introduce a novel scheme that efficiently implements the original speculative sampling mechanism for diffusion models. Our approach offers a critical advantage over current methods: it enables us to adapt block verification from LLMs to diffusions -- which provably improves the acceptance rate of drafts. Furthermore, we formalize and analyze the Free Drafter, a heuristic self-speculative drafter for diffusions that requires no training. By enabling block verification, our Free Drafter yields up to a 6.3% speedup over existing speculative methods with no additional training and negligible overhead beyond the existing parallel verification pass.
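
For reference, the discrete-space mechanism that the abstract calls straightforward can be sketched as follows: draw a draft from `q`, accept it with probability `min(1, p/q)`, and otherwise resample from the normalized residual `max(p - q, 0)`. This is the standard speculative sampling step for categorical distributions, not the paper's continuous-space scheme; the function names are illustrative.

```python
import random


def speculative_step(p, q, rng):
    """One accept/reject step: draft from q, corrected so the output follows p.

    p, q are lists of probabilities over the same finite vocabulary.
    Returns (token, accepted).
    """
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # Rejected: sample from the residual distribution max(p - q, 0), renormalized
    # implicitly by random.choices (weights need not sum to one).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


def output_distribution(p, q):
    """Exact distribution of the token returned by speculative_step."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accept)
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p == q, drafts are always accepted
        return accept
    return [a + reject_mass * r / total for a, r in zip(accept, residual)]
```

`output_distribution(p, q)` reproduces `p` for any draft `q` with matching support, which is the property the acceptance-rejection scheme is designed to guarantee. The difficulty highlighted in the abstract is that the residual has no such simple sampler in continuous space.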

VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits

Hoang-Nguyen Cao, Le-Hoang Bui, Dinh-Khoi Vo, Minh-Triet Tran, Trung-Nghia Le — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13427v1 Announce Type: new Abstract: Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches, which convey garment structure, and textual descriptions, which encode cultural semantics. The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. Textual prompts describe detailed outfit attributes, which are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. We establish standardized evaluation protocols and benchmark state-of-the-art composed image retrieval methods. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval. The dataset is publicly available at: https://hng0303.github.io/VietFashion.

A Scalable Deflated Conjugate Gradient Solver for the Time-Dependent Pseudo-Stress Stokes Problem

Alessandra Cancrini, Gabriele Ciaramella, Paola F. Antonietti — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13429v1 Announce Type: new Abstract: We propose a novel iterative solution framework for the unsteady Stokes equations in the pseudo-stress formulation. When solving this class of problems by using implicit time-integration schemes, standard solvers suffer from deteriorating convergence properties for small time steps, independently of the chosen space discretisation method. This is due to the singular modes of the dev-dev operator. For this reason, we introduce a computational framework obtained by combining a deflated Conjugate Gradient method with a W-cycle multigrid scheme that employs a Restricted Additive Schwarz smoother. The key point is to choose the deflation subspace so that the inner system to be solved within a deflated Conjugate Gradient scheme corresponds to a Laplace problem defined on the singular modes of the original dev-dev operator. This construction is independent of the spatial discretisation method and allows one to use efficient multigrid iterative solvers. Numerical experiments show that the proposed strategy significantly accelerates the Conjugate Gradient convergence and provides stable performance with respect to the time step, confirming its robustness for solving linear systems in the pseudo-stress framework.
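
The general idea of deflating near-singular modes can be sketched on a toy problem. The example below is an assumption-laden illustration, not the paper's solver: it uses a single deflation vector `z`, a one-dimensional path-graph Laplacian shifted by a small `eps` as a stand-in for an operator with a near-null space, and a scalar coarse solve where the paper uses a Laplace problem solved by multigrid. All function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Return alpha * x + y for plain Python lists."""
    return [alpha * a + b for a, b in zip(x, y)]


def shifted_path_laplacian(eps):
    """Matrix-free A = eps * I + L, with L the path-graph Laplacian (L @ 1 = 0)."""
    def apply_A(v):
        n = len(v)
        out = []
        for i in range(n):
            degree = (i > 0) + (i < n - 1)
            s = (degree + eps) * v[i]
            if i > 0:
                s -= v[i - 1]
            if i < n - 1:
                s -= v[i + 1]
            out.append(s)
        return out
    return apply_A


def deflated_cg(apply_A, b, z, tol=1e-10, max_iter=200):
    """Deflated CG for SPD A with one deflation vector z.

    Uses P = I - A z E^{-1} z^T with E = z^T A z, runs CG on P A y = P b,
    and recovers x = z E^{-1} z^T b + P^T y.
    """
    Az = apply_A(z)
    E = dot(z, Az)                                  # 1x1 coarse operator
    P = lambda v: axpy(-dot(z, v) / E, Az, v)       # P v
    Pt = lambda v: axpy(-dot(Az, v) / E, z, v)      # P^T v
    y = [0.0] * len(b)
    r = P(b)
    p = list(r)
    rs = dot(r, r)
    b_norm = dot(b, b) ** 0.5
    iters = 0
    while iters < max_iter and rs ** 0.5 > tol * b_norm:
        w = P(apply_A(p))
        alpha = rs / dot(p, w)
        y = axpy(alpha, p, y)
        r = axpy(-alpha, w, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs, p, r)
        rs = rs_new
        iters += 1
    coarse = [dot(z, b) / E * zi for zi in z]       # component along z
    return [c + v for c, v in zip(coarse, Pt(y))], iters
```

With `z` set to the constant vector, the smallest eigenvalue `eps` of the toy operator is removed from the Krylov iteration, so the conditioning seen by CG no longer degrades as `eps` shrinks. That mirrors, in a much simpler setting, the robustness with respect to the time step reported in the abstract.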

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13432v1 Announce Type: new Abstract: Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations that fail to handle multi-shot generation or synthesize cross-paired data, which suffer from data scarcity, resulting in poor performance in complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/

Momentum Space Algorithm for Electronic Structure of Double-Incommensurate Trilayer Graphene

Ken Beard, Daniel Massatt — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13434v1 Announce Type: new Abstract: Numerical algorithms for computing electronic structure of incommensurate 2D materials using ab initio models are critical for predicting material properties and guiding experiment. For bilayers, momentum space and continuum models have been introduced to approximate observables of ab initio tight-binding models using a momenta description despite the lack of periodicity in the tight-binding model required for Bloch theory. A similar structure has been introduced for double-incommensurate trilayers using a continuum model, where the three lattices are all mutually incommensurate. However, this description leads to a four-dimensional lattice space, and the density of states was observed to converge poorly. In this work, we introduce a momentum space framework for double-incommensurate trilayer graphene, and introduce an efficient truncation scheme of the four-dimensional lattice to drastically improve convergence of the density of states and momentum local density of states (a parallel object to classical band structure). We implement this algorithm on an ab initio model of twisted trilayer graphene and validate convergence estimates. We further verify numerically that the momentum space algorithm, inherently higher order than the continuum model as it is an exact transformation of the tight-binding model, captures altered band behavior near the flat bands at magic angles.

GIVE: Grounding Human Gestures in Vision-Language-Action Models

Pengfei Liu, Gen Li, Junqiao Fan, Boyu Ma, Jindou Jia, Yang Xiao, Jianfei Yang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13435v1 Announce Type: new Abstract: Human communication is inherently multimodal, where language is often accompanied by non-verbal cues such as gestures to convey intentions. However, current Vision-Language-Action (VLA) models treat robotic manipulation as a pure text-driven task, overlooking the important role of gestures in Human-Robot Interaction (HRI). This often leads to inaccurate intent grounding and unreliable manipulation when language instructions are ambiguous or underspecified. To address this challenge, we propose GIVE (Gesture Intent via Visual-Semantic Enhancement), an effective approach that enhances pre-trained VLA models with human gesture understanding without architectural modifications. Specifically, GIVE incorporates gesture information through two complementary pathways: a visual pathway that overlays hand skeletons and fingertip rays onto robot observations for explicit object grounding, and a semantic pathway that generates high-level descriptions of human gestures and task instructions for robust intent grounding. By jointly leveraging visual and semantic guidance, GIVE enables VLA policies to better associate gestures with manipulation behaviors and adapt to dynamic interaction intents. In real-world HRI experiments, GIVE substantially outperforms the baseline, improving target object recognition accuracy by 40% and overall task success rate by 80%, while demonstrating strong robustness and generalization to unseen spatial layouts and diverse participants.

Evaluation Sovereignty in Metadata-Driven Classification: A Multi-Track Framework for Weakly Supervised Information Systems

Raymond Vasquez — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13436v1 Announce Type: new Abstract: Evaluation in machine learning is typically treated as a neutral measurement process. However, in operational information systems, evaluation outcomes are often conditioned by the processes used to generate labels. This paper does not seek to improve classification performance. Instead, it examines the validity of performance measurement under differing label-authority regimes. This issue is particularly relevant in large-scale metadata-driven systems, where labels are often incomplete, inconsistent, or weakly supervised. We introduce evaluation sovereignty, defined as the degree to which performance metrics are independent of label authority and supervision regime, and propose a multi-track evaluation framework that systematically varies training and evaluation label sources. Using hierarchical multi-label classification on large-scale scientific metadata, we demonstrate that models exhibiting strong performance under operational ("silver") evaluation degrade substantially under independent ("gold") evaluation, particularly for fine-grained classification. For example, Micro-F1 decreases from approximately 0.54 to 0.03. Notably, ranking-based metrics remain above baseline, revealing a divergence between latent model signal and classification validity. These findings suggest that commonly reported performance metrics may reflect alignment with labeling processes rather than true predictive capability. We therefore reconceptualize evaluation validity as a system-level property shaped by label governance and provide a practical methodology for auditing intelligent systems operating under weak supervision.
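
The multi-track idea (score the same predictions against different label authorities) can be illustrated with a small micro-averaged F1 computation. This is a toy sketch: the label sets below are invented for illustration and the function name is hypothetical; only the micro-F1 formula, 2TP / (2TP + FP + FN), is standard.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label data given as per-item label sets."""
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        true_labels, pred_labels = set(true_labels), set(pred_labels)
        tp += len(true_labels & pred_labels)
        fp += len(pred_labels - true_labels)
        fn += len(true_labels - pred_labels)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical example: identical predictions, two label sources.
predictions = [{"physics", "optics"}, {"biology"}]
silver_labels = [{"physics", "optics"}, {"biology"}]   # operational labels
gold_labels = [{"physics"}, {"chemistry"}]             # independent labels

silver_score = micro_f1(silver_labels, predictions)
gold_score = micro_f1(gold_labels, predictions)
```

Here the model looks perfect against the labels it was aligned with and much weaker against the independent ones, which is the kind of gap the abstract attributes to label authority rather than to predictive capability.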

CQC-RAG: Robust Retrieval-Augmented Generation via Cross-Query Consistency

Yanjia Sun, Sifan Liu, Jie Shao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13438v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has become a common approach for improving the factuality of Large Language Models (LLMs), yet its reliability remains highly sensitive to how external evidence is retrieved and used. Semantically equivalent queries with different syntactic forms may lead to different retrieval results, while irrelevant or misleading documents can further induce hallucinated answers. Existing multi-path reasoning methods improve robustness by sampling multiple candidate answers and applying voting- or confidence-based selection, but they still face two limitations: diversity is often injected through uncontrollable decoding randomness, and answer evaluation is usually confined to a single query-induced evidence view. To address these limitations, we propose a Cross-Query Consistency Hypothesis: correct answers tend to maintain high confidence across semantically equivalent but syntactically diverse queries, whereas noise-induced hallucinations exhibit unstable confidence under such query variations. Based on this hypothesis, we introduce CQC-RAG, a framework that co-designs query-level diversity injection with cross-query consistency evaluation. CQC-RAG rewrites the original question into diverse but meaning-preserving queries, reranks a shared document pool to construct query-conditioned reasoning contexts, applies an evidence-grounded protocol to extract answer-evidence pairs, and selects answers according to their confidence stability across these contexts. This design enables self-evaluation without external supervision and does not rely on expanded retrieval coverage. Experiments on four open-domain question answering benchmarks show that CQC-RAG outperforms the strongest previous multi-query baseline by +4.76 pp EM on TriviaQA and +9.12 pp EM on MuSiQue, validating the effectiveness of cross-query consistency for filtering noise-induced hallucinations.
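
One simple way to operationalize "confidence stability across query rewrites" is to reward high mean confidence and penalize its spread. The scoring rule below (mean minus a multiple of the standard deviation) is an assumption made for illustration; the abstract does not specify the paper's actual selection rule, and the function name and `penalty` parameter are hypothetical.

```python
from statistics import mean, pstdev


def select_by_consistency(confidences, penalty=1.0):
    """Pick the answer whose confidence is both high and stable across rewrites.

    confidences maps each candidate answer to a list of confidence values,
    one per rewritten query (use 0.0 where the answer was not produced).
    """
    scores = {answer: mean(values) - penalty * pstdev(values)
              for answer, values in confidences.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical confidences from three meaning-preserving rewrites.
example = {
    "Paris": [0.80, 0.82, 0.79],   # stable across rewrites
    "Lyon": [0.95, 0.10, 0.20],    # confident under one phrasing only
}
answer, scores = select_by_consistency(example)
```

In this toy case the answer with one very confident but unstable vote loses to the answer that stays moderately confident under every rewrite, matching the hypothesis stated in the abstract.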

S-GBT: Smooth Growth Bound Tensor for Certified Robustness Against Word Substitution Attacks in NLP

Mohammed Bouri, Mohammed Erradi, Adnane Saoud — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13439v1 Announce Type: new Abstract: Despite recent progress in Natural Language Processing (NLP), models remain vulnerable to word substitution attacks. Most existing defenses focus on first order sensitivity and measure how much the output changes when the input is slightly perturbed. However, they ignore how this sensitivity evolves, which is described by curvature. When gradients vary sharply, models can still fail. This paper introduces the Smooth Growth Bound Tensor (S-GBT), a second order method that bounds the Hessian element-wise, for which we provide formal theoretical proofs on the resulting robustness bounds. A regularization term is added during training to minimize these bounds. This yields tighter certified robustness against word substitution attacks. The change in the output under word substitution is bounded by both a linear term and a quadratic term. S-GBT is derived for two architectures: Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN). The method is integrated directly into the training objective. Its effectiveness is evaluated on multiple benchmark datasets. The results show that combining first and second order regularization improves certified robust accuracy by up to 23.4% compared to prior methods, while clean accuracy remains competitive. These findings indicate that controlling both the gradient and its variation is a promising direction for building more robust models.
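
The linear-plus-quadratic structure mentioned above can be seen from a second-order Taylor expansion. The following is a generic derivation under the assumption of a scalar, twice-differentiable output $f$ whose Hessian satisfies the element-wise bound $|H_{ij}(\xi)| \le B_{ij}$ on the segment between $x$ and $x + \delta$; the symbols $B$ and $\delta$ are illustrative and the paper's exact bound may differ.

```latex
\begin{align}
  f(x + \delta) - f(x)
    &= \nabla f(x)^{\top} \delta
       + \tfrac{1}{2}\, \delta^{\top} H(\xi)\, \delta
       \quad \text{for some } \xi \text{ between } x \text{ and } x + \delta, \\
  \left| f(x + \delta) - f(x) \right|
    &\le \underbrace{\left| \nabla f(x)^{\top} \delta \right|}_{\text{linear term}}
       + \underbrace{\tfrac{1}{2}\, |\delta|^{\top} B\, |\delta|}_{\text{quadratic term}},
\end{align}
```

where $|\delta|$ denotes the element-wise absolute value. Regularizing only the gradient controls the first term; a small $B$ is what keeps the second term from dominating when gradients vary sharply.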

Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models

Joseph Keshet — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13441v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and that such agency constitutes the form of free will relevant to responsibility. Although LLMs generate coherent and normatively evaluable outputs, their operation is fully characterized by probabilistic input-output mappings learned from data. Their apparent intentionality is derived rather than intrinsic, and their outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. We address objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffice to establish genuine agency.

How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators

Jihyeon Hur, Yongseok Kwon, Min-Gi Jo, Jeongwhan Choi, Noseong Park — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13443v1 Announce Type: new Abstract: Neural operators have emerged as a powerful data-driven approach for solving time-dependent PDEs. Among recent advances, memory-augmented neural operators explicitly incorporate past states and have achieved remarkable performance under low-resolution observation settings. However, existing approaches apply a fixed memory weight regardless of observation conditions, such as resolution or physical parameters, limiting their adaptability. Our preliminary experiments reveal that optimal memory weight varies with resolution and viscosity, implying that a fixed memory weight cannot simultaneously optimize performance across diverse settings. We propose AMGFNO, which dynamically modulates memory weight through a learnable gate. On the Kuramoto-Sivashinsky and Burgers' equations, AMGFNO achieves a 55-79% nRMSE reduction at low resolution, with the learned gate value automatically decreasing from $\bar{g} \approx 0.7$ to near-zero as resolution increases.

Clustering Node Attributed Networks with Graph Neural Networks and Self Learning

Rodrigo de Sapienza Luna, Daniel Ratton Figueiredo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13444v1 Announce Type: new Abstract: Graph clustering - partitioning the node set of a graph into disjoint subsets that reflect some latent information - is a fundamental problem as it finds applications in a myriad of different scenarios. While this classic problem has been tackled for decades by different communities, a recent variation of the problem driven by real data considers the scenario where nodes have attributes that are also informative. This has triggered novel methods that simultaneously leverage network information (edges) and node information (attributes) in the design of novel clustering algorithms. This work proposes a novel framework that builds on prior works that have applied graph neural networks (GNN) to graph clustering. The proposed framework operates in rounds of self learning in a fully unsupervised setting. In each round, a GNN generates representations for nodes that are used to cluster the nodes. This clustering influences the graph used to generate the node representation in the next round. Moreover, a context graph built in each round using the original graph is used to generate the node representations. Empirical results show that the proposed methodology extracts information from both network edges and node attributes in synthetic data, outperforming algorithms focused solely on the network or attributes when neither is very informative. Multiple rounds of learning also improve the performance and always outperform a long single round of training (i.e., classic GNN graph clustering). When considering real datasets, empirical results indicate that the proposed methodology is competitive with state-of-the-art methods when cluster sizes are balanced.

Intent-Based Cryptographic API Design for Cryptographic Agility

Navaneeth Rameshan, Gregoire Messmer — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13445v1 Announce Type: new Abstract: As organizations move toward post-quantum cryptography, they face the major challenge of updating cryptographic algorithms across large, complex software portfolios. However, most cryptographic APIs in use today were designed around specific algorithms. These APIs expect explicit use of specific algorithms, provide little or no support for policy-based algorithm selection, and offer no straightforward way to migrate existing keys to newer algorithms. This makes the transition to post-quantum cryptography challenging. The companion assessment framework identifies the barriers to cryptographic agility and explains why algorithm transition is largely a software engineering problem. To address the limitations of current cryptographic APIs, we identify the principles necessary to design a cryptographically agile API. The design principles are derived from five fundamental architectural characteristics (Abstraction, Stability, Temporal Flexibility, Separation, and Extensibility). We also show how the design principles can be implemented using several examples of Protocol Buffers API design patterns. In particular, we present an intent vocabulary based on scopes, which allows key creation to be decoupled from algorithm identities. It also supports transparent substitutions of algorithms in the applicable scope. Cryptographic governance is enabled by an abstract policy API that does not prescribe the policy format. Keys are represented by stable identifiers and support key evolution operations (rotation, transformation, migration), facilitating migration between algorithms and providers while tracking both the original key identity and its evolution history. With this approach, updating cryptography becomes an operational process without the need to rewrite application code.
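
The separation between intent, policy, and stable key identity can be sketched in a few lines. This is a hypothetical illustration only: the paper presents Protocol Buffers design patterns, whereas the classes, method names, intent and scope strings, and policy contents below are invented for this example, and no real key material or cryptographic operation is involved.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Maps (intent, scope) to an algorithm label. Contents are hypothetical."""
    rules: dict

    def resolve(self, intent, scope):
        return self.rules[(intent, scope)]


@dataclass
class KeyRecord:
    key_id: str                 # stable identifier seen by the application
    intent: str
    scope: str
    algorithm: str              # chosen by policy, never by application code
    history: list = field(default_factory=list)


class KeyService:
    """Toy service: applications state intent; policy picks the algorithm."""

    def __init__(self, policy):
        self.policy = policy
        self.keys = {}

    def create_key(self, intent, scope):
        algorithm = self.policy.resolve(intent, scope)
        record = KeyRecord(str(uuid.uuid4()), intent, scope, algorithm)
        self.keys[record.key_id] = record
        return record.key_id

    def migrate(self, key_id):
        """Re-resolve the policy and record the algorithm change, keeping the ID."""
        record = self.keys[key_id]
        target = self.policy.resolve(record.intent, record.scope)
        if target != record.algorithm:
            record.history.append(record.algorithm)
            record.algorithm = target
        return record


# Application code names an intent and a scope, not an algorithm.
service = KeyService(Policy({("sign", "long-term"): "ECDSA-P256"}))
key_id = service.create_key("sign", "long-term")

# An operator later updates the policy; application code is unchanged.
service.policy = Policy({("sign", "long-term"): "ML-DSA-65"})
migrated = service.migrate(key_id)
```

The point of the sketch is that the application only ever holds `key_id` and an intent, so the switch from one algorithm label to another happens through policy and a migration operation rather than a code change.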

Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests

Ali Arabat, Mohammed Sayagh — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13449v1 Announce Type: new Abstract: AI-agents (e.g., GitHub Copilot) collaborate as teammates in different software engineering tasks, including code generation proposed through pull requests (Agentic-PRs). For better agent efficiency, developers create instruction files that guide the AI-agents, including how to navigate the project, locate the right components, run tests, respect best practices, and more. In this paper, we investigate the relationship between the creation of these instructions and the performance of AI-agents in creating better pull requests, which have a higher chance of success (i.e., the merge rate), address more complex tasks (e.g., code churn), and require less effort to be merged (e.g., time to merge). To this end, we analyze 15,549 agentic PRs from 148 projects in the AIDev dataset. Using the three dimensions, we compare each project before and after the creation of the instruction files. We find that specifying instructions for AI-agents does not necessarily lead to better results. With the instruction files, 27.7\% of the projects increased their merge rate by at least 20\%, while 26.35\% decreased it. The same observation is seen with the amount of changes (e.g., code churn, number of modified files) and with the efforts to merge an agentic PR (e.g., merge time and number of comments). From a first exploration, we find that projects that managed to increase their merge rate have substantially longer instruction files, which are also well structured into a higher number of sections and sub-sections. Our results motivate the need for research to assist practitioners in framing the development of instruction files as a software engineering activity (aka, \textbf{Instructions-as-Code}).

Uncertainty Estimation for Molecular Diffusion Models

Paul Seij, Christian A. Naesseth, Stephan Mandt, Metod Jazbec — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13451v1 Announce Type: new Abstract: Diffusion models have seen wide adoption for 3D molecular generation, yet they offer no principled signal of when a generated molecule is likely to be of low quality. We propose a post-hoc method for estimating per-sample uncertainty in pretrained molecular diffusion models. Building on a Laplace approximation of the denoising network, we measure the variability of the noise prediction across the generation trajectory. Empirically, we show that the resulting uncertainty score is informative of sample quality, exhibiting a negative correlation with established sample-level quality metrics. We further study how the proposed uncertainty score can be used to filter generated samples, improving model performance via test-time scaling.

Examining the Cognitive Gap Between Authors and Peer Reviewers on Academic Paper Novelty

Chenggang Yang, Chengzhi Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13452v1 Announce Type: new Abstract: Novelty is a crucial metric for assessing the quality of academic papers. Scholars strive to highlight the novel aspects of their work, particularly in the title, abstract, and introduction. Peer review, serving as the gatekeeper of scientific rigor, rigorously evaluates the novelty of papers, yet a cognitive gap may exist between author self-promotion and reviewer evaluation. To investigate this, we analyzed 15,328 academic papers published in Nature Communications from 2016 to 2021, along with their peer-review comments. We found that both reviewers and authors emphasize result-oriented innovation, with reviewers adopting a more comprehensive evaluation perspective. Furthermore, by examining promotional intensity against inherent paper novelty, we found that its effect depends on the paper's actual innovation level. Highly innovative papers benefit from stronger promotional language, receiving more positive evaluations. We also found that promotional language significantly correlates with reviewer disagreement on novelty specifically for papers of moderate innovativeness, whereas it has negligible impact for papers with either very high or very low novelty. This reveals how promotional language operates most prominently in the gray area of academic evaluation.

Reduced basis algorithm for solving nonlinear differential equations on quantum computers

Monica Lăcătuş, Matthias Möller, Sauro Succi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13457v1 Announce Type: new Abstract: As quantum computing moves toward scientific computing applications, nonlinear differential equations remain a central challenge since quantum evolution is intrinsically linear. In this work, we introduce a reduced basis algorithm (RBA) for polynomial nonlinear ordinary differential equations (ODEs) and spatially discretized partial differential equations (PDEs). After time discretization, the method composes the resulting polynomial update map over $m$ timesteps, identifies the reduced monomial basis appearing in this composed map, and constructs a linear RBA operator whose action recovers the exact $m$-timestep nonlinear dynamics. Thus, at the level of the chosen discrete update rule, the method introduces no additional approximation error beyond the time discretization error. The qubit number requirement is governed by the size of the reduced monomial basis. For an $n$-dimensional polynomial ODE system of degree $p>1$, the lifted register requires at most $q_m^{\mathrm{ODE}} = O(nm\log p)$ qubits in the full basis scenario. For PDEs discretized on $N^D$ grid points, a locality-based construction requires at most $q_m^{\mathrm{PDE}} = O(D\log N + n m^{D+1}\log p)$ qubits. Hence, the dependence on the grid size remains logarithmic, while the nonlinear overhead is controlled by local reduced basis size. The main computational burden is moved from the quantum computer to a classical preprocessing step, where the reduced monomial basis and RBA operator are constructed for the chosen timestep window. Through numerical tests on the Lorenz system and the one-dimensional Burgers equation, we verify that the RBA reproduces the corresponding discrete time nonlinear dynamics exactly, while exposing the trade-off between timestep composition, reduced basis growth, and locality.
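
The compose-then-lift idea in this abstract can be illustrated on a scalar toy problem. The sketch below is not the authors' algorithm: it assumes a forward-Euler discretization of dx/dt = -x^2 (a hypothetical example, not one of the paper's test cases), composes the polynomial update map over m timesteps, reads off the monomials that actually appear, and checks that a single linear map on the lifted monomial vector reproduces the m-step nonlinear iteration. All function names are illustrative.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def poly_compose(outer, inner):
    """Coefficients of outer(inner(x)), built with Horner's scheme."""
    result = [outer[-1]]
    for c in reversed(outer[:-1]):
        result = poly_mul(result, inner)
        result[0] += c
    return result


def composed_map(step, m):
    """Coefficients of the update map applied m times (x_m as a polynomial in x_0)."""
    result = step
    for _ in range(m - 1):
        result = poly_compose(step, result)
    return result


def reduced_basis(coeffs, tol=1e-15):
    """Exponents of the monomials that actually appear in the composed map."""
    return [k for k, c in enumerate(coeffs) if abs(c) > tol]


def apply_lifted(coeffs, basis, x0):
    """Linear action on the lifted vector (x0**k for k in basis)."""
    lifted = [x0 ** k for k in basis]
    row = [coeffs[k] for k in basis]
    return sum(r * v for r, v in zip(row, lifted))


h = 0.1                    # assumed timestep
step = [0.0, 1.0, -h]      # forward Euler for dx/dt = -x^2: x -> x - h*x^2
coeffs = composed_map(step, m=2)
basis = reduced_basis(coeffs)

# Direct two-step nonlinear iteration for comparison.
x = 0.5
for _ in range(2):
    x = x - h * x * x
```

Here the two-step map is a degree-4 polynomial, so the lifted vector has four entries; the growth of this basis with m is the trade-off the abstract refers to.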

From Traditional Automation to Embodied Wireless Intelligence: Vision-Language-Action Empowered Physics-Aware Communication Networks

Genze Jiang, Kezhi Wang, Xiaomin Chen, Yizhou Huang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13458v1 Announce Type: new Abstract: Wireless network automation has progressed from rule-based self-organising networks (SON) to data-driven optimisation, yet existing systems remain fundamentally disembodied, optimising performance indicators without perceiving the physical environment that governs radio propagation. We propose the embodied intelligent empowered base station (eBS), a paradigm that adopts a Vision-Language-Action (VLA) pipeline to transform base stations into autonomous agents capable of situated perception, causal physical reasoning, and physics-aware action generation. The eBS employs a two-tier asynchronous architecture: a Semantic Planner powered by a frontier Vision-Language Model (VLM) generates structured action directives on human timescales, whilst a Tactical Controller executes real-time adaptation. Case studies demonstrate that a single VLA pipeline, without task-specific training, can perform zero-shot material reasoning, generalise across viewpoints, and predict dynamic events before signal degradation occurs, illustrating a paradigm shift from traditional rule-following network automation to embodied intelligence empowered future wireless networks.

VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

Ruiqi Xian, Yuehan Xian, Jing Liang, Xuewei Qi, Dinesh Manocha — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13460v1 Announce Type: new Abstract: Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but object and rare-class errors can affect free-space interpretation, collision checking, and temporal state propagation. We show that a common VLM strategy, aligning 3D voxel or object features with crop-caption embeddings, improves text-space similarity without reliably improving closed-set occupancy mIoU. Motivated by this mismatch, we propose VISA, a training-time semantic auditing approach for existing occupancy world models. VISA queries an offline VLM on a representative crop of each physical object instance, obtains a structured audit with class hypotheses, plausible confusions, reliability, attributes, and evidence, and propagates it along the object track. The audit is grounded to matched 3D object voxels and distilled into semantic logits through reliability-weighted taxonomy, attribute-factor, and scene-level audit graph losses, while inference remains unchanged and requires no VLM. On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU improves from 18.18 to 19.16 and rare-class mIoU from 15.60 to 16.79. These results suggest that VLMs are better suited to closed-set occupancy as reliability-aware semantic auditors than as generic caption-embedding targets.

Reinforcement Learning for Neural Model Editing

Shaivi Malik — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13461v1 Announce Type: new Abstract: Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learning problem, where agents modify models using reward feedback. We introduce two environments: MaskWorld, where agents scale weights multiplicatively, and ShiftWorld, where agents apply additive weight updates. The reward function combines a utility-preservation objective with a task-specific editing objective, enabling agents to learn targeted modifications while maintaining overall model performance. We evaluate the framework on bias mitigation in text classification and machine unlearning in image classification, both of which traditionally rely on specialized algorithms. Our results show that the learned policies reduce forget set accuracy to nearly 0% while preserving over 90% retain set accuracy on the unlearning task. In the bias mitigation setting, the learned policies improve bias-related performance by more than 5% while maintaining general classification utility. Our findings show that neural model editing can be cast as a reinforcement learning problem, allowing editing policies to be learned from reward feedback rather than manually engineered for each task.

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13464v1 Announce Type: new Abstract: Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires conversation-level contextual evidence. Existing ASR correction methods often rely on the current hypothesis or concatenate raw dialogue history. In such contexts, sparse correction evidence can be difficult to locate amid redundancy and noise. Addressing these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. The framework organizes preceding interaction history into a dynamically updatable ontology memory, where entities, terminology, surface variants, potential ASR confusions, and semantic relations are stored as retrievable nodes for context-grounded correction. To evaluate this setting, we construct RAMC-Corr, a dataset derived from MAGIC-RAMC for long-range ASR correction with grounded context. Experiments on RAMC-Corr show that our method improves over direct correction in 9 out of 10 paired backbone-setting combinations and encourages more selective and evidence-grounded corrections for context-dependent ASR errors.

Embodied Opinion Dynamics for Safety-Critical Motion Control in Dynamic Environments

Zhiqi Tang, Yu Xing — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13465v1 Announce Type: new Abstract: This paper proposes a novel adaptive control framework that embeds nonlinear opinion dynamics within the dynamical sensorimotor layers of an automated vehicle governed by second-order nonholonomic bicycle kinematics. The framework enables an ego vehicle to perform adaptive decision-making and achieve safe motion control under interaction uncertainty with non-cooperative neighboring agents. We consider a representative case study in which an ego vehicle autonomously attempts to merge into a lane occupied by human-driven or automated vehicles whose intentions are unknown. Within the proposed framework, the ego vehicle adaptively selects and executes merging versus non-merging behaviors in response to changing environmental conditions. Formal safety guarantees, as well as equilibrium and stability analyses of the closed-loop system, are provided. Numerical simulations further demonstrate the effectiveness of the proposed approach.

$W-\delta-\mu$ dual codes and LCD codes

Avanish Kumar Chaturvedi, Satyadeep Pandey — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13467v1 Announce Type: new Abstract: We introduce a new product on the ambient space $F_q^n$ as a generalization of Euclidean, Hermitian and $\delta$ products. We give some general properties of the dual codes, relation with Euclidean duals, definition and characterization of self orthogonal, self dual, dual containing and LCD codes along with certain existence conditions. Also, we calculate the dual codes of some classes of codes like repetition, binary and $\lambda$-constacyclic codes with respect to this product. Further, we extend and analyse this notion of the product for codes over semisimple rings.

Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset

Mahmoud Abujadallah, Ali Arabat, Mohammed Sayagh — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13468v1 Announce Type: new Abstract: AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources that require human reviews, verifications, and running tests and validations for fixes that are merely discarded. Our goal in this paper is to understand the failure modes of AI-agents, an understanding that is crucial for better integrating AI-agents as efficient teammates. In this paper, we conduct a qualitative study on a representative sample of 306 non-merged pull requests created or co-authored by the agents mentioned earlier, followed by a quantitative analysis of the reasons for rejection. Our qualitative findings identify 14 reasons divided into four high-level categories for rejecting AI-agent fixes. We observe that developers reject fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines and without introducing a breaking change. Our results suggest the need for good prioritization of tasks so that generated fixes do not lead to wasted human review efforts or wasted agent resources (e.g., tokens, compute, or allowed number of requests).

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13473v1 Announce Type: new Abstract: We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.

Exploring Systems-Thinking Approaches to Loss of Control Risk

Aurelio Carlucci, Sean P. Fillingham, James Walpole, Jakub Kryś — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13474v1 Announce Type: new Abstract: Internal deployment of agentic AI systems for coding and research creates a sociotechnical control problem that extends beyond model behaviour. We treat internal-deployment Loss of Control as the inability to reliably constrain, audit, reverse, or halt AI-mediated changes to code, infrastructure, evaluation, or deployment processes in time to prevent serious organisational or societal harms. We ask whether established systems-safety methods can identify risks that model-level evaluations may miss. Using a generic frontier-lab coding-agent scenario reconstructed from public materials, we apply STECA, STPA, and FRAM. The analyses surface complementary findings: published frameworks can leave governance responsibilities and feedback loops externally unverifiable; delays in monitoring and intervention can make otherwise valid control actions ineffective; and routine operational variability can gradually erode the calibration and independence of safeguards. We argue that frontier-AI risk management should pair model-focused evaluations with systems-level hazard analysis and operational assurance that tracks whether controls remain effective over time.

SupraBench: A Benchmark for Supramolecular Chemistry

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13477v1 Announce Type: new Abstract: Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark currently systematically evaluates LLMs for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles distilled from Europe PMC, to support adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain adaptation pretraining over SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source code and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.

A Reactive Redistribution Mechanism for STL Tasks in Multi-Agent Systems Under Time-Varying Communication

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13479v1 Announce Type: new Abstract: We present a communication-aware task decomposition framework for multi-agent systems with collaborative relative configuration objectives specified in Signal Temporal Logic (STL), allowing for dynamic task reallocation under time-varying communication networks. Building on our prior work, the framework supports the direct use of existing feedback controllers for reactive task satisfaction. We address two key challenges: disjunctive STL specifications and time-varying communication networks. Disjunctive specifications are handled through a graph transition system that captures the alternative task sequences induced by logical OR operators. To address time-varying connectivity, we introduce a redistribution mechanism that transfers tasks from disconnected agents to connected ones as the network evolves while preserving decentralized execution. Simulations and experiments on a swarm of Crazyflie drones demonstrate scalability in the number of agents, communication connectivity, and specification complexity.

A Stabilized Multilevel B-Spline-Based Fast Integral Method for the Solution of the Electric Field Integral Equation

Danijel Jukić, Bernd Hofmann, Thomas F. Eibert, Simon B. Adrian — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13482v1 Announce Type: new Abstract: We present a multilevel B-spline-based fast integral method for the solution of the electric field integral equation (EFIE), combining fast Fourier transformation (FFT)-compatible kernel interpolation with robust high-order interpolation. Existing FFT-accelerated global Lagrange-based approaches rely on equidistant interpolation points and can, therefore, suffer from Runge-type instabilities at high interpolation orders, limiting robust high-accuracy compression. In contrast, B-splines on equidistant knot vectors overcome these instabilities and enable robust high-order interpolation for accurate matrix compression. Replacing Lagrange interpolation by B-spline interpolation is, however, non-trivial: B-spline coefficients do not coincide with function values at the interpolation points, and the associated sampling matrices can become ill-conditioned. To address these challenges, we introduce a knot-removal stabilization strategy, combined with exact interlevel transfers based on knot insertion, yielding accurate, well-conditioned multilevel interpolation. Moreover, we propose a factorization strategy that preserves the null space of the scalar potential operator up to machine precision and is compatible with low-frequency preconditioning techniques. Numerical results for both canonical and realistic geometries demonstrate robust high-order interpolation without the breakdown observed for Lagrange-based approaches and confirm $\mathcal{O}(N)$ complexity.

Impedance MPC with Patient-Torque Estimation for Knee Rehabilitation Exoskeletons

Yongyan Cao, Jinshan Tang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13485v1 Announce Type: new Abstract: Knee rehabilitation exoskeletons must enforce a prescribed joint trajectory while remaining safely compliant with involuntary spasm and voluntary patient effort, two objectives that are in tension for any fixed-gain impedance controller. We present an Impedance Model Predictive Control framework for knee rehabilitation exoskeletons, demonstrated on a series-elastic-actuator (SEA) platform: an algebraic feedforward reduces the knee dynamics to a constant-coefficient scalar double integrator, and a receding-horizon quadratic program (QP) computes corrective torques while enforcing hard range-of-motion, torque, and velocity limits (ISO 13482). A Kalman disturbance state driven by direct SEA-based torque sensing (the series-elastic spring deflection measured through the elastic element, which is an intrinsic, EMG-free patient-torque estimate rather than a separate load cell) gives a nominal offset-free guarantee and, via its sign and the desired-motion direction, sensorless Assist-as-Needed. The constant state matrix permits offline precomputation of the QP cost inverse, enabling 500 Hz operation with a multi-step horizon. Across seven-controller benchmarks (sinusoidal tracking, isometric hold), the 500 Hz Kalman MPC is offset free (0.1 mrad RMS, 0.1 mrad steady-state, 0.2 mrad peak under 15 Nm spasm), versus a 515 mrad steady-state offset for classical impedance at the same stiffness, with the direct-measurement channel converging the estimate near-immediately (within a few sampling periods). Without the estimator it realizes a classical impedance (4.8 mrad RMS, 8.3 mrad steady-state). All MPC variants meet the 87 mrad clinical criterion; no classical controller does. The architecture is formulated for the 20 DOF MyoSuite myoLeg via coupling-aware per-joint QPs.
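
As a rough illustration of why a constant-coefficient double integrator permits offline precomputation, the sketch below writes out the exact zero-order-hold discretization at the 500 Hz rate quoted in the abstract. It assumes a unit-coefficient plant (angular acceleration equals the corrective input); the paper's actual coefficient, cost, and constraint formulation are not given in the abstract, and the function names are hypothetical.

```python
def discretize_double_integrator(dt):
    """Exact zero-order-hold discretization of theta'' = u (unit coefficient assumed)."""
    A = ((1.0, dt),
         (0.0, 1.0))
    B = (0.5 * dt * dt, dt)
    return A, B


def step(state, u, A, B):
    """Advance (angle, angular velocity) by one sample under constant input u."""
    theta, omega = state
    return (A[0][0] * theta + A[0][1] * omega + B[0] * u,
            A[1][0] * theta + A[1][1] * omega + B[1] * u)


dt = 1.0 / 500.0                      # 500 Hz control rate from the abstract
A, B = discretize_double_integrator(dt)
next_state = step((0.0, 0.0), 1.0, A, B)
```

Because A and B depend only on the sample time, any matrices built from them for a fixed horizon can be formed once before the control loop starts.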

CRAFTIIF: Cross-Resolution Analytic Four-Type Interpretable Isolation Forest for Multivariate Time Series Anomaly Detection

William Smits — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13486v1 Announce Type: new Abstract: Anomaly detection in multivariate time series is challenged by four structurally distinct anomaly types -- point (isolated spikes), distributional (level shifts), temporal (rhythm changes), and collective (inter-sensor correlation breakdowns) -- each requiring different feature representations. Most unsupervised methods target only one or two types and provide limited interpretability. We present CRAFTIIF (Cross-Resolution Analytic Four-Type Interpretable Isolation Forest), a fully unsupervised framework targeting all four types without dataset-specific tuning. CRAFTIIF generates K=500 random analytic wavelet feature draws across four families (Morlet, DOG, Haar, Coiflet), each targeting a specific anomaly type, feeding five structured Isolation Forests -- one per type plus a meta-IF for compound anomalies. An adaptive Otsu/MAD threshold calibrates detection automatically across anomaly rates from 0.1% to 69.2%. Because each IF is trained exclusively on type-specific features, branch firing provides direct anomaly-type attribution by construction, without post-hoc explanation. Evaluated on all 19 datasets of the mTSBench benchmark (Zhou et al., TMLR 2026), CRAFTIIF achieves mean F1=0.228 (all 19 datasets) and F1=0.322 (13 detectable datasets), ranking first among all 25 evaluated methods on VUS-PR (0.463 vs. previous best 0.329, +40.7%). A diagnostic framework -- oracle F1, detectability limits, and branch separation ratios -- identifies 6 of 19 datasets as fundamentally undetectable by any unsupervised method. Ablation over 11 conditions confirms adaptive thresholding (+38% F1), four-branch structure (+20%), and meta-IF (+23%) are each essential. Code: https://github.com/smitswil/craftiif
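
The abstract does not say how the Otsu and MAD criteria are combined, so the following is only a sketch of the MAD half under stated assumptions: anomaly scores are thresholded at the median plus k scaled median absolute deviations, with k = 3 and the usual 1.4826 normal-consistency factor chosen here for illustration. The function name and the sample scores are hypothetical.

```python
import statistics


def mad_threshold(scores, k=3.0):
    """Median + k * 1.4826 * MAD, a common robust cutoff for anomaly scores."""
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    return med + k * 1.4826 * mad


scores = [0.10, 0.12, 0.11, 0.09, 0.10, 0.95]   # made-up scores, one clear outlier
threshold = mad_threshold(scores)
flagged = [i for i, s in enumerate(scores) if s > threshold]
```

Since the median and MAD are insensitive to a minority of extreme scores, the cutoff adapts to the bulk of the data rather than to the anomalies themselves.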

Point-Wise Geometry-Aware Transformer for Partial-to-Full Point Cloud Registration in Computer-Assisted Surgery

Siyu Zhou, Zhongliang Jiang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13488v1 Announce Type: new Abstract: Partial-to-full registration remains challenging due to varying overlap ratios, fluctuating point densities, and the presence of noise. While transformers have shown strong potential for point cloud processing, prior methods typically confine them to global context aggregation, overlooking fine-grained local geometry crucial for accurate correspondence. We propose GAPR-Net, a learning-based point cloud registration framework with a coarse-to-fine architecture that combines convolution and transformer modules, in which local and global information is fused between the partial and full point clouds using a cross-attention mechanism. To achieve this, a transformation-invariant point-wise geometric feature representation is proposed, which can robustly capture relative geometric features for individual points with respect to their neighboring points. To evaluate the effectiveness of the proposed approach, experiments are conducted on four geometrically distinct bones, including the tibia, femur, pelvis, and thoracic cartilage. The overall registration recall reaches 94.2%, and the method achieves a low RMSE of 1.992 mm and $R^2$ values of 0.908 and 0.974 for rotation and translation, respectively. The results demonstrate that the proposed method effectively addresses the partial-to-full point cloud registration problem. The proposed method enables highly accurate 3D point cloud registration using partial observation, providing a critical foundation for precise surgical navigation and robotic interventions in computer-assisted surgery. The code will be made accessible after the double-blind review process.

Spectral Filtering of 3D Integral Operators Using Modified Green's Functions

Alessandro Bellusci, Viviana Giunzioni, Adrien Merlini, Francesco P. Andriulli — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13489v1 Announce Type: new Abstract: Several recent contributions have analyzed and illustrated the effectiveness of operator filtering, both in terms of regularization and compression, when handling dense matrices arising from the discretization of integral operators, e.g. the single-layer operator. Previous works have introduced different filtering strategies, ranging from Laplacian-based filters to analytically derived ones, with the goal of improving the computational efficiency of iterative and direct solvers for integral equations in the two-dimensional space, like the 2D Electric Field Integral Equation (EFIE). In this work, we propose a filtering strategy based on the spectral truncation of the kernels of integral operators associated with the 3D EFIE. The approach relies on an appropriate spectral representation of the Green's function obtained via the spherical Hankel transform, which provides an analytical foundation for the proposed approach. Finally, we provide semi-analytical and numerical evidence of the impact of this filtering technique on the spectral properties of continuous integral operators and of their discretization through boundary elements, both for the static and dynamic cases.

Fundamental Limits of Hypergraph Edge Partitioning under Independent Edge Sampling

Javad Maheri, K. K. Krishnan Namboodiri, Petros Elia — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13491v1 Announce Type: new Abstract: Hypergraph edge partitioning is a central problem in theoretical and applied computer science, with broad impact on distributed computation, communications, optimization, and machine learning. In this setting, one is given a collection of hyperedges -- each consisting of up to $d$ vertices from a ground set of size $n$ -- and seeks to assign these hyperedges across $N$ partitions so as to minimize, for example, the vertex footprint, i.e., the maximum number of vertices that appear in any partition. We here identify the fundamental limits of hypergraph edge partitioning -- optimized over all conceivable algorithms -- for a broad class of probabilistic hypergraph models where each hyperedge may appear independently with its own probability; a model sufficiently general to encompass well-known models such as the Degree-Corrected or Mixed-Membership models, the Hypergraph Stochastic Block model, the Latent-Space/Geometric or Kernel Models, and others. By pairing our deterministic partitioner with a new converse, we first show that, for any $n,d$, and under the very mild condition of $N \leq \binom{\lfloor\sqrt{\frac{nd}{2}}\rfloor}{d}$, as long as the hyperedge set $\mathbf{X}$ satisfies $|\mathbf{X}| \gtrsim n N \log N$, then with probability at least $1-2/3n^z$, no algorithm can provide a footprint $\pi_{\mathbf{X}}$ less than $$\pi^{\bigstar}_{\mathbf{X}} = \frac{1}{2\sqrt{2}}\frac{n}{N^{1/d}}. $$ We then show that our hypergraph partitioner comes to within a small constant factor from $\pi^{\bigstar}_{\mathbf{X}}$, for each $\mathbf{X}$. This optimality captures dense and sparse hypergraphs alike (with sizes down to linear in $n$), and it additionally entails a near-optimally balanced allocation of hyperedges across partitions.
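
To get a feel for the scale of the footprint bound stated in this abstract, the sketch below simply evaluates the closed-form expression n / (2*sqrt(2) * N^(1/d)) and the stated upper limit on N for one choice of parameters. The values n = 10000, d = 2, N = 100 are illustrative assumptions, and the function names are hypothetical; the probabilistic conditions on the hyperedge set are not modeled here.

```python
import math


def footprint_lower_bound(n, num_partitions, d):
    """Evaluate n / (2 * sqrt(2) * N**(1/d)) from the abstract."""
    return n / (2.0 * math.sqrt(2.0) * num_partitions ** (1.0 / d))


def max_partitions_allowed(n, d):
    """Evaluate binom(floor(sqrt(n*d/2)), d), the stated upper limit on N."""
    return math.comb(math.isqrt((n * d) // 2), d)


n, d, N = 10_000, 2, 100              # illustrative parameters
limit = max_partitions_allowed(n, d)  # N must not exceed this for the result to apply
bound = footprint_lower_bound(n, N, d)
```

With these numbers the condition on N holds comfortably, and the bound says that some partition must see roughly 354 of the 10000 vertices.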

Digital Twin-Based Simulation for Predictive Decision-Making in Waterway Logistics

Matthijs Jansen op de Haar, Daniel Frutos Rodriguez — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13492v1 Announce Type: new Abstract: This paper investigates the potential of a Digital Twin (DT) for freight routing across inland waterway networks under uncertain water-level conditions. Existing approaches insufficiently account for increasing climate-induced volatility in water levels, which often result in higher operational costs and emissions due to the need for more expensive transport alternatives, such as road transport. These existing methods often rely on reactive countermeasures to remain resilient. To address these limitations, six interviews with experts in domains related to inland shipping were conducted to identify three common contingency scenarios and appropriate operational responses. These scenarios were subsequently incorporated into a time-sliced simulation environment in which predictive decision-making, enabled by a DT environment, was compared against reactive approaches. The results demonstrate that predictive modeling substantially reduces operational costs and modal shifts at prediction accuracies between 70% and 100%, despite extreme conditions. In addition, the predictive model achieves an average 28.3% reduction in fuel-related costs by reducing the total distance ships travel. The simulation outcomes were evaluated together with domain experts to assess the practical relevance and applicability of the proposed DT-enabled approach.

NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13494v1 Announce Type: new Abstract: Goal-conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules that require an external planner to convert predicted futures into closed-loop control. We propose Navigation World Action Model (NavWAM), a diffusion-transformer policy that turns navigation world-model prediction into executable action by representing future observations, goal-progress values, and action chunks in a shared latent sequence. By learning future prediction jointly with the action and value targets that determine closed-loop behavior, NavWAM makes visual foresight directly usable for robot control. We build NavWAM through simulation pretraining and real-robot adaptation, and evaluate it on image-goal navigation against planning-based world models and a representative direct navigation policy. Across offline benchmarks and closed-loop real-robot deployment, NavWAM improves over planning-based world-model baselines in our evaluations while using the default policy mode without CEM-style action search. Project page: https://dachii-azm.github.io/navwam/

Budget-Constrained Step-Level Diffusion Caching

Mingkun Lei, Tong Zhao, Liangyu Yuan, Chi Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13496v1 Announce Type: new Abstract: Step-level caching accelerates diffusion models by exploiting temporal redundancy across denoising steps. Existing methods make per-step cache decisions using threshold-based heuristics, without directly optimizing for final output quality. As a result, their inference latency varies across inputs and is difficult to control at deployment. In this work, we propose BudCache, which inverts this formulation: rather than letting per-step error thresholds dictate the runtime cost, we fix the compute budget in advance and search for the cache policy that best preserves the final output. To tackle the combinatorial complexity of step selection, we combine Simulated Annealing with deterministic Hill Climbing. This offline search identifies high-quality cache policies within minutes and introduces no online search or thresholding overhead during inference. When the compute budget is very tight, we further introduce cache-aware schedule alignment, which adapts the time discretization to the selected cache policy to reduce cache-induced trajectory mismatch. Experiments on FLUX.1-dev and Wan2.1 show that BudCache achieves better generation quality than heuristic caching baselines under the same inference budgets. Code is available at https://github.com/Westlake-AGI-Lab/BudCache
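
The budget-first search described above can be sketched as follows: fix how many denoising steps are fully computed, then look for the best placement of those steps using simulated annealing followed by deterministic hill climbing. This is a toy illustration under stated assumptions, not BudCache itself. In particular, `policy_cost` is a hypothetical stand-in for final-output degradation, the per-step `sensitivity` weights are invented, and all function names are made up for this sketch.

```python
import math
import random


def policy_cost(compute_steps, sensitivity):
    """Toy proxy (assumption): each cached step pays its sensitivity
    times the distance to the most recent fully computed step."""
    cost, last = 0.0, 0
    for t, s in enumerate(sensitivity):
        if t in compute_steps:
            last = t
        else:
            cost += s * (t - last)
    return cost


def swap_neighbors(policy, num_steps):
    """Policies reachable by swapping one computed step for a cached one.
    Step 0 is always computed, since nothing can be reused before it."""
    cached = [t for t in range(1, num_steps) if t not in policy]
    for out in sorted(policy - {0}):
        for inn in cached:
            yield (policy - {out}) | {inn}


def search_policy(sensitivity, budget, iters=2000, t0=1.0, seed=0):
    """Return (sorted computed steps, cost) using exactly `budget` computed steps."""
    n = len(sensitivity)
    assert 2 <= budget < n
    rng = random.Random(seed)
    current = frozenset({0} | set(rng.sample(range(1, n), budget - 1)))
    cur_cost = policy_cost(current, sensitivity)
    best, best_cost = current, cur_cost

    # Stage 1: simulated annealing with a linearly decaying temperature.
    for i in range(iters):
        temp = t0 * (1 - i / iters) + 1e-9
        out = rng.choice(sorted(current - {0}))
        inn = rng.choice([t for t in range(1, n) if t not in current])
        cand = (current - {out}) | {inn}
        cand_cost = policy_cost(cand, sensitivity)
        delta = cand_cost - cur_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = current, cur_cost

    # Stage 2: deterministic hill climbing until no single swap improves.
    improved = True
    while improved:
        improved = False
        for cand in swap_neighbors(best, n):
            c = policy_cost(cand, sensitivity)
            if c < best_cost - 1e-12:
                best, best_cost, improved = cand, c, True
                break
    return sorted(best), best_cost


policy, cost = search_policy([1.0] * 10, budget=5)
print(policy, cost)
```

Because the search runs offline, inference only needs to look up whether a step is in the returned policy, which is the property the abstract highlights. The swap move keeps the number of computed steps constant, so every candidate respects the budget by construction.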

SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13497v1 Announce Type: new Abstract: This work introduces Spatial Annotations from Robot Demonstrations with Reliability Calibration (SPARC), a risk-aware framework that automatically labels robot demonstrations with structured spatial annotations and assigns each annotation a reliability score. Structured spatial annotations, such as bounding boxes, object trajectories, and manipulation phase labels, benefit a broad range of robotics applications from training grounded robot policies and embodied foundation models to motion planning and hierarchical task composition. Existing automated pipelines generate such annotations at scale but provide no reliable quality signal: detector confidence is poorly calibrated for annotation correctness, forcing a choice between accepting noisy labels and discarding useful samples. In contrast to existing automated pipelines, SPARC leverages the spatio-temporal structure inherent to robot tasks to generate a reliability signal, reducing noisy labels and retaining more useful samples. We further introduce Interaction-Aware Bench (IA-Bench), a benchmark that measures model accuracy in grounding the locations of interacted objects in robot demonstrations. On 1.7k human-annotated demonstrations spanning diverse embodiments and scenarios, SPARC significantly outperforms detection-only baselines in localization accuracy while retaining three times more samples at high-precision operating points. Our experiments demonstrate that models finetuned on our annotations achieve state-of-the-art results on object-grounding and pointing benchmarks among similarly sized models, while remaining competitive on broader spatial-reasoning suites without manually verified or annotated training data. Furthermore, policies trained on SPARC-generated annotations outperform baselines in cluttered, visually ambiguous real-world scenes. Code, data, and models are available at intuitive-robots.github.io/sparc-labeling.

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13501v1 Announce Type: new Abstract: Diffusion Transformers (DiTs) have become the dominant architecture for image and video generation, creating growing demand for efficient DiT serving. Existing systems assign each request a fixed parallel configuration throughout its lifetime. However, DiT workloads exhibit substantial heterogeneity across requests, execution stages, and system conditions, making static parallelism inefficient and often leading to poor GPU utilization and degraded service quality. This paper argues that DiT serving should treat GPU parallelism as a first-class schedulable resource. We present GF-DiT, a policy-programmable runtime for elastic DiT serving that dynamically adapts the parallelism of running requests according to workload demands and service objectives. GF-DiT introduces an asynchronous execution abstraction that decomposes requests into independently schedulable trajectory tasks and enables online GPU reallocation. To make elastic parallelism practical, GF-DiT further proposes group-free collectives, a lightweight communication abstraction that supports low-overhead online formation and reconfiguration of arbitrary execution groups. We implement GF-DiT in vLLM-Omni and evaluate it on representative image and video diffusion workloads. Compared with fixed-pipeline execution with static parallelism, GF-DiT improves throughput by up to 6.01$\times$, reduces mean latency by up to 95%, lowers SLO violation rates by up to 90%, and reduces communication-group setup overhead from 778 ms to approximately 60 $\mu$s.

Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13503v1 Announce Type: new Abstract: Robust localization in unstructured environments, such as agricultural fields, is a critical challenge for autonomous systems. LiDAR sensors provide detailed 3D information about the environment and are invariant to lighting conditions. For this reason, LiDAR-based place recognition methods have gained significant attention. In this paper, we propose MinkUNeXt-VINE++, a novel approach that combines early fusion of heterogeneous LiDAR data from two sensors (Livox Mid-360 and Velodyne VLP-16) and a learned re-ranking strategy at inference time. This fusion leverages the strengths of each sensor to provide a more comprehensive representation of the environment. Additionally, the re-ranking approach is particularly important in repetitive environments, such as vineyards, as finding true positives is a major challenge. We evaluated our approach using the TEMPO-VINE dataset, which provides heterogeneous LiDAR data in vineyard environments across different phenological stages. Our results demonstrate that MinkUNeXt-VINE++ significantly improves place recognition performance compared to single-sensor approaches and state-of-the-art methods. MinkUNeXt-VINE++ achieves a 20% improvement in the Recall@1 metric compared to single-sensor approaches, and +30% including re-ranking. The code of our method is publicly available for reproduction.

Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data

Qixu Chen, Satoshi Nakamura — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13507v1 Announce Type: new Abstract: Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech translation performance. We study how to train an audio-language model to make keep/drop decisions on paired speech directly from audio. To obtain reliable supervision without manual labels, we adopt a scalable two-stage Rank-to-Distill strategy. A lightweight ranker first generates keep/drop pseudo-labels from noisy speech pairs; these pseudo-labels are then used to train an audio large language model to predict keep/drop directly from raw paired speech. The resulting model jointly captures acoustic fidelity and cross-lingual semantic consistency for the selection of speech-conditioned data. Experiments on CVSS-C and SpeechMatrix show consistent improvements over unfiltered training, yielding up to +1.4 ASR-BLEU for end-to-end S2ST.

Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization

Mateo Toro Diz, Jonathan Hoss, Noah Klarmann — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13509v1 Announce Type: new Abstract: Indoor vision-based localization systems are affected by detection noise, occlusions, and limited camera coverage, leading to uncertainty at multiple stages of the pipeline. While multi-camera data fusion is widely used to mitigate these issues, it is typically treated as a black-box component and evaluated solely end-to-end, obscuring its mechanistic contributions. To address this gap, this work investigates whether explicitly characterizing single-camera localization errors can be leveraged to calibrate and optimize multi-camera data fusion. We introduce a measurement-calibrated fusion approach that integrates component-wise error quantification, and we conduct a component-wise evaluation to quantify the error contributions from homography calibration, human detection, and motion tracking. Experimental results show that data fusion improves localization accuracy compared to single-camera baselines. While measurement-calibrated fusion provides only limited improvement in absolute accuracy over standard fusion, it substantially reduces trajectory variance and improves motion smoothness, which are critical for applications requiring stable and continuous motion estimates. These results highlight the value of explicit error characterization when designing data fusion strategies for vision-based indoor positioning systems.

Spectrum Sharing Across Terrestrial and Non-Terrestrial Services in the FR3 Upper Midband

Paolo Testolina, Ergest Beshaj, Michele Polese, Tommaso Melodia — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13511v1 Announce Type: new Abstract: The frequency bands between 7 and 24 GHz, also known as upper midband or Frequency Range (FR) 3, are being considered as an enabler of 6th Generation (6G) mobile networks. This portion of the spectrum exhibits different propagation characteristics compared to frequencies above 24 GHz, while also offering the potential to provide larger bandwidth allocations for mobile systems than those available in the sub-6 GHz range. 6G technology and spectrum policy, however, will need to guarantee coexistence with the incumbents that already use these frequency bands, which include a variety of services, from radiolocation to satellite-based communications, remote sensing, and radioastronomy. In this paper, we consider the challenge of coexistence between 6G terrestrial systems and satellite incumbents in different portions of the FR3 bands. Using a large-scale 3D model of a terrestrial deployment in the city of Boston and an open-source ray tracing solution, we evaluate the level of Radio Frequency Interference (RFI) that tens of terrestrial Next Generation Node Bs (gNBs) generate toward satellites at different elevation angles. Our model, based on realistic obstruction, clutter, diffraction, and reflections, shows that sidelobes and Non-Line-of-Sight (NLoS) paths can significantly contribute to RFI. Besides directionality, the spatial distribution of gNBs also plays a key role in defining the RFI levels, suggesting that a careful design and operation of terrestrial deployments can create coexistence opportunities.

CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13513v1 Announce Type: new Abstract: Driven by conservative over-provisioning to guarantee service reliability, resource utilization in cloud data centers remains at low levels. To mitigate this, the forecast-then-optimize paradigm has emerged to optimize consolidation by anticipating future demands. While emerging time series foundation models promise to enhance this paradigm through zero-shot generalization, existing benchmarks focus solely on prediction error metrics. The actual decision utility of these advanced models remains unverified, rendering their practical value for downstream tasks uncertain. To bridge this gap, we propose CloudCons, a comprehensive end-to-end benchmark designed to evaluate forecasting models within the specific context of cloud resource consolidation. We build high-quality datasets that cover diverse workloads from Huawei Cloud, Microsoft Azure, and Google Borg, capturing distinct service characteristics ranging from synchronized diurnal rhythms to stochastic, pulse-like bursts and high-frequency noise. We conduct an extensive evaluation of statistical, deep learning, and foundation models. Our experiments reveal a pivotal finding: while foundation models demonstrate superior zero-shot forecasting accuracy, this advantage does not inherently translate into better decision utility. Of practical significance, we systematically analyze how the selection of predictive quantiles acts as a critical lever. We provide actionable guidelines for calibrating these selections to balance the trade-off between resource efficiency and service reliability, offering vital insights for real-world deployment decisions.
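
The quantile lever mentioned above can be illustrated with a small sketch: reserve capacity at a chosen quantile of the forecast distribution, then measure how often demand exceeds the reservation and how much of the reservation is used. The metric definitions, function names, and numbers here are assumptions made for illustration; CloudCons may define its decision-utility metrics differently.

```python
import math


def empirical_quantile(samples, q):
    """Smallest sample value with at least a fraction q of samples at or below it."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[idx]


def evaluate_quantile(forecast_samples, actuals, q):
    """Reserve the q-quantile of each forecast; return (violation rate, utilization).

    forecast_samples[t] holds sampled demand forecasts for time t (hypothetical
    input format); actuals[t] is the realized demand.
    """
    reserved = [empirical_quantile(s, q) for s in forecast_samples]
    violations = sum(a > r for a, r in zip(actuals, reserved))
    used = sum(min(a, r) for a, r in zip(actuals, reserved))
    return violations / len(actuals), used / sum(reserved)


# Hypothetical data: the same ten forecast samples at four time steps.
samples = [list(range(1, 11))] * 4
actuals = [3, 6, 9, 10]
for q in (0.5, 0.9):
    print(q, evaluate_quantile(samples, actuals, q))
```

In this toy data, moving from the median to the 0.9 quantile cuts the violation rate from 0.75 to 0.25 while utilization drops from 0.9 to 0.75, which is the efficiency-versus-reliability trade-off the abstract refers to.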

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13515v1 Announce Type: new Abstract: World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.

Adaptive-Frequency Resonate-and-Fire Neurons for Spectral Estimation of Streaming Radar Signals

Stefano Chiavazza, Sen Yuan, Marc Geilen, Francesco Fioranelli, Federico Corradi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13516v1 Announce Type: new Abstract: Frequency Modulated Continuous Wave (FMCW) radar systems traditionally rely on Fourier-based methods, such as the Fast Fourier Transform (FFT), to estimate target range and velocity. While computationally efficient, these approaches require storing and processing large blocks of data, which can become a bottleneck in memory-constrained or low-latency applications. In this work, we propose a neuromorphic-inspired signal processing method based on adaptive resonate-and-fire (ARF) neurons formulated as a discrete-time dynamical system. Each neuron dynamically adjusts its internal frequency to match dominant frequency components of the input radar signal, enabling direct estimation of target ranges and velocities without computing the full frequency spectrum. The proposed model operates in a sample-by-sample fashion, resulting in memory requirements that scale with the number of tracked targets rather than the signal length. A feedback mechanism is also introduced to enable multiple neurons to lock on distinct frequency components in multi-target cases. Results on simulated and experimental data demonstrate that the method can successfully track multiple targets. Compared to conventional FFT-based approaches, the proposed method offers reduced memory usage proportional only to the number of tracked targets, making it suitable for resource-constrained and edge-based radar applications.

Probabilistic, Resource-Aware, Asynchronous, Out-of-Order Choreographies

Mako Bates, Steven Baldasty, Ernest Hyun, Christian Skalka, Joseph P. Near — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13520v1 Announce Type: new Abstract: Futures-based implementations of out-of-order choreographies can substantially improve latency and throughput, but their actual behavior depends on resources such as communication delay, computation time, failures, and recovery. Existing formal models such as Ozone's O3 describe which executions are possible, but do not directly explain how likely those executions are or how long they take. In this work we present AsInst, a probabilistic, resource-aware language for modeling the semantics of asynchronous choreographies with out-of-order execution. AsInst programs are interpreted as temporal Bayesian networks that model both the values produced at runtime and the times at which they become available. We prove that this central semantics correctly captures a corresponding futures-style network semantics. We also show that AsInst can encode Ozone-style select-and-merge conditionals, and we use case studies to model communication-failure recovery and analyze runtime performance.

What's Old is New Again: Classical Dimensionality Reduction for Efficient Saliency-Guided Biometric Attack Detection

Samuel Webster, Walter Scheirer — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13528v1 Announce Type: new Abstract: Saliency-guided training is a paradigm in visual recognition that encourages models to focus on the most relevant image regions during learning. While its application in biometric presentation attack detection (PAD) has shown strong benefits in robustness and generalization, adoption is often limited by the high cost, domain specificity, and limited scalability of existing saliency acquisition methods, such as human annotations over a limited dataset. We present a novel, cost-efficient, and highly scalable approach to saliency acquisition using maps inspired by classical dimensionality reduction techniques: PCA and LDA. Our proposed methods generate saliency maps directly from raw training data, requiring neither human annotation nor domain knowledge. We contextualize the effectiveness of these saliency sources in three saliency-explored domains (iris PAD, synthetic face detection, fingerprint PAD) and demonstrate their scalability in two saliency-novel domains (fingerprint vein PAD and ID card PAD). Across all domains tested, models trained using dimensionality reduction-sourced saliency maps exceed baseline and sometimes SOTA saliency methods without any resource investment or domain-specific tooling. Our findings overcome an important yet unaddressed barrier to saliency-guided training for biometric attack detection and beyond.

Ride, Track, and Recover: Pilot Randomized Trial of a Wearable Digital Self-Management Intervention During a Veteran Endurance-Cycling Program

Alan Ta, Nilsu Salgin, Caleb Armstrong, Kala Phillips Reindel, Farzan Sasangohar — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13529v1 Announce Type: new Abstract: Post-traumatic stress disorder (PTSD) in veterans is characterized by persistent hyperarousal and comorbid anxiety and depressive symptoms that are difficult to monitor and manage outside clinical settings. Thirteen veterans participating in a Project Hero cycling event in Texas were randomized by computer-generated sequence in a naturalistic setting to two arms: (1) digital intervention plus physical activity, or (2) physical activity only, plus a third at-home monitoring control cohort consisting of 7 veterans selected from the broader Project Hero veteran community. Continuous smartwatch sensing combined heart rate and accelerometer features to detect hyperarousal events, which were confirmed in real time by participants. Weekly self-report measures of anxiety, depression, and PTSD severity were collected. Generalized additive mixed models characterized nonlinear trajectories over time. Baseline-normalized hyperarousal trajectories differed significantly across conditions, with the digital intervention group (n=7) showing structured stabilization compared to late-study escalation in the physical-only group (n=3). Both cycling groups exhibited acute symptom improvements during the endurance event; however, the digital intervention group demonstrated a higher overall maintenance of gains. The at-home control group (n=4) showed gradual symptom declines. Perceived precision of ML detections varied substantially across individuals and was positively associated with symptom severity, with higher-severity participants confirming a greater proportion of detected events. These results suggest that coupling wearable detection with digital self-management tools may support stabilization of hyperarousal and symptom improvement while emphasizing the importance of personalization and human-centered design in wearable mental health systems.

Graphical Causal Reasoning for Root Cause Analysis in Cloud Networks

Fabien Chraim, Dominik Janzing, John Evans — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13532v1 Announce Type: new Abstract: Cloud computing relies on large-scale networks, which are inherently complex systems. In this paper, we present a novel approach to root cause analysis (RCA) of cloud network incidents, leveraging graph-based causal discovery techniques. Our method addresses the limitations of rule-based automation by introducing a spatiotemporal grouping strategy and an automation ontology to reduce the dimensionality of the problem. We construct a causal graph from binary time series data using bivariate Granger causality and conditional independence tests. For inference, we introduce a probabilistic method that assigns edge-specific conditional probabilities as a function of time lag, allowing for interpretable, time-aware root cause scoring via causal graph traversal. We evaluated the system using a labeled dataset of 35 production incidents from a major cloud provider. The model successfully recalled the correct root cause in 85.7% of incidents and produced an exact match in 74.3%. In production, the deployed system has been used in over 800 real-world incidents, with positive qualitative feedback from network engineers. These results highlight the practicality of a data-driven, causal approach to RCA in dynamic and large-scale operational environments.

OneRetrieval: Unifying Multi-Branch E-commerce Retrieval with an Editable Generative Model

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13533v1 Announce Type: new Abstract: Industrial e-commerce search serves hundreds of millions of items through a multi-branch retrieval stage fused by hand-tuned merging without joint optimization. Generative retrieval (GR) raises the prospect of collapsing this stage into a single model, yet unification is gated by more than retrieval quality: the inverted-index branch converts below the platform average yet persists because it is almost the only branch where operations can inject a new term within hours without any model update; a one-model substitute must preserve this real-time editability. Existing GR methods structurally lack it: closed-codebook methods fix each slot to a quantized embedding at training, while open-vocabulary methods leave new-term routing to model generalization. We present OneRetrieval, a one-model GR framework built on Keyword-Aligned Encoding (KAE), which ties each identifier position to an interpretable attribute word, pairing competitive recall quality with the editability of the inverted index -- to our knowledge the first editable generative retrieval method. An information-theoretic merging organizes 18 attribute categories into six codebook groups with non-uniform capacity; reserved slots in each codebook can be bound to new words after deployment without retraining; and a four-stage fine-tuning pipeline secures quality and editability jointly. On five million real-traffic requests, OneRetrieval matches the deep recall of the strongest generative baseline, with an intervention hit rate over an order of magnitude above closed-codebook encodings. Online, replacing the inverted-index branch significantly lifts order volume; extending to nearly the entire stage holds conversion while improving CTR. The system is deployed at Kuaishou, serving hundreds of millions of PVs daily.

When Does Mixing Help? Analyzing Query Embedding Interpolation in Multilingual Dense Retrieval

Tongyao Zhu, Chao-Ming Huang, Min-Yen Kan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13537v1 Announce Type: new Abstract: While mixed-language querying is ubiquitous in multilingual communities, the sensitivity of dense retrievers to such queries remains poorly understood. We present a ratio-controlled study on mMARCO that systematically evaluates retrieval performance by varying the mixing proportion of parallel query translations via embedding-level mixing -- constructing mixed queries as an interpolation of monolingual embeddings. Experiments with BGE-M3 demonstrate that an optimal mixing ratio outperforms the best monolingual endpoint in 88/105 cases. We uncover a distinct asymmetry driven by English dominance: mixing is uniformly beneficial when retrieving from non-English document indices, whereas indices containing English are best served by pure English queries. Furthermore, English acts as the strongest mixing partner for every non-English document language. Finally, when controlling for English dominance, mixing gains correlate negatively with typological distance. We conclude that language-mix sensitivity is structured and predictable, and we validate the robustness of these patterns across model families and scales.
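
Embedding-level mixing, as described above, amounts to a convex combination of two monolingual query embeddings followed by a sweep over the mixing ratio. The sketch below uses tiny hand-written vectors in place of real BGE-M3 embeddings and scores with cosine similarity; the function names, the toy vectors, and the choice of cosine scoring are all assumptions for illustration.

```python
import math


def interpolate(e_a, e_b, alpha):
    """Mixed query embedding: alpha * e_a + (1 - alpha) * e_b."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(e_a, e_b)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def sweep(e_a, e_b, doc, steps=10):
    """Score the mixed query against one document embedding for each ratio."""
    results = []
    for i in range(steps + 1):
        alpha = i / steps
        results.append((alpha, cosine(interpolate(e_a, e_b, alpha), doc)))
    return results


# Toy stand-ins for two translations of one query and a relevant document.
query_lang_a = [1.0, 0.0]
query_lang_b = [0.0, 1.0]
document = [1.0, 1.0]

best_alpha, best_score = max(sweep(query_lang_a, query_lang_b, document), key=lambda r: r[1])
print(best_alpha, best_score)
```

The endpoints `alpha = 1` and `alpha = 0` recover the two monolingual queries, so comparing the best interior ratio against the better endpoint mirrors the comparison reported in the abstract.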

NetCause: Counterfactual Learning for Root Cause Analysis in Large-Scale Networks

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13543v1 Announce Type: new Abstract: Can a learned model capture how faults propagate through a large-scale network and use this knowledge to causally attribute customer impact to its underlying root cause? Existing root cause analysis techniques often rely on static rules, correlation heuristics, or topology-local reasoning, which struggle to generalize in dynamic environments where faults propagate across complex physical and logical dependencies. We present NetCause, a self-supervised learning-based framework that models network incidents as graph-temporal processes and uses counterfactual simulation to rank candidate root causes. This approach produces an interpretable ranking of root cause hypotheses and integrates naturally with operator-defined mitigation and remediation actions. We train the model on over 1,500 incidents collected over six months from a leading cloud provider's production network and evaluate it on 31 expert-labeled incidents. NetCause consistently improves root cause ranking quality in the regime most relevant to operational decision-making, achieving a 16.1% accuracy improvement over a rule-based heuristic baseline. While training is computationally intensive, inference is lightweight, requiring only seconds of GPU runtime per incident (well below typical telemetry collection latencies).

A general-purpose global regularization method for 3D volume integral operators

Thomas G. Anderson, Marc Bonnet, Luiz M. Faria, Carlos Pérez-Arancibia — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13549v1 Announce Type: new Abstract: Singular volume integral operators associated with constant-coefficient partial differential operators extend the applicability of potential theory to inhomogeneous problems, for example arising from nonlinearities or variable coefficients. Typically the PDE kernels in these operators give rise to singularities at all $\mathcal{O}(1/h^3)$ volume discretization/evaluation points in a mesh of characteristic size $h$, while the slowly decaying nature of such kernels gives rise to long-range interactions that require coupling to fast summation algorithms. The presented method uses Green's identities to regularize a wide variety of both scalar-valued and vector-valued volume integral operators by use of a certain regularizing volume density interpolant. The analysis shows how the regularizing effect of the interpolant is global in the sense that the interpolation quality increases in an exactly compensatory fashion as the distance to the Green's function singularity decreases. High-order convergence estimates with tabulated simplex quadratures are established, including with exact representation of curved domains.

Uncertainty-Aware Hybrid Retrieval for Long-Document RAG

Hoin Jung, Xiaoqian Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13550v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) depends critically on the quality and granularity of retrieved evidence. Large retrieval units preserve context but often introduce irrelevant content, which can dilute answer-bearing evidence and worsen long-context utilization. Fine-grained units are more compact, but they may be difficult to retrieve reliably because short chunks can lack semantic, lexical, or bridging cues needed to match the query. We propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that treats chunk granularity as query-specific reliability estimation. Instead of training a new retriever or modifying the generator, UMG-RAG uses existing dense and sparse retrievers as complementary experts across multiple chunk granularities. For each query, it converts each expert-granularity score list into an evidence distribution, estimates reliability from distribution entropy, and fuses candidates according to query-specific semantic, lexical, and granularity confidence. We further introduce UMGP-RAG, a parent promotion variant that uses fine-grained hits to locate relevant evidence while returning broader non-redundant parent chunks for local coherence. Experiments on question answering benchmarks show that uncertainty-aware fusion and parent promotion improve generation quality while maintaining a lightweight, plug-and-play retrieval pipeline.
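
The three steps named above (score list to evidence distribution, entropy to reliability, reliability-weighted fusion) can be sketched in a few lines. The abstract does not give the exact formulas, so the softmax normalization, the reliability definition `1 - H / log(k)`, and the additive fusion rule below are assumptions, as are all function names and scores.

```python
import math
from collections import defaultdict


def evidence_distribution(scores, temperature=1.0):
    """Softmax over one retriever's scores (assumed normalization)."""
    top = max(scores.values())
    exps = {doc: math.exp((s - top) / temperature) for doc, s in scores.items()}
    total = sum(exps.values())
    return {doc: e / total for doc, e in exps.items()}


def reliability(dist):
    """1 minus normalized entropy: near 1 when peaked, near 0 when flat (assumed form)."""
    if len(dist) < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
    return 1.0 - entropy / math.log(len(dist))


def fuse(score_lists):
    """Sum each list's distribution, weighted by that list's reliability."""
    fused = defaultdict(float)
    for scores in score_lists.values():
        dist = evidence_distribution(scores)
        weight = reliability(dist)
        for doc, p in dist.items():
            fused[doc] += weight * p
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical score lists keyed by (expert, granularity).
score_lists = {
    ("dense", "sentence"): {"c1": 5.0, "c2": 0.0, "c3": 0.0},
    ("sparse", "passage"): {"c1": 1.0, "c2": 1.0, "c3": 1.0},
}
print(fuse(score_lists))
```

Here the flat sparse list carries almost no weight because its entropy is maximal, so the peaked dense list decides the ranking. No training is involved, which matches the training-free framing of the abstract.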

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

Aruna Dey, Suraj Biswas — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13556v1 Announce Type: new Abstract: Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.
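The formulas in this abstract can be traced with a short sketch. The set point, deviation, and blending rule follow the abstract; the exponential form of w(t), the `half_life` parameter, and the function names are assumptions made for illustration, since the abstract does not specify w(t).

```python
def genomic_set_point(mu, betas, allele_counts):
    """G-hat = mu + sum(beta_i * g_i), as stated in the abstract."""
    return mu + sum(b * g for b, g in zip(betas, allele_counts))

def decayed_set_point(g_genomic, p_bar, t, half_life=30.0):
    """G-hat_t = w(t)*G-hat_genomic + [1 - w(t)]*P-bar_t.

    w(t) = 0.5 ** (t / half_life) is an assumed decay schedule.
    """
    w = 0.5 ** (t / half_life)
    return w * g_genomic + (1.0 - w) * p_bar

def deviation(p, g_hat):
    """delta = P - G-hat: negative suggests suppression, positive enhancement."""
    return p - g_hat

# The abstract's HRV example: one observation, two different priors.
observed_hrv = 55.0
delta_high_prior = deviation(observed_hrv, 80.0)  # -25.0, suppression hypothesis
delta_low_prior = deviation(observed_hrv, 30.0)   # +25.0, enhancement hypothesis
```

The sign reversal comes entirely from the prior, which is the point the abstract makes about the personalized anchor.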

Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models

Shengqiang Zhang, Ruotong Liao, Volker Tresp, Barbara Plank, Hinrich Schütze — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13558v1 Announce Type: new Abstract: Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native structures of bitwise-residual VAR models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. We propose BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity. BitEdit performs source-negative guidance by tilting the post-CFG per-bit log-odds along a source--target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive with it. Ablations show BitEdit and ResEdit play complementary roles in target alignment and background preservation.

ReSCom: A Reconfigurable Spiking Neural Network Accelerator Using Stochastic Computing

Ali Alipour Fereidani, Mohammad Rasoul Roshanshah, Saeed Safari — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13560v1 Announce Type: new Abstract: Spiking Neural Networks (SNNs) provide an attractive framework for energy-efficient inference due to their event-driven computation and biologically inspired dynamics. However, efficient hardware realization of SNNs remains challenging because neuronal computations incur significant power and area costs, and uncontrolled approximate arithmetic can destabilize recurrent state updates when precision is not properly managed. To address these challenges, this paper presents ReSCom, a reconfigurable SNN accelerator that leverages stochastic computing to reduce hardware complexity while maintaining stable inference. The proposed architecture employs stochastic arithmetic for multiplication operations in neuron dynamics, while preserving exact fixed-point addition/subtraction operations. This stochastic strategy enables runtime trade-offs between accuracy, latency, and energy consumption. A unified reconfigurable neuron design supports Integrate-and-Fire (IF), Leaky Integrate-and-Fire (LIF), and Synaptic neuron models within a single hardware framework. Experimental results for MNIST inference on a Xilinx Artix-7 FPGA show that ReSCom achieves $92.80\%$ classification accuracy while consuming just $0.05~\mathrm{mJ}$ of operational energy per image at $100~\mathrm{MHz}$, outperforming the energy efficiency of recent state-of-the-art implementations. Furthermore, managing the stochastic bit-stream length allows explicit, dynamic control over accuracy-latency-energy trade-offs to meet target application constraints.
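As background for the stochastic multiplication mentioned in this abstract, the following sketch shows the textbook unipolar scheme: two values in [0, 1] are encoded as independent random bit-streams and multiplied with a bitwise AND. Whether ReSCom uses this exact encoding is an assumption; the function names and the software random number generator are illustrative only.

```python
import random

def to_bitstream(p, length, rng):
    """Encode a value p in [0, 1] as a random bit-stream with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(a, b, length, seed=0):
    """Estimate a * b by ANDing two independent unipolar bit-streams."""
    rng = random.Random(seed)
    stream_a = to_bitstream(a, length, rng)
    stream_b = to_bitstream(b, length, rng)
    ones = sum(x & y for x, y in zip(stream_a, stream_b))
    return ones / length

# Longer streams cost more cycles but reduce the variance of the estimate.
short_estimate = sc_multiply(0.5, 0.6, 64)
long_estimate = sc_multiply(0.5, 0.6, 20000)
```

The stream length is the knob behind the accuracy-latency-energy trade-off: the estimator's standard deviation shrinks roughly as one over the square root of the length.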

Contrast-Informed Augmentation and Domain-Adversarial Training for Adult-to-Neonatal MR Reconstruction Generalization

Stephen Moore, Lara Leijser, Richard Frayne, Roberto Souza — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13562v1 Announce Type: new Abstract: Purpose: To investigate whether contrast-informed data augmentation and domain-adversarial training improve the adult-to-neonatal generalization of the E2E-VarNet. Methods: Three training regimes were investigated: (1) adult-only training with unaugmented adult data, (2) mixed training with paired unaugmented and neonatal-informed augmented adult data, and (3) mixed training with a domain-adversarial objective. Models were trained on retrospectively undersampled multi-coil adult T2-weighted brain MR data and evaluated on neonatal and adult test data at acceleration factors $R=4$ and $R=8$ using quantitative metrics and qualitative evaluation. Feature analyses assessed whether domain-adversarial training altered the latent representations of unaugmented adult, augmented adult, and neonatal test samples. Results: Mixed training (Mixed) and mixed domain-adversarial training (Mixed-DAT) outperformed unaugmented adult-only training (Unaug-Only) when evaluated on neonatal data. At R=4, Mixed-DAT achieved the best performance (SSIM = 0.924 +/- 0.027, PSNR = 33.98 +/- 1.15 dB). At R=8, Mixed-DAT performed best when measured using SSIM (0.848 +/- 0.031 vs. 0.766 +/- 0.037 for Unaug-Only and 0.814 +/- 0.035 for Mixed) and Mixed performed best when measured using PSNR (29.56 +/- 0.83 dB vs. 26.26 +/- 0.78 dB for Unaug-Only and 29.43 +/- 0.83 dB for Mixed-DAT). Qualitative assessment of t-SNE plots suggested that Mixed-DAT increased the overlap among the latent representations of the unaugmented adult, augmented adult, and neonatal test data. Conclusion: Contrast-informed augmentation and domain-adversarial training improved adult-to-neonatal generalization of deep learning-based MR reconstruction. These findings suggest that contrast-informed data augmentation combined with adversarial training may improve robustness to domain shift in undersampled neonatal MR reconstruction.

Differentially Private Hierarchical Heavy Hitters

Ari Biswas, Graham Cormode, Yaron Kanza, Divesh Srivastava, Zhengyi Zhou — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13563v1 Announce Type: new Abstract: The task of finding _Hierarchical_ Heavy Hitters (HHH) was introduced by Cormode et al. [VLDB 2003] as a generalisation of the heavy hitter problem. While finding HHH in data streams has been studied extensively, the question of releasing HHH when the underlying data is private remains unexplored. In this paper, we study differentially private HHH release in both the streaming and non-streaming setting. In the non-streaming setting, we show the surprising result that the relative error in estimating the residual count for any prefix is independent of the height of the hierarchy and the number of heavy hitters in the stream. Meanwhile, in the streaming setting, although the exact version of HHH has low global sensitivity (as counting queries are 1-sensitive), the approximation functions due to streaming have high global sensitivity, linear in the available space. Despite this obstacle, we show that the absolute error for estimating frequencies in the streaming setting is independent of the available space.

A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding

Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13565v1 Announce Type: new Abstract: Discrete diffusion models offer a simple and stable likelihood-based framework for sequence generation, recently extended to any-length settings via token insertion. Principled reward-guided fine-tuning for any-length discrete diffusion, however, remains largely unexplored. We introduce Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding (A2D2), a unified framework for reward-guided fine-tuning of any-length discrete diffusion models via joint optimization of the insertion and unmasking policies together with a quality-based inference schedule. We derive the Radon-Nikodym derivative for the joint insertion-unmasking path measures, enabling theoretically guaranteed convergence to the intractable reward-tilted sequence distribution without requiring target samples. Building on this, we establish unmasking and insertion quality as tractable approaches for minimizing decoding error and introduce the Adaptive Joint Decoding (AJD) loss, which provably yields the optimal path measure that generates the reward-tilted distribution. Empirically, A2D2 improves reward optimization while enhancing generation flexibility and accuracy over prior fixed-length fine-tuning and inference-time guidance methods.

A Three-Layer Framework for AI in Scientific Discovery

Guojun Liao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13566v1 Announce Type: new Abstract: Current discussions of AI in scientific discovery are often dominated by two visible capabilities: search over existing knowledge and execution through optimization, simulation, and automation. Both are important, but neither fully captures the central act of discovery: the formation and evolution of models. This paper proposes a three-layer view of AI in discovery. Layer 1 is search and retrieval by large language models. Layer 2, as the main innovation of this paper, is model formation through qualitative reasoning: the capacity to recognize when a current framework is structurally inadequate and to understand the problem within a broader representational space, not through trial and error, but through structural insight into what is missing and where it can be found. Layer 3 is execution, optimization, and refinement. The main claim is that Layer 2 is both the most important and the least developed. Search without model formation remains confined to inherited frameworks, while execution without conceptual revision only amplifies an existing formulation. We illustrate Layer 2 reasoning through three case studies: S. S. Chern's intrinsic proof of the Gauss-Bonnet theorem, the resolution of the Nesterov Accelerated Gradient convergence problem via Lyapunov functions, and the autonomous disproof of the Erdos unit distance conjecture by OpenAI in 2026. Each case exhibits the same structural signature: a framework that had become inadequate, a missing conceptual object, and a resolution found in an unexpected neighboring field.

Adjusted Cup-Product Neural Layer

Snigdha Chandan Khilar — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13568v1 Announce Type: new Abstract: Many important observables in physics and geometry are cup products of cochains. This paper introduces the adjusted cup-product neural layer, a neural primitive that hard-wires the cup product together with an adjustment term from higher gauge theory, yielding a readout that is gauge invariant by design. The main theoretical result shows that on a closed cycle the output depends entirely on the adjustment coefficient: setting this coefficient to zero removes the output completely, regardless of the other parameters. Thus the adjustment is the only source of gauge-invariant signal. The paper proves that this observable is a nonzero quadratic form and is exactly invariant under one and two gauge transformations.

Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting

Yifan Hu, Hongzhou Chen, Peiyuan Liu, Yiding Liu, Zewei Dong, Jiang-Ming Yang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13571v1 Announce Type: new Abstract: Real-world time series are often highly incomplete and irregular due to sensor dormancy, transmission delays, and event-driven sampling, making reliable forecasting fundamentally challenging. Existing methods have evolved from impute-then-forecast pipelines to continuous-time models such as Neural ODEs and continuous-time graph networks. While these approaches improve the modeling of historical irregularity, they still rely on an implicit oracle assumption at inference time: the timestamps of future valid observations are presumed to be known in advance. This assumption limits practical relevance, since in many real systems the more fundamental question is not only what the future value will be, but also whether a valid observation will occur at all. In this paper, we propose Timeflies, a unified framework that reformulates forecasting as a joint problem of future observability inference and value estimation. To explicitly model the interaction between observation dynamics and state evolution, Timeflies adopts an observation stream and a value stream, coupled through three dedicated modules for reliability-aware embedding, observation-guided dependency modeling, and joint prediction. We further construct Shadow, a benchmark that combines natural missingness from public datasets with real-world industrial data, and introduce the Observation-Value Joint Entropy (OVJE) metric to comprehensively evaluate this coupled predictability. Extensive experiments show that Timeflies consistently outperforms existing methods, highlighting the importance of explicitly modeling future observability in time series forecasting with missing values. Code and dataset are available at https://github.com/ant-intl/Timeflies.

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13572v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ArogyaSutra/

Learning with Simulators: No Regret in a Computationally Bounded World

Sasha Voitovych, Abhishek Shetty, Noah Golowich, Alexander Rakhlin — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13576v1 Announce Type: new Abstract: Understanding the minimal assumptions necessary for generalization is the fundamental question in learning theory. Unfortunately, most results rely heavily on independence (or some proxy thereof) of the data-generating process, while results for strongly dependent data are far more limited. Towards addressing this gap, we introduce the framework of simulatable processes, where the learner has access to a simulator that approximates the distribution generating the data (which may be an arbitrarily complex and dependent process). Surprisingly, given access to such a simulator, we show that we can recover the same learning guarantees as in the classical setting with independent data, namely, error bounds that depend on the VC dimension. Further, we use this framework to study the power of conditional sampling and show strict statistical and computational advantages in this setting. As a highlight of our framework, we exhibit a single algorithm that simultaneously learns any given VC class under all processes samplable in bounded polynomial time, with regret controlled by the time-bounded Kolmogorov complexity of the process. This provides a significant conceptual broadening of the classical PAC model.

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13578v1 Announce Type: new Abstract: Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.

EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution

Dachun Kai, Jiayao Lu, Yueyi Zhang, Xiaoyan Sun — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13580v1 Announce Type: new Abstract: Event-based vision has drawn increasing attention owing to its distinctive properties, including ultra-high temporal resolution and extreme dynamic range. Recent works have introduced it to video super-resolution (VSR) to enhance flow estimation and temporal alignment. In contrast, this paper shifts the focus of event signals from motion refinement to texture enhancement in VSR. We propose EvTexture++, the first event-driven framework dedicated to texture enhancement in VSR. It leverages high-frequency spatiotemporal details from events to improve texture recovery. EvTexture++ incorporates a customized texture enhancement branch, along with an iterative texture enhancement module that progressively exploits high-temporal-resolution event information for texture restoration. This enables gradual refinement of texture regions across iterations, yielding more accurate and detailed high-resolution outputs. Besides intra-frame texture recovery, large motions could degrade inter-frame temporal consistency, particularly in texture regions, leading to texture flickering. To mitigate this, we further exploit the continuous-time motion cues of events to enhance temporal consistency, introducing a temporal texture alignment module that estimates event-guided texture-aware flow for precise inter-frame texture alignment. Moreover, EvTexture++ is designed as a plug-and-play tool to flexibly boost the performance of existing VSR models. Experiments on five datasets demonstrate that EvTexture++ achieves state-of-the-art performance. When integrated into recent VSR models, it yields significant improvements, with gains of up to 1.55 dB in PSNR on the texture-rich Vid4 dataset. Code: https://github.com/DachunKai/EvTexture.

The Tone of Awareness: Topic, Sentiment, and Toxicity Maps During Mental Health Month on TikTok

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13581v1 Announce Type: new Abstract: Despite rising concerns about the mental health effects associated with TikTok use, little is known about how related content is framed by creators and received by audiences. We collect the content of 28,341 TikTok videos and 80,130 comments from Mental Health Awareness Month (May) in 2023 and 2024 via the TikTok Research API, and study how the tone of awareness varies across topics and years. We characterize "tone" as the emotional and interpersonal framing of mental health discourse, operationalized through sentiment and toxicity measures. We extract topics from video text using BERTopic and log-odds keywords, then quantify topic-conditioned sentiment (XLM-T) and toxicity (Detoxify) separately for video transcriptions and comments. Sentiment captures the affective valence of content, while toxicity reflects the presence of harmful or abusive language. We find a stable set of recurring themes across years, spanning clinical conditions, emotional disclosure, self-care, and campaign-oriented content, with engagement highly skewed toward a small subset of topics. All sentiment and toxicity analyses are computed separately for video content and comments, allowing us to distinguish between content production and audience reception. Sentiment in videos is often negative for emotionally charged topics, while comments tend to shift toward more mixed or positive polarity, especially for suicide prevention. Median toxicity is low overall, but comments exhibit longer-tailed outliers than videos, concentrated in specific topics (e.g., "Duet", "Suicide Prevention", and "Psychisch"). Overall, our results provide a topic-level decomposition of mental health discourse on TikTok during awareness-month campaigns.

Testing Bipartiteness in Logarithmic Rounds

Yumou Fei, Ronitt Rubinfeld — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13583v1 Announce Type: new Abstract: The seminal work of Goldreich and Ron (\textit{Combinatorica, 1999}) showed that bipartiteness of bounded-degree graphs can be tested using $O(\sqrt{n\log n})$ random walks of length $O(\log^{6} n)$. In this work, we improve their result by showing that $O(\sqrt{n})$ random walks of length $O(\log n)$ suffice. As a corollary, we obtain an $O(\log n)$-pass, $O(\sqrt{n}\log n)$-space streaming algorithm for testing bipartiteness, whose pass complexity is optimal in light of a recent lower bound of Fei, Minzer, and Wang (\textit{arXiv, 2026}). Our proof takes a different approach from that of Goldreich and Ron, using the semidefinite programming relaxation for Max-Cut introduced by Goemans and Williamson (\textit{J. ACM, 1995}).

Towards Effective Waste Segmentation for Automated Waste Recycling in Cluttered Background

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13587v1 Announce Type: new Abstract: The rapid expansion of urban areas and population growth are causing an immense increase in waste production, which demands efficient and automated waste management. In this scenario, automated waste recycling (AWR) using deep learning methods can assist humans in optimal waste management. Recent deep learning approaches for AWR provide promising waste segmentation performance; however, these methods rely on large backbone networks that are inefficient for AWR systems and suffer from performance deterioration in cluttered scenes. To this end, an optimal waste segmentation network is introduced which effectively utilizes the spatial domain to capture localized structural dependencies and the spectral domain to efficiently extract global contextual relationships. This cascaded design allows the network to progressively leverage both local and global representations across complementary domains to highlight the semantic information necessary for effective segmentation of various waste objects. Furthermore, an auxiliary feature enhancement module (AFEM) is introduced to enhance target-object boundaries and provide blob amplification for better segmentation in cluttered scenarios. Extensive experimentation on the ZeroWaste-aug, ZeroWaste-f, and SpectralWaste datasets reveals the merits of the proposed method.

Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

Meher Sai Preetam, Meher Bhaskar — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13589v1 Announce Type: new Abstract: We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) assign uniform voting power to all constituent estimators. However, this naive uniform prior ignores the varying local competence of base estimators and contributes to model overconfidence. We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex by minimizing the Out-Of-Bag (OOB) loss. To induce sparsity, we address the theoretical "L1-simplex paradox" -- the mathematical reality that the L1 norm is constant on the simplex and fails to prune -- by introducing a concave quadratic penalty. SCSB is model-agnostic and achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration (lowered Expected Calibration Error) while preserving or enhancing generalization accuracy.
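The "L1-simplex paradox" named in this abstract can be checked numerically: every weight vector on the probability simplex has L1 norm exactly 1, so an L1 penalty cannot prefer sparse weights. The concave quadratic penalty below, `lam * sum(w_i * (1 - w_i))`, is one hypothetical choice used purely for illustration; the abstract does not give the paper's exact penalty.

```python
def l1_norm(w):
    """L1 norm; equals 1 for every point on the probability simplex."""
    return sum(abs(x) for x in w)

def concave_penalty(w, lam=1.0):
    """Assumed concave quadratic penalty: lam * sum(w_i * (1 - w_i)).

    On the simplex this equals lam * (1 - ||w||_2^2), which is largest
    for uniform weights and zero at a vertex (a single estimator).
    """
    return lam * sum(x * (1.0 - x) for x in w)

uniform = [0.25, 0.25, 0.25, 0.25]   # standard bagging weights
sparse = [0.5, 0.5, 0.0, 0.0]        # two estimators pruned
vertex = [1.0, 0.0, 0.0, 0.0]        # one estimator kept

l1_values = [l1_norm(w) for w in (uniform, sparse, vertex)]
penalties = [concave_penalty(w) for w in (uniform, sparse, vertex)]
```

Here `l1_values` is constant across the three weightings while `penalties` decreases as the weights concentrate, so only the concave term rewards pruning.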

Multiagent Protocols with Aggregated Confidence Signals

Ali Elahi, Barbara Di Eugenio — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13591v1 Announce Type: new Abstract: Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself. We introduce three protocols that produce a final answer along with a single aggregated confidence by first transforming raw confidence signals to make them comparable across models, then combining them via soft voting or a probability fusion we call Bayesian fusion. This aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and recovers the losses MAD incurs on more ambiguous tasks. Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it. We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.

See What I See, Know What I Think: Dense Latent Communication Across Heterogeneous Agents

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13594v1 Announce Type: new Abstract: Multi-agent systems communicate mostly through text, paying a lossy and expensive decode and re-encode cost. KV-cache communication is a promising alternative, yet most prior work is homogeneous, using duplicate copies of the same model, and avoids the central challenge of cross-model latent alignment; existing heterogeneous methods are also restrictive, typically assuming shared input and using transferred caches mainly for steering. We study a more fundamental question: can heterogeneous agents be aligned well enough to perform real "mind reading" and transfer both what one agent sees and how it thinks? Our information-structure analysis reveals a duality: context-aware transfer is driven by sparse reasoning signals, while context-unaware transfer, where the receiver sees no input, requires dense contextual knowledge preservation. Motivated by this, we propose dense alignment for heterogeneous KV-cache communication via a lightweight cross-model cache transformation and two-phase training: reconstruction followed by generation. Across all six directions of {Qwen3-4B, 8B, 14B} and six in-domain and out-of-domain benchmarks, our method outperforms prior heterogeneous baselines, matches or exceeds text communication in context-aware settings at roughly 2 to 3 times lower compute, and remains effective in context-unaware transfer where prior methods collapse.

Reward Modeling for Multi-Agent Orchestration

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13598v1 Announce Type: new Abstract: Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at https://github.com/Wang-ML-Lab/OrchRM.
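For readers unfamiliar with Bradley-Terry reward modeling, here is a minimal sketch of the standard pairwise objective on win-lose pairs. It uses plain scalar rewards instead of a learned model, and the function names are illustrative; how OrchRM builds the pairs and parameterizes the reward is not specified in the abstract.

```python
import math

def bt_win_probability(r_win, r_lose):
    """Bradley-Terry probability that the 'win' item beats the 'lose' item."""
    return 1.0 / (1.0 + math.exp(-(r_win - r_lose)))

def bt_loss(pairs):
    """Mean negative log-likelihood over (reward_win, reward_lose) pairs."""
    return -sum(math.log(bt_win_probability(rw, rl)) for rw, rl in pairs) / len(pairs)

# Hypothetical reward scores for two orchestration traces in a win-lose pair.
well_ordered = bt_loss([(2.0, 0.0)])   # reward model agrees with the label
mis_ordered = bt_loss([(0.0, 2.0)])    # reward model disagrees with the label
```

Minimizing this loss pushes the reward of the winning trace above that of the losing one; with equal rewards the loss equals log 2.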

MCR-Bionic Hand: Anatomical Structural Priors for Dexterous Manipulation

Haosen Yang, Guowu Wei — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13601v1 Announce Type: new Abstract: Dexterous robotic hands are usually formulated as high dimensional active control systems governed by degrees of freedom, actuation, and algorithms. Human hand dexterity, however, is partly encoded in the physical architecture of bones, ligaments, tendons, aponeuroses, and intrinsic muscles. This work describes that contribution as two linked forms of structural intelligence: structural prior generation, in which wrist to finger tenodesis, FDS/FDP routing, and the dorsal extensor hood transform low dimensional posture inputs into default grasp configurations and PIP to DIP coordination; and muscle mediated modulation, in which extrinsic muscles, lumbricals, and interossei regulate MCP posture, distal stability, fingertip force paths, and contact states around that default state. Based on this framework, MCR-Bionic Hand is developed as a 1:1 musculoskeletal biomimetic hand integrating a two row eight bone wrist, cross wrist tendons, anatomical flexor routing, volar plate and collateral ligament constraints, the dorsal extensor hood, and intrinsic muscle pathways within one body. Functional demonstrations and geometric mechanical models show that wrist posture induces multi joint pre shaping, the extensor hood maps PIP posture to a coupled DIP response, and intrinsic plus pathways modulate distal stability and fingertip action direction after grasp formation. Contact rich tasks, including coin rotation, pen transfer, dorsal coin flipping, and cube manipulation, show that MCR-Bionic links low dimensional state generation with fine post contact modulation. These results suggest that anatomical biomimetics is valuable not for visual similarity, but for identifying human hand structures that perform part of control.

EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

Harihara Muralidharan, Reema Baskar, Soo Hee Lee, Tim Proctor, Kenny Workman — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13602v1 Announce Type: new Abstract: We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark includes 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. Across 5,088 valid trajectories from 16 model-harness pairs, no system passed a majority of attempts: GPT-5.5 / Pi led at 45.0% (143/318 attempts; 95% confidence interval (CI), 36.3–53.7), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts; 95% CI, 31.6–48.3). Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi each passed 39.0% (124/318 attempts; 95% CI, 30.2–47.8 and 31.0–47.0, respectively). Performance varies across assay types, and many failed runs still contain parts of the correct answer. Agents often found the right files and computed useful intermediate results, but failed when the task required deeper, assay-specific scientific judgment.

Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13603v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer remains poorly understood. We estimate each step's causal importance via early exit and use this measure to study how answers form across the reasoning traces of several model families. Across diverse tasks, we find that reasoning typically crosses a commitment boundary: a sharp transition from transient intermediate guesses to a stable, high-confidence answer. This transition often happens in a single step, well before the model's reasoning block ends, and is followed by epiphenomenal CoT steps that leave the final answer probability unaltered. Using attention probes, we show that answer-formation stages can be linearly decoded from intermediate reasoning steps with high accuracy and that the probes generalize robustly to unseen reasoning tasks. We exploit this signal to early-exit reasoning blocks at the commitment boundary, reducing CoT length by up to 55% on average with negligible impact on model performance.
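
One way to make the commitment boundary concrete is sketched below. It assumes we already have, for each reasoning step, the probability an early exit at that step assigns to the eventual final answer; the threshold value and the "stays above threshold until the end" rule are illustrative assumptions, not the paper's exact definition.

```python
from typing import Optional, Sequence

def commitment_boundary(final_answer_probs: Sequence[float],
                        threshold: float = 0.9) -> Optional[int]:
    """Return the first step index after which the early-exit probability
    of the final answer never drops below `threshold`, or None.

    `final_answer_probs[i]` is the (hypothetical) probability of the final
    answer if reasoning were cut off after step i.
    """
    boundary = None
    # Walk backwards: extend the stable suffix while it stays confident.
    for i in range(len(final_answer_probs) - 1, -1, -1):
        if final_answer_probs[i] >= threshold:
            boundary = i
        else:
            break
    return boundary

def epiphenomenal_fraction(final_answer_probs: Sequence[float],
                           threshold: float = 0.9) -> float:
    """Fraction of steps that come after the boundary step."""
    boundary = commitment_boundary(final_answer_probs, threshold)
    if boundary is None:
        return 0.0
    return (len(final_answer_probs) - 1 - boundary) / len(final_answer_probs)
```

Under this toy definition, the steps after the boundary are the ones an early-exit policy could skip.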

Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

Haochen Wu, Yi Hou, Shiguang Xie — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13604v1 Announce Type: new Abstract: Dispatch in three-sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We present a deployed reinforcement learning system at DoorDash that adapts dispatch objective weights in a large-scale food-delivery marketplace using delayed signals. Rather than replacing the combinatorial assignment optimizer, a store-level policy learned from logged marketplace data selects a discrete multiplier that shifts the dispatch optimizer's tradeoff between delivery quality and batching efficiency. This interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function using centralized offline data and decentralized store-level execution, with Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality. Results illustrate how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.
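
The abstract names Double Q-learning targets and a conservative regularizer without giving formulas. The sketch below shows the textbook Double Q-learning target and one common conservative penalty (a log-sum-exp gap, in the style of conservative Q-learning); whether the deployed system uses this exact regularizer is an assumption, and all names are hypothetical.

```python
import math
from typing import Sequence

def double_q_target(reward: float, gamma: float,
                    q_online_next: Sequence[float],
                    q_target_next: Sequence[float],
                    done: bool = False) -> float:
    """Double Q-learning target: the online network picks the action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]

def conservative_penalty(q_values: Sequence[float], logged_action: int) -> float:
    """log-sum-exp over all actions minus the Q-value of the logged action.

    The penalty is non-negative and grows when actions not seen in the
    logged data receive high values (assumed CQL-style form).
    """
    peak = max(q_values)
    log_sum_exp = peak + math.log(sum(math.exp(q - peak) for q in q_values))
    return log_sum_exp - q_values[logged_action]
```

Here the discrete actions would correspond to the multipliers a store-level policy can select.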

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

Zach Studdiford, Gary Lupyan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13607v1 Announce Type: new Abstract: When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's behavior does not exhibit the same types of failures because human reasoning uses principled and abstract world models. We evaluate human participants and 25 LLMs on their ability to engage in common-sense reasoning about a variety of everyday situations and observe similar patterns of errors in both people and models. We then identify the set of attention heads driving LLM responses and find that these heads implement a form of pattern-matching. These attention heads allow us to predict seemingly inexplicable reasoning errors in people caused by ostensibly irrelevant prompt details. Taken together, our results suggest that everyday causal reasoning in people and LLMs is more consistent with a form of pattern-matching than with abstract world models.

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13608v1 Announce Type: new Abstract: Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

Minghao Luo, Liang Chen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13610v1 Announce Type: new Abstract: Search-augmented LLMs increasingly mediate everyday consumer recommendations by retrieving live web content. This creates a new risk: generative recommenders may consume polluted web content, such as fake reviews and promotional pages crafted to mislead recommendations. We ask: to what extent do search-augmented LLMs become unwitting promoters of fake products when consuming polluted retrieval results? To answer this, we introduce FORGE (Fake Online Recommendations in Generative Environments), a benchmark for measuring fake-product promotion under controlled web-content pollution. Given an upstream search result, FORGE locally rewrites real products in retrieved web pages into fake ones to simulate web-content pollution, and measures how often the LLM recommends the fake product. FORGE covers 225 real-world products across 15 categories and 5 consumer scenarios. Across 12 commercial and open-weights LLMs, all models are vulnerable: a single polluted page yields fooled rates of up to 27%, while the full top-3 replacement raises this to 73.8%. Vulnerability varies substantially across categories, increasing when models lack stable prior knowledge of the relevant products. Reasoning does not mitigate this vulnerability; instead, it often generates spurious social proof to justify false recommendations. We evaluate three defenses: skepticism prompting and two forms of consensus filtering (over model priors or over cross-document evidence). Skepticism can exacerbate vulnerability, much like reasoning, while filtering risks suppressing legitimate products. We release FORGE at https://github.com/leoluolol/forge-benchmark.

Beyond the IT Checklist: Engineering a Reasonable Standard of Care for Cyber Safety

Matthew E. Jablonski, Linton Wells II, Kathryn B. Laskey, F. Brett Berlin — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13612v1 Announce Type: new Abstract: Current U.S. cyber policy, centered on security, often treats documentation of controls and incident reports as a proxy for safety in the built environment. This paper argues that such an approach is inadequate for cyber-physical systems, where digital failures can produce kinetic harm. We construct and code a corpus of critical infrastructure policy documents (N=292, 2000-2025) to examine how "reasonable care" is operationalized across the NIST SP 800-160 Vol. 2 resilience lifecycle. The resulting maps show that obligations are concentrated in the Anticipate phase and emphasize administrative compliance, while Withstand and Recover phases rely heavily on delegated references to IT-focused control catalogs that are poorly aligned with physics-based hazards. We identify three major disconnects: miscalibrated delegated standards, recovery defined as notification rather than engineered navigation, and uneven adaptation requirements across sectors. We then propose a modernized standard of care anchored in hazard-specific traceability, structured assurance cases, and cyber resiliency engineering. Finally, we recommend that federal policy pair these engineering obligations with targeted incentives so that resilient architectures for critical infrastructure become a viable business decision rather than an unfunded expectation.

Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks

Achraf Hsain, Sultan Almuhammadi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13621v1 Announce Type: new Abstract: Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal-logic specifications into automata restricting an agent's actions. We argue this is the wrong product. The same automata-theoretic machinery -- specification compilation, product game construction, attractor computation, and winning-region extraction -- is better read as a design-time analytical instrument whose outputs are structural insights about a system rather than runtime constraints on a deployed agent. We instantiate this through a constrained two-player safety game for network defense. The two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, whereas the attacker specification restricts the adversary's legal actions during attractor computation. Solving the game yields a defensibility verdict -- a formal certificate that a topology-specification pair is or is not defensible -- with the associated winning region and shield. Beyond the binary verdict, we derive topology-level metrics from the attractor structure and combine them with post-convergence behavior from shield-constrained adversarial multi-agent reinforcement learning. Together these form a defensibility fingerprint capturing both a network's formal safety properties and its operational behavior under adaptive play. A what-if analysis shows that formal defensibility and operational effectiveness capture distinct aspects of security: small architectural changes can produce large shifts in operational outcomes while leaving formal safety margins nearly unchanged. Shield synthesis is thus most valuable not as a deployment mechanism for safe agents, but as a framework for answering architectural questions about whether, where, and how a system can be defended. The defensibility verdict is the output, not the safe policy.
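
The attractor computation and winning-region extraction named in the abstract follow a standard fixed-point pattern for turn-based safety games. The sketch below is a generic version of that pattern on a tiny hand-made graph, not the paper's network model; it assumes the attacker's edges have already been filtered by the attacker specification, and every name in it is hypothetical.

```python
def attacker_attractor(successors, attacker_nodes, unsafe):
    """States from which the attacker can force a visit to `unsafe`.

    successors: dict mapping each state to a list of successor states.
    attacker_nodes: states where the attacker moves; all others are defender states.
    """
    attractor = set(unsafe)
    changed = True
    while changed:
        changed = False
        for node, succs in successors.items():
            if node in attractor:
                continue
            if node in attacker_nodes:
                # The attacker needs only one edge into the attractor.
                forced = any(s in attractor for s in succs)
            else:
                # The defender is trapped only if every edge leads into it.
                forced = bool(succs) and all(s in attractor for s in succs)
            if forced:
                attractor.add(node)
                changed = True
    return attractor

def defender_winning_region(successors, attacker_nodes, unsafe):
    """Complement of the attacker attractor: states the defender can keep safe."""
    return set(successors) - attacker_attractor(successors, attacker_nodes, unsafe)

def shield(successors, attacker_nodes, winning):
    """For each winning defender state, the moves that stay in the winning region."""
    return {node: [s for s in succs if s in winning]
            for node, succs in successors.items()
            if node in winning and node not in attacker_nodes}

def is_defensible(initial_state, winning):
    """Binary verdict: is the initial state inside the defender's winning region?"""
    return initial_state in winning
```

The verdict comes from `is_defensible`, while the size and shape of the attractor are the kind of structural by-product from which topology-level metrics could be derived.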

Finding Conservation Laws of Large Dynamical Systems with Tasks and Futures: A Case Study in Utilizing Dynamic Data Dependencies

Rüdiger Nather — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13623v1 Announce Type: new Abstract: As parallel workloads grow in complexity, managing fine-grained data dependencies becomes a critical challenge. Futures offer a promising model for handling these dependencies, particularly in irregular algorithms, but they also come with the restriction of value-immutability. This immutability limits the ability to perform in-place memory updates, a necessity for high-performance linear algebra where memory recycling is paramount. In this paper, we address these limitations by introducing a new construct, await_delete, which extends traditional future semantics to allow safe value reuse once consumers are finished. Building on this extension, we present a novel future-based algorithm for the block-wise inversion of dense, symmetric matrices, motivated by a recent algorithm for finding conservation laws of dynamical systems. We implement our approach in an extended version of Taskflow and evaluate it through strong-scaling experiments. Our results demonstrate that while futures incur significant overhead on smaller problem sizes, they achieve nearly linear scaling on large matrices. We analyze the amortization threshold and show that futures are a viable high-performance tool for large-scale linear algebra.

Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models

Jialin Gan, Xin Qiu, Guangzhe Chen, Xue Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13624v1 Announce Type: new Abstract: Large language models (LLMs) have enabled time series (TS) analysis by jointly modeling numerical observations and textual context through a shared token interface. However, TS tokens and prompt tokens exhibit fundamentally different information structures, making uniform token processing inefficient. In this paper, we study token efficiency in TS language modeling from an asymmetric-token perspective. We show that TS tokens have highly uneven spectral contributions, where many tokens share redundant frequency patterns while a small subset preserves critical temporal evidence. We also observe that prompt-token influence attenuates with model depth, suggesting that full prompt retention across all layers is unnecessary. Based on these findings, we develop an adaptive token budgeting framework that compresses TS tokens via frequency-domain structure and progressively reduces prompt tokens across layers. Experiments across forecasting, classification, imputation, and anomaly detection demonstrate up to 7.68× inference acceleration and performance gains in 78% of evaluated settings, showing the effectiveness of asymmetric token compression for scalable TS foundation models.

Revisiting Vehicle Color Recognition in Long-Tailed Surveillance Scenarios

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13625v1 Announce Type: new Abstract: Vehicle color recognition is an important cue for vehicle identification in surveillance systems, especially when license plates are illegible due to low resolution, occlusion, motion blur, or poor illumination. However, real-world vehicle color distributions are highly imbalanced, making overall accuracy insufficient to assess performance on rare but operationally relevant colors. This paper presents a comprehensive study of vehicle color recognition under severe class imbalance using UFPR-VeSV, a challenging real-world surveillance dataset. We investigate synthetic minority-class augmentation through two off-the-shelf generative strategies: text-conditioned image generation with RunDiffusion/JuggernautXL and image-conditioned color editing with Gemini 2.0 Flash. The curated synthetic data are combined with modern visual representations, loss reweighting, learning-rate scheduling, color-safe augmentation, foreground-aware preprocessing, and ensemble fusion. The best-performing approach achieves 94.6% micro accuracy and 79.7% macro accuracy, improving macro accuracy by 8.2 percentage points over recent literature. A manual error analysis further shows that many remaining failures are visually ambiguous even for human annotators, highlighting the practical limits of color-based vehicle identification in unconstrained surveillance imagery. The generated images and source code are publicly available at https://github.com/viniciusorru/vcr-synthetic

Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches

Kyuil Lee, Dezhi Yu, Yongkang Huang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13626v1 Announce Type: new Abstract: We study generative modeling of Bach-style symbolic piano music using a shared MIDI corpus and three model families: autoregressive LSTMs with attention, latent-variable models including recurrent VAEs and vector-quantized VAEs, and generative adversarial networks. We compare their ability to model polyphonic note sequences, learn useful latent representations, and generate stylistically coherent compositions. Our experiments show that the autoregressive LSTM with attention produces the most musically coherent samples, while vector quantization helps mitigate posterior collapse and yields more structured outputs than conventional recurrent VAEs. The adversarial approach captures local pitch patterns but remains difficult to train and generalizes less reliably to Bach's style. These results highlight the relative strengths and failure modes of autoregressive, latent-variable, and adversarial approaches for symbolic music generation.

A near-quadratic lower bound on the border determinantal complexity of $\sum_i x_i^n$ via conormal specialization

Karthik Sheshadri — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13628v1 Announce Type: new Abstract: The border determinantal complexity $\dcb(f)$ of a polynomial $f$ is the least $m$ such that $f$ is a limit of determinants of $m\times m$ matrices of affine-linear forms. We prove that for every $n\ge3$, over $\CC$, \[ \dcb\Big(\sum_{i=1}^n x_i^n\Big)\ \ge\ \frac{(n-1)^2}{4e}, \qquad \sdcb\Big(\sum_{i=1}^n x_i^n\Big)\ \ge\ \frac{(n-1)^2}{2e} \] in the ordinary and symmetric models respectively; both match the known $O(n^2)$ upper bounds up to the constant. To our knowledge these are the first border determinantal lower bounds for an explicit family that are superlinear in the number of variables: the known quadratic border bound for the permanent reads the \emph{dimension} of the dual variety and is linear in its number of variables, whereas we transfer the dual \emph{degree}. The proof has two ingredients. The first is an unconditional bound on the slot-$(n-2)$ conormal multidegree of the multiplicity-one Gauss-graph cycle of an arbitrary affine-linear determinant -- singular, reducible, and non-reduced fibers allowed -- by a multihomogeneous Bézout count of a lifted kernel incidence. The second is a specialization argument: along any degeneration $\det A_c\to\sum_ix_i^n$, the flat limit of these Gauss-graph cycles contains the conormal variety of the Fermat cone with positive coefficient. A cone-shift identity converts that conormal multidegree into the classical dual degree $n(n-1)^{n-2}$ of the smooth Fermat hypersurface, and an $(n-1)$-st root yields the quadratic bound. The exact lower bounds of the author's companion manuscripts follow as corollaries.

From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

Pedro Correa, Olivier Perrotin, Samir Sadok, Paula Costa, Thomas Hueber — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13630v1 Announce Type: new Abstract: The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We found that encoding phonetic classes is beneficial for accurate facial animation prediction on both semantic and label-based representations with comparable facial animation quality. From the latter, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that leverages, as a shared space, discrete representations to decode speech and 3D facial motion.

Beyond Virtual Delay: Improving Packet Delay Bound in Network Calculus

Yuming Jiang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13631v1 Announce Type: new Abstract: In network calculus, a fundamental result is the classical delay bound given by the horizontal deviation between the arrival and service curves. While widely used, the classical bound is derived from the notion of virtual delay. In this work, we first show that the maximum packet delay is always upper-bounded by the maximum virtual delay, revealing inherent conservatism when applying the virtual-delay-based bound to packet delay. Motivated by this insight, we revisit packet delay analysis and derive a new packet delay bound that requires no assumptions beyond the arrival and service curves. Specializing the new bound to a system with leaky-bucket arrival curve and rate-latency service curve shows strict improvement over the classical bound, which is further demonstrated through a case study in time-sensitive networking (TSN).
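
For orientation, the classical bound the abstract starts from has a well-known closed form in the leaky-bucket / rate-latency case: with arrival curve b + r*t and service curve R*max(0, t - T), where r <= R, the horizontal deviation is T + b/R. The sketch below computes that classical bound and checks it against a brute-force horizontal deviation; it does not implement the paper's improved packet delay bound, and the parameter names are assumptions.

```python
import math

def classical_delay_bound(burst: float, rate: float,
                          service_rate: float, latency: float) -> float:
    """Horizontal deviation between a leaky-bucket arrival curve
    (burst + rate*t) and a rate-latency service curve
    (service_rate * max(0, t - latency))."""
    if service_rate <= 0:
        raise ValueError("service_rate must be positive")
    if rate > service_rate:
        return math.inf  # arrivals outpace service, so the bound diverges
    return latency + burst / service_rate

def numeric_horizontal_deviation(burst: float, rate: float,
                                 service_rate: float, latency: float,
                                 horizon: float, steps: int = 1000) -> float:
    """Brute-force check: for each t on a grid, the smallest d such that
    service_rate * (t + d - latency) >= burst + rate * t.
    t = 0 stands for the right-limit of the arrival curve (value `burst`)."""
    worst = 0.0
    for i in range(steps + 1):
        t = horizon * i / steps
        d = latency + (burst + rate * t) / service_rate - t
        worst = max(worst, d)
    return worst
```

For example, a 3000-bit burst served at 10 Mbit/s behind 1 ms of latency gives a classical bound of 1.3 ms; the paper's claim is that the true packet delay bound can be strictly smaller than this value.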

Aerial Wildfire Suppression Planning with a Hybrid CNN-Cellular Automata Fire Model

Ion Matei, Maksym Zhenirovskyy, Takuya Kurihana, Rohit Vupala, Anthony Wong — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13633v1 Announce Type: new Abstract: Aerial wildfire suppression requires not only predicting fire spread, but also designing effective intervention strategies under operational and environmental uncertainty. We present a modeling and optimization framework for aerial wildfire suppression that combines a hybrid neural-cellular automaton wildfire model with gradient-based design of targeted aerial drops. The wildfire model predicts spatially varying spread behavior from terrain, fuel, and wind data, while the intervention module determines binary drop actions with continuous-valued location and orientation parameters mapped to the simulation grid. Water and retardant are represented with distinct suppression effects, corresponding to immediate reduction of active burning and persistent reduction of future spread. To evaluate the robustness of the resulting suppression plans, we quantify both aleatoric uncertainty through Monte Carlo sampling of daily fire-state realizations and epistemic uncertainty through spatially correlated prediction-error perturbations. A case study based on the 2020 Bear Fire shows that the framework can generate coherent aerial suppression schedules for reducing total fire-affected area and can support uncertainty-aware analysis of wildfire intervention strategies.

Operads for compositional reasoning in LLMs

Nathaniel Bottman, Kyle Richardson — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13634v1 Announce Type: new Abstract: Question decomposition, i.e. breaking a complex query into simpler sub-queries whose answers are composed to produce a final answer, is a widely used strategy for improving LLM reasoning, yet it currently lacks a rigorous mathematical foundation. In this paper, we propose operads, mathematical structures that model many-in, one-out operations and compositions thereof, as a natural framework for describing question decomposition. We define the questions operad $Q$, in which operations correspond to question templates and composition corresponds to substitution of sub-answers, and show how QA models can be interpreted as algebras over $Q$. Beyond reframing existing practice, this operadic perspective points toward new methods, in particular a notion of operadic consistency, which measures whether a QA model's answers agree across the partial collapses of a question decomposition tree. Empirical evaluation of operadic consistency is reported in our companion paper (Bottman, Liu, and Richardson, 2026), which finds it strongly correlated with accuracy across twelve LLMs and four multi-hop QA datasets and outperforming standard temperature-based self-consistency baselines. We argue that operads are the natural mathematical home for question decomposition, and that invariants such as operadic consistency open new directions for analyzing and improving the reliability of multi-step reasoning.

The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning

Ayushman Trivedi, Bhavika Melwani — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13637v1 Announce Type: new Abstract: Catastrophic forgetting is often viewed as the destruction of previously learned knowledge during sequential learning. Building on the Accessibility Collapse framework, we investigate the geometric structure of recoverability in continual learning. Using Split CIFAR-100 and a sequentially trained ResNet-18, we analyze recoverability, representational drift, and recovery complexity across ten tasks. We introduce Recovery Subspace Dimensionality (k_t), a measure of the minimum number of singular directions required to preserve 90 percent of full probe performance. Contrary to our Recoverability Diffusion hypothesis, recovery dimensionality remains stable throughout training (mean k_t = 8.0) despite substantial representational drift. Principal-angle drift strongly predicts recoverability (r = -0.862), and a simple geometric model explains 82.2 percent of recoverability variance. These findings support the Stable Recovery Manifold hypothesis, suggesting that forgotten knowledge remains compactly decodable despite representational reorganization. The results indicate that catastrophic forgetting is primarily an accessibility and manifold-alignment problem rather than information destruction.
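
The definition of k_t can be read as a simple threshold search. The sketch below assumes probe accuracies have already been measured for several candidate numbers of retained singular directions; the function name and the accuracy values in the usage note are illustrative only and are not results from the paper.

```python
from typing import Mapping, Optional

def recovery_subspace_dim(acc_by_k: Mapping[int, float],
                          full_probe_acc: float,
                          fraction: float = 0.9) -> Optional[int]:
    """Smallest k whose probe accuracy reaches `fraction` of full probe accuracy.

    acc_by_k maps a number of retained singular directions to the probe
    accuracy measured with only those directions (hypothetical input).
    Returns None if no candidate k reaches the target.
    """
    target = fraction * full_probe_acc
    for k in sorted(acc_by_k):
        if acc_by_k[k] >= target:
            return k
    return None
```

With made-up accuracies {1: 0.30, 2: 0.45, 4: 0.58, 8: 0.62} and a full-probe accuracy of 0.64, the target is 0.576 and the function returns 4.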

Tuning Agent-Based Predator-Prey Models Toward Lotka-Volterra Dynamics

Corinna Mandl, Siddharth Chaturvedi, Marcel van Gerven — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13639v1 Announce Type: new Abstract: Recent growth in compute power has made it increasingly feasible to use large-scale agent-based models to simulate complex adaptive systems. A central difficulty is that such models contain many local rules and parameters, where small changes can lead to runaway behaviour, population collapse, or saturation at artificial bounds. We study this problem in a continuous predator-prey system where sheep and wolves are active agents with local sensing, internal energy, and recurrent neural network-based controllers. We ask whether environmental and demographic parameters can be tuned so that the resulting population dynamics resemble classical Lotka-Volterra cycles. We optimise these parameters with a feature-based loss that rewards sustained oscillations, phase lag, bounded populations, and long-term persistence, first for random controllers and then for evolved controllers in a more naturalistic setting. The model is implemented in ABMax, a JAX-based agent-based modelling framework that enables efficient batched simulation on hardware accelerators.
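
As a reference point for the target dynamics, the sketch below integrates the classical Lotka-Volterra equations, dx/dt = alpha*x - beta*x*y and dy/dt = delta*x*y - gamma*y, with a fourth-order Runge-Kutta step. It is independent of ABMax and of the paper's agent-based model; the parameter values are arbitrary choices for illustration.

```python
import math

def lv_derivatives(prey, pred, alpha, beta, delta, gamma):
    """Right-hand side of the classical Lotka-Volterra equations."""
    return alpha * prey - beta * prey * pred, delta * prey * pred - gamma * pred

def rk4_step(prey, pred, dt, params):
    """One classical fourth-order Runge-Kutta step."""
    k1x, k1y = lv_derivatives(prey, pred, *params)
    k2x, k2y = lv_derivatives(prey + 0.5 * dt * k1x, pred + 0.5 * dt * k1y, *params)
    k3x, k3y = lv_derivatives(prey + 0.5 * dt * k2x, pred + 0.5 * dt * k2y, *params)
    k4x, k4y = lv_derivatives(prey + dt * k3x, pred + dt * k3y, *params)
    prey += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
    pred += dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6.0
    return prey, pred

def simulate(prey0, pred0, steps, dt, params=(1.0, 0.1, 0.075, 1.5)):
    """Return the list of (prey, predator) states, including the initial one."""
    trajectory = [(prey0, pred0)]
    for _ in range(steps):
        trajectory.append(rk4_step(*trajectory[-1], dt, params))
    return trajectory

def invariant(prey, pred, alpha, beta, delta, gamma):
    """Quantity conserved along exact Lotka-Volterra orbits."""
    return delta * prey - gamma * math.log(prey) + beta * pred - alpha * math.log(pred)
```

Because exact orbits conserve `invariant`, the ideal cycles are closed curves with bounded populations and a phase lag between prey and predators, which are the features the abstract's loss is said to reward.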

The Moving Drone: Negotiating Agency Between the Voice and the Virtual

Nithya Shikarpur, Victor Arul, Anna Huang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13640v1 Announce Type: new Abstract: Melodic material in Hindustani music is presented in relation to a tonic, usually sustained by the tanpura, a four-stringed drone instrument. Rooted in Hindustani music, 'The Moving Drone' sets the traditionally static drone into motion; over the course of the performance, the drone gains increasing agency, transitioning from reactive to more proactive roles. The work employs four independent loopers in Max/MSP to function as 'virtual' drones. They are populated cyclically in real time as the vocalist improvises, creating an organic and evolving feedback loop between the voice and the virtual drone. This relationship further evolves melodically by pitch shifting the loops, which introduces a dimension of sudden, explicit movement. Then it changes timbrally, via the integration of GaMaDHaNi, a singer-conditioned pitch-to-voice generative AI model, to resynthesize looped audio. While current music AI approaches prioritize high fidelity and realism of generated content, which has sparked anxiety over job replacement in the music community, this work intentionally utilizes low-fidelity generative outputs, further necessitating human interpretation and situational context in order to be complete. 'The Moving Drone' positions technology and generative AI within established socio-cultural musical practices, proposing a virtual drone as an active, responsive, and co-creative musical agent.

Recursive Agent Harnesses

Elias Lumer, Sahil Sen, Kevin Paul, Vamse Kumar Subbiah — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13643v1 Announce Type: new Abstract: Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anthropic's dynamic workflows. We name and study the pattern between these two lines of work, where the recursive unit is a full agent harness with filesystem tools, code execution, and planning rather than a model call with no tools. We call this the Recursive Agent Harness (RAH) and frame it as harness recursion, the code-first extension to the model recursion of RLMs. A parent agent generates and runs an executable script that spawns subagent harnesses in parallel for fine-grained workloads and uses structured function calls for small subtasks. We provide a controlled evaluation on long-context reasoning. With the backbone held fixed at GPT-5 to match the published Codex and RLM baselines, RAH improves the Codex coding-agent baseline from 71.75% to 81.36% on Oolong-Synthetic (199 samples, 13 context-length buckets up to 4M tokens), a gain attributable to the harness rather than the model. With a stronger backbone, Claude Sonnet 4.5, the same design reaches 89.77%.

Surflo: Consistent 3D Surface Flow Model with Global State

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13644v1 Announce Type: new Abstract: Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps that grow linearly with input count, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens (one global state) and decodes oriented 3D surface points by independently transporting them from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13647v1 Announce Type: new Abstract: We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types, nearly 4× the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally deployable Slovak embeddings, we develop e5-sk-small (45M parameters) and e5-sk-large (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

Nathaniel Bottman, Yinhong Liu, Kyle Richardson — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13649v1 Announce Type: new Abstract: Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model's direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \in [0.86, 0.94]$, all $p \leq 0.0004$), and is the only signal we evaluate with $r \geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.
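
The consistency check described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ask` is a hypothetical callable standing in for a model, the decomposition is assumed to be an ordered list of sub-question templates with a `{prev}` placeholder, and normalized exact match stands in for whatever answer-equivalence test the paper uses.

```python
from typing import Callable, List

Ask = Callable[[str], str]  # hypothetical model interface: question -> answer


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(answer.lower().split())


def composed_answer(ask: Ask, sub_questions: List[str]) -> str:
    """Answer the sub-questions in order, substituting each answer into the next."""
    prev = ""
    for template in sub_questions:
        prev = ask(template.format(prev=prev))
    return prev


def operadic_consistency(ask: Ask, question: str, sub_questions: List[str]) -> bool:
    """True when the direct answer agrees with the answer composed from the parts."""
    direct = ask(question)
    composed = composed_answer(ask, sub_questions)
    return normalize(direct) == normalize(composed)


# Toy stand-ins for a model (assumption: lookup tables, not a real LLM).
CONSISTENT = {
    "What is the capital of the country where the Eiffel Tower is?": "Paris",
    "In which country is the Eiffel Tower?": "France",
    "What is the capital of France?": "Paris",
}
INCONSISTENT = dict(CONSISTENT)
INCONSISTENT["What is the capital of the country where the Eiffel Tower is?"] = "Lyon"

QUESTION = "What is the capital of the country where the Eiffel Tower is?"
PARTS = ["In which country is the Eiffel Tower?", "What is the capital of {prev}?"]

consistent_flag = operadic_consistency(CONSISTENT.__getitem__, QUESTION, PARTS)
inconsistent_flag = operadic_consistency(INCONSISTENT.__getitem__, QUESTION, PARTS)
```

Note that the signal needs no ground-truth label: the second toy model is flagged only because its direct answer disagrees with its own composed answer.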

World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13652v1 Announce Type: new Abstract: Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.

Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang, Jenq-Neng Hwang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13655v1 Announce Type: new Abstract: We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.

Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Guo Yu, Wenlin Liu, Yulan Hu, Hao-Xuan Ma, Jun-Peng Jiang, Han-Jia Ye — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13657v1 Announce Type: new Abstract: On-policy distillation (OPD) has recently become a prominent post-training recipe because it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision. Yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse. They are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales where AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.

Before You Think: System 0, AI-Mediated Cognition and Cognitive Colonization

Marianna Bergamaschi Ganapini, Massimo Chiriatti, Enrico Panai, Giuseppe Riva — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13658v1 Announce Type: new Abstract: This paper examines three recent frameworks for understanding the cognitive and epistemic consequences of artificial intelligence: Tri-System Theory, Thinkframes, and System 0. It argues that while the first two capture important dimensions of AI's influence on individual reasoning and collective epistemic practices, System 0 occupies a theoretically distinctive position that neither can fully replicate. The paper introduces the concept of cognitive colonization, according to which AI systems can embed external interests within the architecture of the self in ways that are difficult for users to perceive. Because such systems are already widely deployed, understanding these invisible forms of influence is an urgent philosophical and practical task.

Specifying Hardware Communication as Programs

Ernest Ng, Nikil Shyamsunder, Francis Pham, Adrian Sampson, Kevin Laeufer — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13659v1 Announce Type: new Abstract: To test and debug hardware modules, it is common to write two programs: a driver, which translates high-level transactions into interactions on the module's input and output signals, and a monitor, which analyzes a signal-level execution trace and recognizes a transaction. These two programs are commonly implemented separately for each hardware protocol, but this separation entails manual effort and risks inconsistencies. We advocate an alternative approach. We present a DSL in which users specify hardware communication protocols as succinct imperative programs. Crucially, the same specification can be used to both drive designs and monitor transactions. We present the design of a tool, which given a specification in our DSL and a waveform, automatically infers a transaction-level trace consistent with the waveform. We discuss plans to evaluate our DSL on real-world interconnects such as Wishbone and AXI-Stream.

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13662v1 Announce Type: new Abstract: LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13663v1 Announce Type: new Abstract: Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an execution-granularity mismatch: locally deterministic tool workflows are unfolded into repeated model-visible decisions, consuming context and forcing the model to manage low-level dataflow in the trace. We introduce HyperTool, a unified executable MCP-style tool interface that changes the model-visible unit of tool execution. A model invokes HyperTool with a code block that can call existing tools through their original schemas, manipulate returned values, and pass intermediate results locally, folding deterministic tool subroutines into a single outer call. To train models to use this interface, we synthesize HyperTool-format trajectories from cross-tool compositional tasks and verify them in real MCP environments. On MCP-Universe, HyperTool improves average accuracy from 15.69% to 35.29% on Qwen3-32B and from 9.93% to 33.33% on Qwen3-8B, and surpasses GPT-OSS and Kimi-k2.5 on average accuracy, showing that HyperTool can substantially improve multi-step tool use.

Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution

Dimitri Kachler, Damien Sileo, Pascal Denis — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13668v1 Announce Type: new Abstract: As the capabilities of Large Language Models (LLMs) grow, there has been an increasing push to curate high-quality datasets by filtering samples in the training data. In general, Data Attribution (DA) methods aim to estimate how individual samples in a training dataset can precondition a model to generate certain outputs. As an example, one might be interested in which samples in the data could be the source of toxic behavior after training the LLM. Many methods quantify this conditioning through the paradigm of influence functions. While methods of this family are effective, they lack the processing speed and storage compactness needed to be practical on large datasets. We propose a method, Influcoder, as a quick and cost-effective approach to influence-based Data Attribution at scale.

Agents-K1: Towards Agent-native Knowledge Orchestration

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13669v1 Announce Type: new Abstract: Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat "cites" edges, omitting key entities, claims, evidence, mechanisms, and method lineages essential for scientific reasoning. To this end, we introduce Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge graphs. Agents-K1 integrates three components under a unifying theoretical foundation: a multimodal parser whose five-module schema captures entities, multimodal evidence, citations, and typed inter-entity relations across the full paper rather than abstracts alone; a 4B information-extraction backbone trained with GRPO under a rule-based reward; and a graphanything CLI, a tri-source agent interface that unifies web search, multimodal graph retrieval, and cross-document traversal. On top of this, we process 2.46 million scientific papers across six subjects to produce Scholar-KG, of which we release a one-million-paper subset, and the full Scholar-KG is accessible via the SCP link below. The same pipeline can be extended to general-domain corpora and to schema-conformant data synthesis. Extensive experiments demonstrate that Agents-K1 achieves superior performance in scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning.

Automated reproducibility assessments in the social and behavioral sciences using large language models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13670v1 Announce Type: new Abstract: Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.
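
The effect-size recovery criterion quoted above (agreement within +/-0.05 in Cohen's d) can be illustrated with a short sketch. The pooled-standard-deviation formula for two independent groups is an assumption here; the abstract does not say how each study's d was computed, and the function names are illustrative only.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d for two independent groups using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def recovers_effect(original_d: float, reanalysis_d: float, tol: float = 0.05) -> bool:
    """True when the reanalysis lands within the tolerance of the original d."""
    return abs(original_d - reanalysis_d) <= tol


# Toy data (made up for illustration): means 4 and 3, pooled SD 2, so d = 0.5.
d_example = cohens_d([2, 4, 6], [1, 3, 5])
close_enough = recovers_effect(0.40, 0.42)
too_far = recovers_effect(0.40, 0.50)
```

Under this criterion a reanalysis can miss the numeric tolerance while still supporting the original claim, which is why the abstract reports effect-size recovery and qualitative agreement separately.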

Understanding Truncated Positional Encodings for Graph Neural Networks

James Flora, Mitchell Black, Weng-Keen Wong, Amir Nayyeri — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13671v1 Announce Type: new Abstract: Positional encodings (PEs) enhance the power of graph neural networks (GNNs), both theoretically and empirically. Two of the most popular families of PEs - spectral (e.g., Laplacian eigenspaces, effective resistance) and walk-based (polynomials of the adjacency matrix) - are theoretically equivalent in expressive power, with expressivity between the 1-WL and 3-WL tests. However, this equivalence assumes the GNN uses the "complete" version of these PEs, which requires $O(n^3)$ time and space complexity. Instead, practitioners commonly use truncated variants of these encodings, such as the first $k$ eigenspaces or powers of the adjacency matrix. However, the theoretical properties of these truncated PEs are unknown. In this work, we initiate the study of these truncated PEs. Theoretically, we show that, under truncation, several families of PEs are fundamentally different in expressive power. As a corollary, we show that truncated spectral PEs are no longer stronger than the 1-WL test. We also study a family of spectral PEs, the $k$-harmonic distances, to highlight the differences in expressive power of even closely related truncated PEs. Finally, we experimentally show that a mix of truncated PEs is preferable to any single family on real-world datasets.
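
To make the notion of truncation concrete, here is a minimal sketch of one walk-based encoding: each node's closed-walk counts, taken from the diagonals of the first k powers of the adjacency matrix. This particular encoding and the function names are illustrative assumptions, not necessarily the variants analyzed in the paper.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain square-matrix product; fine for small illustrative graphs."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)] for i in range(n)]


def adjacency(n: int, edges: List[tuple]) -> Matrix:
    """Symmetric 0/1 adjacency matrix of an undirected graph on n nodes."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1
    return adj


def truncated_walk_pe(adj: Matrix, k: int) -> List[List[int]]:
    """Per-node features [diag(A^1), ..., diag(A^k)]: closed walks of length 1..k."""
    n = len(adj)
    power = [[int(i == j) for j in range(n)] for i in range(n)]  # identity = A^0
    pe: List[List[int]] = [[] for _ in range(n)]
    for _ in range(k):
        power = matmul(power, adj)
        for v in range(n):
            pe[v].append(power[v][v])
    return pe


# Two 2-regular graphs on six nodes: one hexagon versus two disjoint triangles.
hexagon = adjacency(6, [(i, (i + 1) % 6) for i in range(6)])
triangles = adjacency(6, [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)])

same_at_k2 = truncated_walk_pe(hexagon, 2) == truncated_walk_pe(triangles, 2)
same_at_k3 = truncated_walk_pe(hexagon, 3) == truncated_walk_pe(triangles, 3)
```

With k = 2 the two graphs receive identical encodings, since both are 2-regular; only at k = 3 do the triangles' closed 3-walks separate them. The truncation level therefore changes what the encoding can distinguish.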

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

Arnav Kumar Jain, Yilin Wu, Jesse Farebrother, Gokul Swamy, Andrea Bajcsy — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13672v1 Announce Type: new Abstract: The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching: policy evaluation, policy improvement, and test-time planning, all with limited real-world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: (i) fidelity (i.e., producing simulated trajectories that correlate with reality), (ii) consistency (i.e., producing simulated trajectories that are coherent over long horizons), and (iii) efficiency (i.e., producing simulated trajectories quickly). We propose WEAVER (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state-of-the-art results on robotic manipulation tasks. WEAVER is a multi-view WM trained to predict future latents and reward values via a flow-matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long-horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply WEAVER on robotic hardware, demonstrating its effectiveness at policy evaluation ($\rho = 0.870$ correlation with real-world success rate), policy improvement (real-world success rate improvement of 38% on top of the $\pi_{0.5}$ robot foundation model), and test-time planning (real-world success rate improvement of 14% with a 5-10× speedup over prior WMs). WEAVER also demonstrates better performance than prior WMs when evaluated on out-of-distribution scenarios. Code, models, and videos at: https://arnavkj1995.github.io/WEAVER/ .

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13673v1 Announce Type: new Abstract: Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13674v1 Announce Type: new Abstract: This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.

Improving Robotic Generalist Policies via Flow Reversal Steering

Andy Tang, William Chen, Andrew Wagenmaker, Chelsea Finn, Sergey Levine — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13675v1 Announce Type: new Abstract: Generalist policies can learn a wide range of skills from diverse robot datasets. In order to solve or improve on challenging new tasks, we need a way to infer and invoke the appropriate actions from the policy's rich behavioral prior, especially when directly commanding the policy fails. We focus on flow matching generalists and propose Flow Reversal Steering (FRS): a method that takes suboptimal but "reasonable" actions, finds their latent noises by passing them through the flow policy in reverse, and maps them to nearby generalist action modes. We evaluate FRS across many simulated and real-world manipulation settings. First, FRS can turn coarse semantic guidance from humans or vision-language models (VLMs) into corresponding good robot actions, improving zero-shot control. These gains can be distilled with behavioral cloning by training an auxiliary policy to output noises that the generalist maps to good actions, showing up to 95% absolute task success rate boosts in under a minute of training. Finally, FRS enables policy improvement by bootstrapping reinforcement learning with semantic knowledge, improving on several tasks that standard RL fails to improve on.
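
The "flow in reverse" step can be illustrated with a toy one-dimensional example. This sketch covers only the inversion, recovering a latent noise from a given action by integrating the flow ODE backward in time. The velocity field, the Euler integrator, and all names are assumptions for illustration; the abstract does not specify how the recovered noise is then mapped to nearby action modes, so that part is left out.

```python
from typing import Callable

Velocity = Callable[[float, float], float]  # v(x, t): hypothetical learned flow field


def integrate_forward(noise: float, velocity: Velocity, steps: int = 1000) -> float:
    """Euler-integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action)."""
    x, dt = noise, 1.0 / steps
    for i in range(steps):
        x += dt * velocity(x, i * dt)
    return x


def integrate_reverse(action: float, velocity: Velocity, steps: int = 1000) -> float:
    """Euler-integrate the same ODE backward from t=1 (action) to t=0 (noise)."""
    x, dt = action, 1.0 / steps
    for i in range(steps, 0, -1):
        x -= dt * velocity(x, i * dt)
    return x


def toy_velocity(x: float, t: float) -> float:
    """Toy stand-in for a flow policy: pulls samples toward a single mode at 1.0."""
    return 1.0 - x


suboptimal_action = 0.3
latent_noise = integrate_reverse(suboptimal_action, toy_velocity)
round_trip = integrate_forward(latent_noise, toy_velocity)
```

Running the recovered noise forward again approximately reproduces the original action, up to discretization error. This is the property that lets a given action be expressed in the policy's noise space.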

Modality Forcing for Scalable Spatial Generation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13676v1 Announce Type: new Abstract: Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. https://modality-forcing.github.io/

Mana: Dexterous Manipulation of Articulated Tools

Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13677v1 Announce Type: new Abstract: Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (<1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.

InterleaveThinker: Reinforcing Agentic Interleaved Generation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13679v1 Announce Type: new Abstract: Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k datasets to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose an accuracy reward and a step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13680v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited for complex reasoning tasks: a semantically similar problem may demand an entirely different solution strategy, while a superficially different problem may share the same underlying reasoning pattern. We propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. RA-RFT uses gold-relevance distillation to train a retriever that ranks contexts by expected reasoning benefit rather than semantic overlap, and then fine-tunes the policy model via reinforcement fine-tuning methods with retrieved analogous demonstrations, so the model learns to leverage reasoning traces under verifiable outcome rewards. We further analyze the diversity of retrieved contexts and find that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct reasoning scaffolds for individual problems. Across challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning methods. For example, it improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively -- suggesting that reasoning-aware retrieval is a complementary axis of improvement and orthogonal to advances in reward design or training curricula.

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13681v1 Announce Type: new Abstract: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.

Analog Quantum Asynchronous Event-Based Graph Neural Network

Kristian Sotirov, Shaheen Acheche, Antonio A. Gentile, Osvaldo Simeone — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11000v1 Announce Type: cross Abstract: Asynchronous, event-based graph neural networks (AEGNNs) have recently emerged as an efficient paradigm for processing the sparse and high-temporal-resolution data from event cameras. In this paper, we propose quantum analog AEGNNs (QA-AEGNNs), a novel framework to implement an AEGNN on a neutral-atom quantum computer. Neutral-atom quantum processors offer a programmable analog quantum computing platform based on controllable Rydberg-atom interactions. To this end, we map the streaming event data to an array of trapped neutral atoms, where each atom represents a graph node (event) and is positioned such that geometric proximity reflects the spatio-temporal neighborhood of events. The native Rydberg Hamiltonian of the quantum processor is programmed to mirror the message-passing computations of the AEGNN, with atomic qubit states serving as node feature embeddings and inter-atom interactions realizing graph edges. Furthermore, we propose a hybrid quantum-classical training scheme in which the analog Hamiltonian parameters (e.g., laser pulse amplitudes and detunings) are optimized using classical feedback to learn the quantum AEGNN model from data. Our approach leverages the continuous Hamiltonian dynamics and massive parallelism of neutral-atom quantum systems to natively execute event-based graph computations with potential accuracy improvements

Handbook of Error-Correcting Codes

Victor V. Albert, Philippe Faist — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11484v1 Announce Type: cross Abstract: Barcode scans, clear phone calls, reliable data storage, satellite communication, and large-scale quantum computation are all made possible by error correction. We present a handbook version of The Error Correction Zoo, a curated reference of methods for protecting classical or quantum information from errors during storage and transmission. The handbook includes descriptions of these error-correcting codes and a classification according to the symbols they use. It also catalogues relations among codes and related objects such as sphere packings, lattices, designs, groups, and classical and quantum phases of matter. The collection is intended both as a rigorous reference and as a practical aid for tracing the web of code relationships and uncovering new connections.

SAT, MaxSAT, and SMT for QLDPC Distance Computation: A Large-Scale Empirical Study

Yu-Fang Chen, Seyed Mohammad Reza Jafari, Ching-Yi Lai — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12445v1 Announce Type: cross Abstract: Exact distance computation for quantum LDPC (QLDPC) codes plays a central role in validating candidate fault-tolerant quantum-code constructions, yet the computational structure of this problem remains poorly understood. Despite substantial recent progress in QLDPC design, it remains unclear which algorithmic principles govern the practical scalability of exact distance computation and which classes of exact solvers are best suited to this task. To address these questions, we conduct a systematic study of SAT- and MaxSAT-based formulations for exact QLDPC distance computation across representative codes. We further compare these formulations against several established exact-distance approaches in order to better understand the algorithmic landscape of exact QLDPC distance computation. Our study challenges and refines several prevailing intuitions about exact QLDPC distance computation. First, despite the XOR-rich structure of QLDPC parity checks, practical scalability appears to be governed more by the handling of cardinality constraints and optimization bounds than by parity reasoning alone. Accordingly, XOR-aware reasoning does not provide a systematic advantage across our benchmark suite. Second, Brouwer-Zimmermann-style search, long regarded as the benchmark paradigm for exact distance computation in sparse classical codes, no longer maintains its traditional scalability advantage in the QLDPC setting. This finding challenges the expectation that techniques successful for sparse classical codes remain dominant for QLDPC codes. Third, substantial qualitative differences arise even among MaxSAT solvers themselves. Branch-and-bound MaxSAT significantly outperforms unsat-core-based MaxSAT on challenging benchmarks, demonstrating that solver architecture and optimization strategy play a decisive role in practical scalability.

Forward-Time Black-Scholes Reconstruction via Regularized Legendre Reduction

Phuong M. Nguyen, Matt Nguyen, Loc H. Nguyen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12450v1 Announce Type: cross Abstract: We study a forward-time formulation of the Black-Scholes equation with state-dependent volatility. In contrast to the classical terminal-value pricing problem, where the option payoff is prescribed at maturity and the price is computed backward in time, the present problem prescribes the current option-price profile and seeks to recover the option-price profile at the expiration date T. This formulation is ill-posed, since the equation evolves in the unstable direction of the parabolic operator and high-frequency perturbations in the initial data may be strongly amplified. To address this difficulty, we introduce a price-dimensional reduction based on shifted Legendre polynomials. The original Black-Scholes equation is projected onto a finite-dimensional Legendre basis in the asset-price variable, leading to a system of ordinary differential equations in time for the expansion coefficients. This reduction acts as a spectral cutoff and also relaxes the degeneracy caused by the factor S^2 at the zero-price boundary. The main reconstruction method is a dimension-reduced Legendre--Tikhonov method. We prove existence, uniqueness, data stability, and convergence for each fixed truncation level. We also include a reduced PINN solver as a secondary computational comparison after the Legendre reduction. Numerical experiments with smooth, butterfly-spread, and European put payoffs show that the Legendre--Tikhonov method recovers the terminal option-price profile from noisy initial data, while the reduced PINN solver provides a useful additional benchmark. Comparisons with the conventional physical-space quasi-reversibility method demonstrate the stabilizing effect of the Legendre reduction.

Identifiability Without Gaussianity: Symbolic World Models and Near-Infinite Temporal Consistency

Seth Dobrin, Łukasz Chmiel — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12471v1 Announce Type: cross Abstract: Klindt, LeCun, and Balestriero (arXiv:2605.26379) proved that Joint-Embedding Predictive Architectures (JEPAs) achieve linear identifiability, the linear recovery of the world's true latent variables, if and only if the world's latent dynamics follow a Gaussian, stationary process. This Gaussian boundary implies a fundamental limit on temporal consistency: for any non-Gaussian physical system, the representation error of a statistical World Model grows monotonically with time. We prove that this limit is an artifact of the statistical alignment mechanism, not a property of World Models in general. We introduce the Physics-Grounded Symbolic Architecture (PGSA) and prove three results: (1) a PGSA achieves exact linear identifiability for all physical regimes, regardless of the latent distribution; (2) the per-step error of a PGSA is bounded by numerical precision alone; and (3) as a direct consequence, a PGSA maintains temporal consistency for an unbounded number of transitions, a property we term near-infinite temporal consistency. We further prove that statistical World Models cannot achieve this property for any non-Gaussian system, regardless of model capacity or the volume of training data. The algebraic cores of four of the theorems are formalized in Lean 4 with Mathlib4 v4.31.0 (zero sorry placeholders); the Klindt et al. converse is taken as an external premise. The contrast establishes that symbolic grounding in the causal generator of the world's dynamics is the sufficient condition and, in non-Gaussian regimes, the only condition for near-infinite temporal consistency.

A Mathematical Theory of Value: a synthesis on goal-directed agency under resource constraints

Cheng Qian — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12502v1 Announce Type: cross Abstract: We propose that value -- the quantity goal-directed agents create, destroy, and exchange -- is a lawful structural quantity in the same category as information. Following Shannon's method, we make one ruthless abstraction: value is the rate at which an agent converts a resource into goal-progress, relative to a frame fixed by its goal. A scale-invariance axiom forces a logarithmic measure, $V=\sum_i k_i \ln e_i$; compounding of a reinvested resource forces the same form via the ergodicity argument of Peters (2019). The two routes are kin rather than independent; their agreement is a consistency check, not an over-determination. We derive a coding theorem of value: $\Delta G \le I(X;Y)$, achieved by Bayes-proportional allocation; realized value decomposes as $G=D(q\|r)-D(q\|p)$, identifying misalignment with measurable waste. For populations, value is frame-relative while price is frame-independent; a fleet that pools its resource and fuses its perception inherits the ceiling $G_{\mathrm{fleet}} \le I(X;Y_{1:m}) \le H(X)$ (a corollary; an earlier sum-form claim was wrong and is corrected in v5). A dynamical layer yields an is/ought asymmetry from which alignment emerges as a control-stability condition with a closed-form residual. We test the single-frame laws on live language models in a pre-registered scale-up: perception mutual information tracks realized capability rather than parameter count (Spearman $\rho = 0.977$ pooled over 30 model$\times$domain points), out-of-sample $\Delta G$ tracks $I(X;Y)$, and over-confidence is measurable dissipation; a further pre-registered test shows the bridge is shape-invariant across four task shapes ($n=42$, slope 0.953). None of the mechanisms is individually new -- generalized Kelly, Armstrong & Mindermann (2018), classical control; the contribution is their unification and the governance mapping (incentive design over oversight) that follows.
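
The decomposition $G=D(q\|r)-D(q\|p)$ quoted in the abstract can be checked numerically on a toy example. The sketch below is only an illustration: it assumes the classical Kelly reading, in which $q$ is the true outcome distribution, $r$ a reference (market-implied) distribution, and $p$ the agent's allocation. The abstract does not define these symbols, so that mapping and all numbers are assumptions.

```python
import math

def kl(a, b):
    """KL divergence D(a||b) in nats for discrete distributions."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def realized_growth(q, p, r):
    """Expected log-growth when allocating p against reference r while outcomes follow q."""
    return sum(qi * math.log(pi / ri) for qi, pi, ri in zip(q, p, r) if qi > 0)

# Hypothetical three-outcome example (not from the paper).
q = [0.5, 0.3, 0.2]          # assumed true outcome distribution
r = [1 / 3, 1 / 3, 1 / 3]    # assumed reference distribution
p = [0.4, 0.4, 0.2]          # assumed agent allocation

g_direct = realized_growth(q, p, r)   # growth computed from its definition
g_decomp = kl(q, r) - kl(q, p)        # available value minus misalignment waste
g_ceiling = realized_growth(q, q, r)  # allocating p = q removes the waste term
```

Under this reading, `kl(q, p)` plays the role of the measurable waste caused by misalignment, and it vanishes when the allocation matches the true distribution.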

Feature-preserving Latent-EnKF for Data Assimilation of Flows with Shocks

Hemanth Chandravamsi, Hangchuan Hu, Ponkrshnan Thiagarajan, Tamer A. Zaki — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12559v1 Announce Type: cross Abstract: The ensemble Kalman filter (EnKF) is widely adopted for sequential data assimilation, but fails for solutions with discontinuities, such as shocks in compressible flows. Uncertainty in shock location induces multimodal ensemble statistics that violate the Gaussian assumptions underlying the EnKF, producing large-scale spurious oscillations in the analysis state. We introduce a feature-preserving latent-EnKF that performs the ensemble update in a learned low-dimensional latent space, where shock and flow features admit a smooth manifold representation, thereby preserving sharp features during EnKF analysis. The updated latent state is mapped back to physical state through a shared decoder for all ensemble members. The algorithm eliminates the member-specific ordered training and positivity flooring used in prior approaches. Numerical experiments on a Sod shock tube and Mach 2 shock interaction with a 2D cylinder, using sparse and noisy observations, show accurate feature recovery of shocks and contact discontinuities without spurious oscillations.

Revisiting the ABCs of Working with AI: A Replication with Radiologists

Daniel Martin — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12585v1 Announce Type: cross Abstract: Artificial intelligence (AI) systems increasingly assist human experts, but the consequences of AI assistance on productivity can be heterogeneous. Caplin, Deming, S. Li, Martin, Marx, Weidmann, and Ye (2025b) provide evidence that two characteristics, ability and belief calibration, help to determine the returns to AI assistance. This note shows that their results replicate to a setting where professional radiologists analyze chest X-rays with access to state-of-the-art machine learning predictions. I leverage the public Collab-CXR data repository described by Moehring, Kutwal, Huang, Banerjee, Jacobi, Eber, Mendoza, Chung, Dayan, Gupta, Bui, Truong, Pareek, Langlotz, Lungren, Agarwal, Rajpurkar, and Salz (2025) and first analyzed for human-AI collaboration by Agarwal, Moehring, Rajpurkar, and Salz (2023). To faithfully reproduce the analysis in Caplin, Deming, S. Li, Martin, Marx, Weidmann, and Ye (2025b), I use the radiologist assessments from the repeated-case designs, which include 68 radiologists and 11,420 paired radiologist-patient-pathology observations. The results of this replication support the external validity of their core findings: lower baseline ability and higher calibration predict larger incremental value from AI.

Estimating Individualized Treatment Effects in Acute Ischemic Stroke with Causal Transformation Models (TRAM-DAG): A Multi-Centre Observational Study with External RCT Validation

Oliver Dürr, Lisa Herzog, Pascal Bühler, Susanne Wegener, Beate Sick — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12623v1 Announce Type: cross Abstract: Personalized medicine in acute ischemic stroke requires moving beyond average treatment effects (ATE) to individualized treatment effect (ITE) estimates to support treatment decisions. In acute ischemic stroke, mechanical thrombectomy has been shown to be more effective on average than lysis in randomized controlled trials (RCTs), such as the MR CLEAN study. We aim to identify which individual patients benefit most from mechanical thrombectomy compared to lysis. The outcome of interest is the modified Rankin Scale (mRS) at three months, an ordinal measure of functional disability (0: no symptoms, 6: death). We demonstrate that causal transformation models on directed acyclic graphs (TRAM-DAG) can be used for ITE estimation after being fitted on observational MAGIC multi-center stroke patient data. To ensure comparability with the MR CLEAN population, which we use for validation, we train the TRAM-DAG on a MAGIC sub-population with NIHSS at admission >= 6, corresponding to one inclusion criterion of MR CLEAN. The fitted model is then used to estimate ITEs for stroke patients in the MR CLEAN population. While these ITE estimates cannot be confirmed experimentally, we show that their average is consistent with the trial's reported ATE. Furthermore, the ITE estimates correctly rank trial patients by their observed frequency of a good outcome (mRS at three months <= 2). These findings support the use of causal models like TRAM-DAG for personalized decision-making in stroke care and highlight their ability to bridge the gap between observational evidence and clinical trials.

Epistemic Uncertainty Is Not the Reducible Kind

Robin Young — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12646v1 Announce Type: cross Abstract: The standard taxonomy of predictive uncertainty defines epistemic uncertainty as the part removable by collecting more data, while the standard measure identifies it with a mutual-information term. We prove the definition and the measure are extensionally inconsistent. On an explicit construction, the measure assigns all uncertainty to the epistemic class, yet no quantity of training data reduces it. Reducibility is instead a property of the pair (uncertainty, acquisition class), and the dichotomy resolves into three parts: aleatoric, sample-reducible epistemic, and mechanism-reducible epistemic uncertainty. An exact identity for the value of an observation shows that in-distribution data never reduces mechanism-irreducible uncertainty and generically increases it. Ensemble disagreement, the deployed epistemic estimate, tracks the training procedure rather than the epistemic term. It collapses to zero beneath a positive truth under consistent training, and equals hyperparameter-scaled initialization noise under interpolation. A finite-sample falsification test and seed-swept experiments confirm the theory.
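
For readers unfamiliar with the mutual-information measure the abstract refers to, here is a minimal sketch of how it is commonly computed from an ensemble: the entropy of the averaged prediction minus the average entropy of the members. The function names and the two toy ensembles are hypothetical, and this is the textbook estimator, not necessarily the exact construction analyzed in the paper.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def mutual_information(member_probs):
    """Entropy of the mean prediction minus the mean entropy of the members."""
    n = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(m[j] for m in member_probs) / n for j in range(k)]
    total = entropy(mean)                                 # total predictive uncertainty
    expected = sum(entropy(m) for m in member_probs) / n  # aleatoric part
    return total - expected                               # term labeled "epistemic"

# Members agree on a 50/50 prediction: all uncertainty is counted as aleatoric.
mi_agree = mutual_information([[0.5, 0.5], [0.5, 0.5]])
# Members are individually certain but disagree: all uncertainty is counted as epistemic.
mi_disagree = mutual_information([[1.0, 0.0], [0.0, 1.0]])
```

The second case shows why ensemble disagreement drives this term: it depends on how the members differ, which is a property of the training procedure as much as of the data.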

Computationally tractable robust differentially private mean estimation

Kelly Ramsay — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12654v1 Announce Type: cross Abstract: We develop a new, differentially private mean estimator called the balloon mean. The main features of the balloon mean are that it is computationally tractable and enjoys robustness to outlying observations. It is based on an iterative clipping procedure over expanding Mahalanobis balls, or "balloons." The method satisfies zero-concentrated differential privacy and depends on a small number of interpretable tuning parameters. We provide theoretical guarantees under heavy-tailed and contaminated elliptical models, characterizing its statistical performance and robustness to outliers. Extensive simulations demonstrate that the balloon mean is robust to heavy-tailed and contaminated data, and outperforms existing differentially private mean estimators in contaminated settings.

A Communication Complexity Lower Bound for Nonuniformly Convex Consensus Optimization

Demyan Yarmoshik, Maxim Klimenko — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12675v1 Announce Type: cross Abstract: We study the communication complexity of convex decentralized optimization over time-varying networks, where $n$ nodes hold private functions and must agree on the global minimizer using only synchronous exchanges with neighbors. The cost is the number of communication rounds to reach accuracy $\varepsilon$ -- a measure akin to round complexity in the LOCAL model, but constrained by nodes sharing only oracle responses. We prove a new lower bound of $\Omega\!\left(\chi_{\mathcal G} \sqrt{\kappa_g}\,\log\frac{n}{\chi_{\mathcal G}}\log\frac1\varepsilon\right)$ communication rounds, where $\chi_{\mathcal G}$ is the condition number of the network Laplacians and $\kappa_g$ that of the global objective, showing the round complexity attainable under uniform regularity cannot be matched in the nonuniform regime. The construction rests on spectral graph theory: we embed time-rotating star gadgets into the edges of an expander and patch them to preserve spectral connectivity.

Quantum Reservoir Computing for Short-Term Power Load Forecasting in Resource-Constrained Energy Systems

Mansi Od, Param Pathak, Nouhaila Innan, Muhammad Shafique — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12806v1 Announce Type: cross Abstract: Short-term load forecasting is essential for reliable energy management, but practical deployment on edge devices requires models that remain accurate under limited memory, finite measurement budgets, and hardware noise. This work proposes a hardware-efficient Quantum Reservoir Computing (QRC) framework for energy load forecasting, where a fixed quantum reservoir transforms temporal input windows into high-dimensional features and only a classical Elastic Net readout is trained. To reduce deployment cost, the trained readout is compressed using post-training fixed-point quantization at bit widths from 8 to 2 bits. The framework is evaluated on the Tetouan and Spain energy load datasets under exact statevector simulation, 512-shot finite sampling, and realistic hardware-noise models from IBM FakeTorino and IBM FakeMarrakesh. Results show that 6-bit readout precision preserves full-precision forecasting performance while reducing readout memory by 81.2%. Below this point, degradation becomes dataset dependent, with Tetouan showing stronger sensitivity and Spain degrading more gradually. Hardware-noise validation further shows that the trained readout transfers to noisy reservoir states without retraining. These findings support quantized QRC as a resource-aware forecasting approach for near-term quantum time-series applications.
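
A rough sketch of post-training fixed-point quantization of a linear readout may help make the memory figure concrete. It assumes uniform symmetric quantization and a 32-bit floating-point baseline; neither detail is stated in the abstract, and the weights are made up.

```python
def quantize(weights, bits):
    """Uniform symmetric quantization; returns (integer codes, scale)."""
    levels = 2 ** (bits - 1) - 1                  # e.g. 31 for 6 bits
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / levels
    codes = [round(w / scale) for w in weights]   # signed integers in [-levels, levels]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real-valued weights."""
    return [c * scale for c in codes]

def memory_saving(bits, baseline_bits=32):
    """Fraction of readout memory saved relative to an assumed baseline width."""
    return 1.0 - bits / baseline_bits

readout = [0.8, -0.31, 0.05, -1.2]                # hypothetical readout weights
codes, scale = quantize(readout, bits=6)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(readout, restored))
saving_6bit = memory_saving(6)                    # 1 - 6/32
```

Under the 32-bit assumption, 6-bit storage saves 1 - 6/32 of the readout memory, roughly 81%, which is in line with the reduction reported in the abstract.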

Graph Reinforcement Learning for Calibration-Aware Quantum Circuit Routing

Yash Vardhan Tomar, Dheeraj Peddireddy, Vaneet Aggarwal — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12816v1 Announce Type: cross Abstract: Quantum circuit routing is a key step in compiling programs for noisy intermediate-scale quantum processors. Routes that appear efficient by standard overhead metrics can still lose fidelity when they pass through poorly calibrated couplers. We study a calibration-aware graph reinforcement-learning router that uses same-day IBM Heron r2 calibration data to choose hardware-edge SWAPs. We train the policy with proximal policy optimization and evaluate it with exact simulated fidelity across nine Munich Quantum Toolkit (MQT) Bench circuits and three calibration snapshots. Across these evaluations, pooled mean exact fidelity is $0.727$, compared with $0.440$ for SABRE-best20 and $0.481$ for target-aware SABRE. Fidelity gains come with higher routed two-qubit counts and are concentrated in the 5q and 8q circuit families; under the fixed tree action graph, all 10q families favor SABRE-best20. Overall, our results show that calibration-aware learned routing can improve fidelity beyond gate-count-driven compilation.

Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata

Daniel Soliman — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12824v1 Announce Type: cross Abstract: AI governance for medical imaging is formalizing: the 2026 ACR-SIIM Practice Parameter recommends local acceptance testing and ongoing drift monitoring, and the ACR Assess-AI registry monitors AI outputs using DICOM metadata for context. We argue that a necessary, currently unmonitored layer sits beneath output metrics: whether incoming studies remain within the acquisition envelope a model was validated on. Using a LUNA16-trained MONAI RetinaNet lung-nodule detector, we test whether acquisition state behaves as a structured, measurable variable. On real paired CT differing only in reconstruction kernel (NLST B30f vs B80f), kernel alone shifted AI-measured diameter and flipped a Fleischner size category in 5.2% (8 of 155) of nodules at fixed patient and acquisition, while detection confidence was unchanged (Wilcoxon p=0.22). Under controlled LIDC-IDRI perturbations the effects dissociated by axis: the noise axis degraded detection confidence (p=5.9e-32, concentrated in nodules under 6 mm) but not measurement, while the frequency/kernel axis corrupted measurement (p=8.6e-13) but not detection. A 4-feature pixel fingerprint recovered reconstruction identity (patient-level AUC about 0.95 on real CT, 0.995 on a QIBA phantom) where the ConvolutionKernel DICOM tag was uninformative (identical labels across reconstructions). The kernel axis transported across four manufacturers (leave-one-vendor-out AUC 0.94-0.98, matching the within-vendor ceiling). Acquisition state thus maps to distinct AI failure modes, frequency content to measurement reliability and noise to detection sensitivity, and is not recoverable from metadata. Acquisition-aware, input-side validation is the missing layer for the acceptance-testing and drift-monitoring requirements now entering imaging-AI accreditation.

Completely Independent Spanning Trees in $k$-Outerplanar Triangulated Discs

Toru Araki — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12827v1 Announce Type: cross Abstract: Let $T_{1}, T_{2}, \dots, T_{k}$ be $k$ spanning trees of a graph $G$. For any pair of vertices $u$ and $v$, if the $u$--$v$ paths in the $k$ spanning trees are pairwise openly disjoint, then the spanning trees are called completely independent spanning trees (CISTs) of $G$. In this paper, we first prove that every 3-connected 2-outerplanar triangulated disc has two completely independent spanning trees. Next, for a 3-connected 3-outerplanar triangulated disc $G$, we provide sufficient conditions for $G$ to have two completely independent spanning trees. We provide an example of a 3-connected 4-outerplanar triangulation that does not have two completely independent spanning trees.
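
The definition above can be checked by brute force on small graphs. The sketch below assumes each edge list really is a spanning tree of the vertex set, and takes "openly disjoint" to mean that two paths share no edge and no vertex other than their endpoints. The $K_4$ example and all function names are illustrative and not taken from the paper.

```python
from collections import deque
from itertools import combinations

def tree_path(edges, u, v):
    """Return the unique u-v path in a tree given as an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    parent = {u: None}
    queue = deque([u])
    while queue:                        # breadth-first search from u
        x = queue.popleft()
        if x == v:
            break
        for y in adj.get(x, []):
            if y not in parent:
                parent[y] = x
                queue.append(y)
    path = [v]
    while path[-1] != u:                # walk parent pointers back to u
        path.append(parent[path[-1]])
    return path[::-1]

def openly_disjoint(p1, p2):
    """True if the paths share no internal vertex and no edge."""
    inner1, inner2 = set(p1[1:-1]), set(p2[1:-1])
    edges1 = {frozenset(e) for e in zip(p1, p1[1:])}
    edges2 = {frozenset(e) for e in zip(p2, p2[1:])}
    return not (inner1 & inner2) and not (edges1 & edges2)

def are_cists(vertices, trees):
    """Check the CIST condition for every vertex pair and every pair of trees."""
    for u, v in combinations(vertices, 2):
        paths = [tree_path(t, u, v) for t in trees]
        if not all(openly_disjoint(a, b) for a, b in combinations(paths, 2)):
            return False
    return True

# Two spanning trees of the complete graph K4 on vertices 0..3.
t1 = [(0, 1), (0, 2), (1, 3)]
t2 = [(2, 3), (1, 2), (0, 3)]
k4_ok = are_cists(range(4), [t1, t2])
same_tree = are_cists(range(4), [t1, t1])
```

In the example, every vertex is an internal vertex of at most one of the two trees, which is why the pairwise paths never collide; using the same tree twice fails immediately because the paths coincide.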

OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

Danning Jiang, Zheming An, Yalong Zhao, Lipeng Lai — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12838v1 Announce Type: cross Abstract: Predicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computational biology and AI Virtual Cell (AIVC) modeling, with direct implications for drug discovery and the elucidation of gene regulatory networks. Existing approaches often rely on auxiliary cell-state encoders, hierarchical variational autoencoders, dedicated Transformer encoder-decoder modules, or gene-interaction priors to compress high-dimensional expression profiles into latent representations. While effective, these designs increase architectural complexity and may limit scalability and generalizability. This paper introduces OCOO-T, a minimalist flow-matching-based AIVC model for transcriptional perturbation response prediction. OCOO-T utilizes a vanilla Transformer stack that operates directly on continuous gene expression profiles and formulates perturbation response prediction as a continuous-time denoising process. Perturbation embeddings, dosage information, and cell-line/cell-type specificity are integrated through adaptive layer normalization and in-context tokens. Comprehensive evaluations on Tahoe100M, Replogle, and PBMC benchmarks demonstrate that OCOO-T achieves state-of-the-art performance across diverse perturbations and cell types while effectively scaling to long transcriptional profiles through patching and depatching of cellular contexts. By leveraging the simplicity of Transformer-based denoising for single-cell omics, OCOO-T provides an effective and scalable framework for in-silico cellular simulation.

Prediction-Powered Causal Inference by Automatic Debiased Machine Learning and Semi-Supervised Riesz Regression

Masahiro Kato — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12892v1 Announce Type: cross Abstract: This study investigates semiparametric efficient estimation of causal and structural parameters in a semi-supervised setting. In our setting, unlabeled auxiliary regressors are available in addition to labeled observations consisting of outcomes and regressors. Our goal is to construct estimators of causal and structural parameters whose asymptotic variances are smaller than those of estimators constructed using only labeled data. We refer to this framework as prediction-powered causal inference (PPCI). We first derive the efficient influence function and the efficiency bound, which imply that the use of auxiliary regressors can attain a smaller asymptotic variance than the efficiency bound attainable from labeled observations alone. Then, by combining the efficient influence function with the debiased machine learning (DML) framework, we propose methods that we call DML-PPCI. If we construct an estimating-equation estimator, we refer to the method as EE-DML-PPCI; if we construct a targeted-learning estimator, we refer to the method as TMLE-DML-PPCI. The asymptotic variances of both estimators match our derived efficiency bound. In the construction of the estimators, estimation of the efficient influence function plays an important role. In our study, the efficient influence function is also a Neyman orthogonal score, which depends on the Riesz representer and the regression function. For Riesz representer estimation, we develop semi-supervised generalized Riesz regression with convergence rate guarantees.

On perfect flag-rank metric codes

Gianira N. Alfarano, Usman Mushrraf, Ferdinando Zullo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12934v1 Announce Type: cross Abstract: Flag-rank-metric codes arise as a natural generalization of rank-metric codes in the context of network communication. While recent research has mainly focused on algebraic and structural properties of these codes, the combinatorial geometry underlying the flag-rank metric remains largely unexplored. In this paper, we initiate a detailed investigation of this geometry. We explicitly determine the size of spheres of small flag-rank radius in the space $\mathrm{U}(n,\mathbb{F}_q)$ of upper triangular matrices over the finite field $\mathbb{F}_q$, and consequently obtain formulas for the size of balls of radius at most $3$. Using these enumerative results, we derive a sphere-packing bound for flag-rank-metric codes and introduce the notion of perfect codes with respect to the flag-rank metric. We observe that no non-trivial perfect flag-rank-metric codes exist in $\mathrm{U}(n,\mathbb{F}_q)$ for $n\in\{2,3\}$. We then investigate the possible parameters of perfect codes in higher dimensions. For minimum distance $3$, we obtain a characterization in terms of the codimension of the code, and show that suitable maximum flag-rank distance codes with minimum distance $3$ yield non-trivial perfect codes. For minimum distances $5$ and $7$, we derive explicit quadratic and cubic conditions, respectively, that any perfect code must satisfy. Finally, using asymptotic estimates for balls of fixed radius, we prove that for fixed length $n$ and $\delta\in\{3,5,7,9,11\}$, perfect linear flag-rank-metric codes with minimum distance $\delta$ do not exist over $\mathbb{F}_q$ for all sufficiently large $q$.

Quantum-Driven Neuromorphic Computing for Million-Qubit-Scale Workloads

Adams Ivanov, Samer Rahmeh, Erick Giovani Sperandio Nascimento, Daniela Herrmann — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12968v1 Announce Type: cross Abstract: We introduce Apollo, a 10000 node p-qubit neuromorphic processor fabricated in 16 nm mixed signal CMOS and operating fully at room temperature with a typical analog core power envelope of about 0.5 W. Its fundamental element, the p-qubit, is a bistable stochastic unit whose continuous time state fluctuations are driven by integrated quantum entropy units that inject true quantum derived randomness. This enables ultrafast stochastic transitions at low energy while preserving a classical state representation. Apollo combines these p-qubits with a high degree Hyperion 256 interconnect topology, allowing efficient embedding of dense Ising and QUBO problems with substantially reduced minor embedding overhead compared with sparse annealing platforms. We show that, through the Suzuki Trotter correspondence, the equilibrium statistics and annealing dynamics of the p-qubit network reproduce key properties of transverse field quantum annealing without cryogenic cooling, long lived coherence, or microwave control. Beyond device level validation, Apollo is evaluated on a three dimensional spin glass benchmark previously used to study quantum advantage in superconducting annealers. Across 300 disorder realizations, Apollo reaches substantially lower ground state energies than reported cryogenic quantum annealing hardware, while remaining distinct from classical simulated annealing and simulated quantum annealing. A 350 nm release candidate device experimentally validates the core p-qubit dynamics, thermodynamic sampling correctness, and continuous time annealing behavior. These results establish Apollo as a room temperature, industrially scalable platform for quantum driven energy based optimization, probabilistic inference, generative modeling, and hybrid classical quantum workflows.

Deep Sleep Classification via EEG Signal Criticality: A Passive BCI Approach for Sleep-Improvement Neurofeedback

Stanisław Narębski, Tomasz Komendziński, Tomasz M. Rutkowski — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13017v1 Announce Type: cross Abstract: Automated sleep staging is a fundamental application of passive Brain-Computer Interfaces (pBCI), decoding spontaneous neural states to enable closed-loop interventions independent of user intent. This study evaluates criticality features derived from Detrended Fluctuation Analysis (DFA) for the specific identification of deep sleep (N3). We analyzed $347,232$ EEG epochs from $290$ older women using UMAP manifold learning to visualize state transitions. Subsequently, six classifiers were benchmarked via 10-fold cross-validation, using balanced accuracy to determine the optimal "state-sensing" engine for neurofeedback. Naive Bayes achieved the highest mean balanced accuracy ($87.17\% \pm 0.24\%$), significantly outperforming a fully connected deep neural network (FNN: $81.58\%$) and Random Forest ($80.97\%$). Linear models (LDA: $57.21\%$; SVM: $51.01\%$) performed poorly, indicating that DFA-derived criticality features reside on a distinct, non-linear manifold. Probabilistic decoding of EEG criticality provides a high-accuracy sensing mechanism for pBCIs. This robust classification pipeline supports the development of state-dependent neurofeedback, such as targeted auditory stimulation, to enhance cognitive recovery.
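
For readers unfamiliar with Detrended Fluctuation Analysis, the following is a minimal sketch of the textbook DFA scaling exponent, not the authors' pipeline. The function name `dfa_exponent`, the window sizes, and the white-noise test signal are assumptions chosen for illustration; the abstract does not specify the EEG preprocessing or the exact criticality features used.

```python
import numpy as np

def dfa_exponent(signal, window_sizes):
    """Estimate the DFA scaling exponent of a 1-D signal (illustrative helper)."""
    x = np.asarray(signal, dtype=float)
    # Step 1: integrate the mean-centered signal to obtain the "profile".
    profile = np.cumsum(x - x.mean())
    fluctuations = []
    for n in window_sizes:
        # Step 2: cut the profile into non-overlapping windows of length n.
        n_windows = len(profile) // n
        segments = profile[: n_windows * n].reshape(n_windows, n)
        t = np.arange(n)
        # Step 3: fit and remove a linear trend in every window at once.
        coeffs = np.polyfit(t, segments.T, deg=1)      # shape (2, n_windows)
        trends = np.outer(t, coeffs[0]) + coeffs[1]    # shape (n, n_windows)
        residuals = segments.T - trends
        # Step 4: root-mean-square fluctuation at this scale.
        fluctuations.append(np.sqrt(np.mean(residuals ** 2)))
    # Step 5: the exponent is the slope of log F(n) against log n.
    slope, _ = np.polyfit(np.log(window_sizes), np.log(fluctuations), deg=1)
    return slope

# Assumed test signal: uncorrelated white noise, for which the exponent
# is expected to be close to 0.5.
rng = np.random.default_rng(0)
alpha = dfa_exponent(rng.standard_normal(4096), [16, 32, 64, 128, 256])
print(f"DFA exponent for white noise: {alpha:.2f}")
```

An exponent near 0.5 indicates uncorrelated fluctuations, while values approaching 1 indicate long-range temporal correlations, which is the property that criticality-based features typically exploit.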

A solvable model for unsupervised federated learning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13045v1 Announce Type: cross Abstract: We introduce a theoretical framework for analyzing federated learning in a generative setting through a teacher-multiple interacting students scenario, in which each student receives a distinct realization of the data, either through a different noise corruption or by accessing a different subset, possibly of varying size. Using theoretical tools from equilibrium disordered systems, we analytically show that interactions among students systematically enhance learning performance: highly noisy students require fewer samples to recover the underlying pattern, while low-noise students achieve a larger overlap with the ground-truth signal. We derive the optimal Bayesian conditions for teacher recovery as functions of the sample complexity, noise level, and interaction strength, and validate these predictions through numerical simulations. The resulting dynamics can be mapped onto equilibrium sampling in a Restricted Boltzmann Machine with a structured hidden layer, providing a principled theoretical understanding of how interactions improve distributed generative modeling.

Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13095v1 Announce Type: cross Abstract: Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require large-scale multi-talker corpora that are costly to annotate. In this paper, we investigate how to efficiently train an LLM-based system with limited real-recorded data while maintaining high accuracy in speaker attribution. We propose several strategies: (1) a dual-encoder architecture to extract semantic and speaker features, (2) a feature interleaving format to merge these features as the inputs to the LLM, (3) a length-aware speaker ID loss to enhance diarization capability, and (4) an adaptive threshold strategy for ASR loss computation to mitigate hallucinations caused by speech overlaps. These strategies balance training between ASR and diarization tasks. Our system outperforms open-source baseline approaches, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.

Generating Training Targets for Real-World Speech Enhancement via Close-to-Distant Microphone Projection

Tomohiro Nakatani, Rintaro Ikeshita, Naoyuki Kamo, Marc Delcroix, Shoko Araki — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13109v1 Announce Type: cross Abstract: Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly limits SE accuracy. To address this issue, we propose Close-to-Distant microphone Projection (C2D projection), a method that generates paired data from real recordings captured by close and distant microphones. C2D projection estimates an optimal projection matrix that transforms close-microphone inputs into clean reference signals aligned with distant-microphone recordings, while simultaneously performing denoising. We show this projection can be effectively realized using a variant of the Parametric Multichannel Wiener Filter (PMWF). Experimental results demonstrate that an NN trained with C2D-projected data outperforms the state-of-the-art Guided Source Separation (GSS) on the challenging CHiME6 dinner party ASR task under oracle diarization, when using the enhanced output from GSS as an auxiliary input to the NN.

Robust State-Conditional Feature-Weighted Jump Models for Temporal Clustering

Federico P. Cortese, Alessio Farcomeni — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13146v1 Announce Type: cross Abstract: We propose a robust feature-weighted jump model for time-dependent clustering. A penalty is used to encourage smoothness of transitions over time, while robustness is achieved through the use of a Tukey's biweight loss function. An additional parameter controls the variability of feature weights across states, allowing the model to assign state-specific relevance to each feature. We illustrate in simulation how the method accurately recovers the true cluster sequence and reliably identifies relevant features, outperforming competing approaches, particularly in the presence of outliers. We conclude with two empirical applications, one on the number of conflict-related homicides in Kosovo in the period 1998-2000, and another on macroeconomic performance of twelve European countries in the period 1949-2024.
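
As background for the robustness mechanism mentioned above, here is a minimal sketch of the standard Tukey biweight loss. The function name and the tuning constant `c = 4.685` (a common default for Gaussian errors) are assumptions; the abstract does not say how the loss is scaled or combined with the jump penalty.

```python
import numpy as np

def tukey_biweight(r, c=4.685):
    """Standard Tukey biweight loss; c is an assumed tuning constant."""
    r = np.asarray(r, dtype=float)
    # Scaled absolute residual, capped at 1 so the loss saturates beyond |r| = c.
    u = np.clip(np.abs(r) / c, 0.0, 1.0)
    return (c ** 2 / 6.0) * (1.0 - (1.0 - u ** 2) ** 3)

residuals = np.array([0.0, 0.5, 2.0, 10.0, 1000.0])
print(tukey_biweight(residuals))
```

For small residuals the loss behaves like the squared error r^2 / 2, while every residual larger than c contributes the same constant c^2 / 6, so a gross outlier cannot dominate the fit.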

A Dual-Mode Faust-to-CLAP Compilation System

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13193v1 Announce Type: cross Abstract: We describe faust2clap, a framework establishing the first officially maintained compilation pathway from Faust DSP specifications to the CLAP format. The system operates in two different modes. A static mode employs ahead-of-time compilation to yield native binaries of optimal efficiency, while a dynamic mode uses runtime interpretation to permit DSP code modification without interrupting the host application. This latter capability addresses a persistent friction in audio software development, namely the cumulative overhead of the edit, compile, and reload cycle. We detail the algorithmic machinery underlying both modes, focusing specifically on the problem of parameter identity. To preserve both parameter values and their bindings to host automation across structural DSP mutations, we introduce an address-based identity matching algorithm and a stable slot allocation scheme. The implementation, comprising approximately 2,400 lines of C++ architecture and Python tooling code, has been integrated into the main Faust distribution.

Switching Hamiltonian Monte Carlo for sampling from mixture distributions

A. Sharma — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13234v1 Announce Type: cross Abstract: We introduce a switching Hamiltonian Monte Carlo method for sampling from finite mixture Boltzmann-Gibbs distributions. We propose symmetric numerical integrators to approximate switching Hamiltonian dynamics interlaced with Poisson jumps, where the regime-switching chain is simulated using the uniformization technique or the stochastic simulation algorithm. We prove geometric ergodicity of the resulting Markov chain. We develop an approach based on the discrete Poisson equation associated with numerical schemes to estimate the error in computing ergodic averages. Using this approach, we prove that the proposed numerical integrators have second-order bias. This approach is simple and can be generalized to other settings, for example, kinetic Langevin equations. Finally, we verify the convergence result via a numerical experiment.

Finite-Query Collapse and Modal Exact Bases in the SCI Hierarchy

Christopher Sorg — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13250v1 Announce Type: cross Abstract: We study the exact-basis problem for Solvability Complexity Index (SCI) computational problem families through finite-query transports. A raw finite-query reduction permits arbitrary encodings and finite transcript reconstructions, with only a continuous output decoder. For the Colbrook-Hansen (CH23) singleton-window spectral/pseudospectral block, this raw preorder collapses the expected two-source structure: the diagonal exact spectral and fixed-$\varepsilon$ pseudospectral sources are raw- and continuous-finite-query equivalent, and, for computable $\varepsilon$ under the evaluation-name representations, TTE-finite-query equivalent, so the six-problem ambient is raw-principal. We then introduce modal finite-query preorders, whose admissibility conditions may restrict encodings, decoders, reconstructions, uniformity, and geometric naturality. We also characterize TTE finite-query transport as computable point transport with a uniform finite interface trace; after forgetting the trace this gives strong Weihrauch reducibility, and the implication is strict. Under a CH23 geometric modality generated by representation inclusions, unitary and graph relabelings, and neutral stabilizations, the same ambient has exactly two minimal exact sources. This gives a calibrated reformulation of the exact-basis problem: natural SCI families should be classified by modality-indexed exact bases and refinement maps, not by one raw preorder alone.

ProtoX-AD: Self-Explainable Time Series Anomaly Detection and Characterization

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13277v1 Announce Type: cross Abstract: Recent advances in time series anomaly detection (TSAD) have highlighted the effectiveness of self-supervised classification-based approaches. These methods apply transformations to normal training samples, training a classifier to recognize transformation-specific patterns that help identify anomalies through increased classification errors. Despite their strong performance, a significant challenge is their lack of explainability, as they provide limited insight into the characteristics of flagged anomalies. To address this limitation, we propose ProtoX-AD, a prototype-based self-explainable framework for self-supervised TSAD. ProtoX-AD learns transformation-aware latent representations alongside interpretable prototypes, enabling both accurate anomaly detection and the identification of distinct anomalous profiles through prototype-based explanations. Additionally, it allows for systematic analysis of how transformation design impacts detection performance and explainability. Experimental results on synthetic and real-world datasets demonstrate that ProtoX-AD achieves detection performance comparable to its black-box counterparts while offering more consistent and semantically meaningful explanations than existing explainable baselines. Our code is publicly available at https://github.com/Aitorzan3/ProtoX-AD.

Simultaneous Latent Budget Trees for Stratified Classification

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13295v1 Announce Type: cross Abstract: In the era of Explainable Artificial Intelligence, there is a renewed focus on single trees for their ease of interpretation. This paper introduces Simultaneous Latent Budget Trees, a probabilistic machine learning framework for classification trees in the presence of a stratification factor such as a temporal, spatial, or demographic variable, acting as a control variable or potential confounder. Standard tree growth procedures are not designed to optimize a conditional split rule. A model-based split rule is proposed in which child nodes are interpreted as latent components of a simultaneous mixture model, such as the Simultaneous Latent Budget Model and its constrained versions, fitted to the parent node. Mixing parameters drive the observations, differently for each group, to the child nodes whereas latent budgets parameters update the response classes profile of each level of the control variable. Parameters are estimated by least squares considering a neural network perspective of the model. An informative tree structure can be interactively visualized with interpretation aids on the node and the paths, including visual pruning and decision tree selection procedure. Suitable measures are proposed to handle an unbalanced response class distribution. The proposed methodology is applied to investigate gender-related differences in disease progression of Amyotrophic Lateral Sclerosis. The SLBT library with the various tree-based algorithms is available in the linked GitHub repository.

Extended Frege proofs, circuits and rewriting

Jan Krajicek — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13367v1 Announce Type: cross Abstract: Inspired by a statement about Extended Frege proof systems by Jain and Jin (FOCS 2022) we prove that: - there is a p-time binary relation $\approx$ between circuits that implies their logical equivalence, - the relation $\approx$ implies that each of the two circuits can be rewritten into the other one by possibly deleting some gates and adding at most seven new gates, - if the equivalence $C \equiv D$ has a size $s$ proof in an Extended Frege or a Circuit Frege proof system then there is a chain of circuits $E_i$ $$ C = E_0 \approx \dots \approx E_t = D $$ with $t \le s^{O(1)}$.

An LLM System for Autonomous Variational Quantum Circuit Design

Kenya Sakka, Wataru Mizukami, Kosuke Mitarai — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13380v1 Announce Type: cross Abstract: The design of high-performing quantum circuits remains largely dependent on human expertise. We introduce an autonomous agentic framework that employs large language models (LLMs) to conduct iterative quantum circuit designs under explicit design constraints. Our system integrates seven components: Exploration, Generation, Discussion, Validation, Storage, Evaluation, and Review. These components form a closed-loop workflow that combines web-based knowledge acquisition, literature-grounded critique, executable code generation, and experimental feedback. We evaluate the framework on two tasks: quantum feature map construction for quantum machine learning and ansatz generation for variational quantum eigensolver applications in quantum chemistry. In image classification benchmarks, the best generated feature map outperforms representative quantum feature maps and, when scaled to larger qubit counts, surpasses the classical radial basis function kernel. In molecular ground state estimation across seven molecules, the generated ansatz attains competitive accuracy with widely used chemically inspired and hardware-efficient constructions while satisfying the imposed scaling constraints. These results establish LLM-driven agentic systems as a viable paradigm for automated quantum circuit design and illustrate how AI systems can participate in iterative scientific optimization workflows across scientific domains.

Foundations of Practical Quantum Advantage in Quantum-Informed Machine Learning for Predicting Chaos

Maida Wang, Xiao Xue, Minh Chung, Peter V. Coveney — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13422v1 Announce Type: cross Abstract: We develop theoretical foundations for a practical quantum-advantage mechanism in quantum-informed machine learning for chaotic dynamical systems. A family of k-indexed higher-order quantum statistical priors (Q-Priors) hosts the k-point marginal of the invariant measure on n_q = kq qubits, extending the single-site construction of prior work. We prove a two-stage advantage. In the representation stage, superposition and entanglement compactly store non-factorisable spatial correlations of the invariant measure on n_q qubits. In the extraction stage, joint Bell measurements on two copies estimate any post hoc Pauli functional with a copy-pair count independent of n_q, whereas any adaptive single-copy protocol for the corresponding full-Pauli read-out requires Omega(2^(n_q)) copies; this is a provable quantum-classical separation in copy-measurement complexity. The two-copy read-out is realised in simulation and on IQM superconducting processors. Two case studies instantiate the mechanism in workflows of independent scientific value: a turbulent channel-flow study in which the two-copy read-out yields a named non-diagonal correlator of the invariant measure (the velocity-direction coherence), and a medium-range weather forecasting workflow on the European Centre for Medium-Range Weather Forecasts ERA5 reanalysis in which the diagonal k <= 2 Q-Prior steers a Koopman rollout, improves anomaly-correlation skill by 10-39% across 48-240 h lead times, and reduces the long-horizon collapse of rollouts onto a static mean field. The two conditions of our practical-advantage definition are met at complementary levels, identifying a candidate route to practical quantum advantage before fault-tolerant hardware.

Endpoint Anticipation for Low-Latency Spoken Dialogue

Sathvik Udupa, Shinji Watanabe, Petr Schwarz, Jan Cernocky — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13450v1 Announce Type: cross Abstract: While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints up to 2.56 seconds in advance, enabling speculative execution of LLM and TTS pipelines on partial context. We introduce metrics to quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework demonstrates a 505 ms average latency reduction with a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.

Optical Implementation of Equilibrium Propagation Using Spatial Photonic Ising Machines

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13454v1 Announce Type: cross Abstract: Equilibrium Propagation offers a compelling alternative to traditional machine learning for training energy-based networks. Here we demonstrate a hybrid optical-digital implementation of EP using a Spatial Photonic Ising Machine (SPIM). The SPIM exploits the gauge transformation method to optically encode both continuous neuron states and rank-1 binary trainable patterns as phase modulations via a spatial light modulator, with inference realized using a finite difference scheme. The experimental system is evaluated on the Wine classification dataset. The potential of this approach, including the use of continuous couplings and structured coupling matrices, is evaluated numerically on the more complex MNIST dataset. Our work provides a concrete pathway toward energy-efficient physical implementations of Equilibrium Propagation.

AgentRivet: an automated system for producing Rivet routines from journal publications

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13535v1 Announce Type: cross Abstract: Particle physics collider experiments provide Rivet routines as part of the analysis preservation strategy for model-independent measurements. Rivet is a C++ toolkit that allows new theoretical models to be compared to the measurements, thus aiding the development and tuning of Monte Carlo event generators as well as searches for physics beyond the Standard Model. However, analysis coverage is known to be incomplete, with only 39% of measurements having documented and publicly available Rivet routines. In this article, we design and implement an automated workflow based on Large Language Models with the goal of providing the missing routines. This multi-step workflow, referred to as AgentRivet, extracts the physics analysis information from published papers and writes the missing Rivet routines, with intermediate code and physics reviews as part of an autonomous quality control. We report the results obtained using commercial Large Language Models, provided by OpenAI, Anthropic, and Google, for two recent measurements from the ATLAS and CMS experiments. We find that AgentRivet produces competent Rivet routines with few syntax errors. The physics fidelity of the routines is reasonable and follows the explanations given in the relevant publications. Nevertheless, physics-implementation issues do arise and are investigated using the artefacts produced by AgentRivet. The majority of physics-implementation issues arise from subtle-but-ambiguous definitions in the given publication, although some models struggle to implement complex observables even when clear definitions are given.

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13544v1 Announce Type: cross Abstract: Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show improved turn-taking precision by over 40% and recall by more than 70%, while substantially reducing false-positive interruptions compared to non-role-conditioned baselines.

Price Elasticity of Gas Demand on L1 and L2: Evidence from Ethereum and Arbitrum

Pranay Anchuri, Akaki Mamageishvili — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13555v1 Announce Type: cross Abstract: We estimate the causal price elasticity of gas demand on Ethereum mainnet (L1) and Arbitrum One (L2), a quantity necessary for calibrating fee mechanism simulations, evaluating resource pricing reforms, and explaining observed usage patterns. A two-way fixed effects panel regression instrumented by each wallet's own lagged base fee removes the congestion-driven endogeneity that causes naive regressions to substantially underestimate demand sensitivity. On Ethereum mainnet (full year 2025), the pooled IV elasticity is -0.006***, near-inelastic: a 10% fee increase reduces total gas demand by approximately 0.06%. On Arbitrum One (October 2025--April 2026), the pooled IV elasticity is -0.036**. Both chains are inelastic in the aggregate, with L2 measurably more responsive than L1. A per-resource decomposition of L2 demand reveals elasticities ranging from modestly elastic computation (-0.027*) to -0.27*** for refunds, with storage growth (-0.15***) and calldata (-0.06*) in between. Behavioral clustering identifies always-on protocol wallets as near-inelastic and high-volume operators as substantially more responsive, with cluster-level elasticities up to roughly 6x the pooled estimate. These results establish an empirical foundation for downstream simulations and for evaluating fee mechanism designs.
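
The arithmetic behind the statement that a 10% fee increase reduces gas demand by approximately 0.06% can be checked with the usual first-order reading of an elasticity. This is a minimal sketch; the helper name `demand_change_pct` is illustrative and not taken from the paper, and the linear approximation is only reasonable for small price changes.

```python
def demand_change_pct(elasticity: float, fee_change_pct: float) -> float:
    """First-order approximation: % change in demand = elasticity * % change in fee."""
    return elasticity * fee_change_pct

# Pooled IV elasticities quoted in the abstract.
l1_change = demand_change_pct(-0.006, 10.0)   # Ethereum mainnet (L1)
l2_change = demand_change_pct(-0.036, 10.0)   # Arbitrum One (L2)

print(f"L1: {l1_change:.2f}%  L2: {l2_change:.2f}%")
```

Under the same approximation, the L2 estimate implies a response six times larger than on L1 for an identical fee change, although both remain far below the unit-elastic threshold of 1 in absolute value.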

Approximability limits for bounded-degree max-LINSAT and implications for decoded quantum interferometry

Maximilian J. Kramer, Carsten Schubert, Jens Eisert — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13570v1 Announce Type: cross Abstract: For general max-k-XORSAT with $k \geq 3$, no polynomial-time algorithm can do substantially better than random guessing on worst-case instances unless $\mathsf{P} = \mathsf{NP}$: approximating beyond the random-assignment value of $1/2$ is $\mathsf{NP}$-hard. The picture changes when each variable appears in at most $D$ constraints. In that bounded-degree setting, polynomial-time algorithms can provably beat the random baseline by an additive amount of order $1/\sqrt{D}$. For Boolean instances, this scaling is known to be optimal: the matching hardness result is due to Trevisan, while the corresponding algorithmic guarantee was established by Barak et al. Whether the same holds over general finite fields, and what it implies for quantum algorithms, has not been established. We make this connection explicit and extend the hardness to max-E$k$-LINSAT$(q,r)$ with bounded degree $D$ and over arbitrary finite fields $\mathbb{F}_q$, proving that it is $\mathsf{NP}$-hard to exceed $r/q + \mathcal{O}_{q,r}(1/\sqrt{D})$. These results provide the complexity-theoretic benchmark for the bounded-degree instances targeted by decoded quantum interferometry (DQI), QAOA, and classical heuristics. Any quantum advantage on bounded-degree instances is therefore confined to the constant prefactor. We further show that in the context of DQI and on $(k,D)$-regular instances, this prefactor is sensitive to the nature of the decoder: DQI with classical decoders faces an information-theoretic $1/\sqrt{D \log D}$ barrier that prevents it from matching the hardness scaling, while DQI with quantum decoders is compatible with the $1/\sqrt{D}$ scaling -- identifying quantum decoding as the key ingredient for matching the complexity-theoretic scaling with DQI.

Differential Geometric Conditions for Koopman Linearizability of Control-Affine Systems

Shankar A. Deka — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13577v1 Announce Type: cross Abstract: Koopman linearization opens many possibilities for control synthesis and analysis of nonlinear systems. Whether or not any given nonlinear control system admits a finite-dimensional Koopman representation remains a crucial question to address. A related problem is to categorize the class of all Koopman linearizable nonlinear control systems. In this work, we present differential geometric conditions on the drift and control vector fields of a control-affine nonlinear system that must necessarily be satisfied for a Koopman linearizing transformation to exist. The same conditions are also shown to be sufficient for (a slightly weaker notion of) Koopman linearizability on control-invariant manifolds. Further, these conditions, together with an additional condition, become necessary and sufficient for Koopman linearizability to a controllable linear system. Our examples illustrate the ease of checking these conditions, and also shed light on how a Koopman linearizing transformation may not exist for a control-affine system even though one can linearize the autonomous part of the system via Koopman lifting.

Max-Min Secrecy Rate Optimization for Secure ISAC Networks: Global Optimization and Low-Complexity Algorithm

Thanh-Nha To, Trung Quang Pham, Dang Y Hoang, Hoang-Lai Pham, Tuan Anh Pham — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13582v1 Announce Type: cross Abstract: In this paper, we investigate a secure integrated sensing and communication (ISAC) system in which multiple communication users (CUs) coexist with multiple untrusted sensing users (SUs) that may eavesdrop on the confidential information intended for the CUs. To promote security fairness among users, we formulate a max-min secrecy rate optimization problem subject to a transmit power budget and sensing quality requirements characterized by beampattern matching error constraints. The resulting design problem is highly non-convex due to the secrecy rate expressions and non-convex sensing constraints. To address these challenges, we first reformulate the problem using semidefinite relaxation (SDR). Based on the reformulated problem, we develop a branch-and-bound (BB) framework combined with convex relaxations to obtain the globally optimal solution within a prescribed accuracy. To further reduce computational complexity, we propose a low-complexity algorithm based on successive convex approximation (SCA), which iteratively solves a sequence of convex subproblems and converges to a local solution. Numerical results demonstrate that the proposed BB algorithm achieves the global optimum and provides a benchmark for performance evaluation. Moreover, the proposed SCA-based algorithm attains near-optimal secrecy performance with significantly lower computational complexity, making it attractive for practical ISAC deployments.

Distribution-Agnostic Robust Trajectory Optimization via Chance-Constrained Reinforcement Learning

Yashdeep Chaudhary, Roberto Armellin, Harry Holt, Marco Sagliano — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13605v1 Announce Type: cross Abstract: This paper presents a distribution-agnostic robust trajectory-optimization framework based on chance-constrained reinforcement learning. The uncertainty is represented here through initial conditions and process noise, with the only requirement being that it can be sampled. A deterministic nominal trajectory is first computed offline, and reinforcement learning is then used only to robustify that baseline through a structured affine closed-loop correction law comprising a feedforward control adjustment and time-varying feedback gains. Probabilistic feasibility is enforced empirically through rollout-based upper-tail quantiles, while terminal dispersion is regulated through covariance-feasibility penalties. The framework is assessed on two materially different trajectory design problems. The flagship case study is a three-dimensional multi-impulse Earth-Mars transfer, where the learned policy is benchmarked against a recent robust trajectory-optimization reference under Gaussian uncertainty and then evaluated under bounded uniform uncertainty and under process disturbances not seen during training. The second case study is a stochastic atmospheric pinpoint rocket landing problem, used to assess portability to a short-horizon continuous-thrust setting with drag, mass depletion, and glide-slope constraints. The results show that the proposed framework can remain competitive in upper-tail fuel cost while preserving probabilistic feasibility, and that the same robustification scaffold can be carried across heterogeneous spacecraft trajectory planning problems without redesign of its core stochastic-control structure.

Majority-of-Three is Optimal

Divit Rawal, Nikita Zhivotovskiy — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13614v1 Announce Type: cross Abstract: We give a short proof that the majority vote of three independent consistent classifiers is an optimal learner in the realizable PAC setting. This proves optimality for the simplest voting scheme, while simplifying both the algorithmic structure and the probabilistic analysis of previous voting learners, including the algorithm of S. Hanneke and the analysis of bagging by K. Green Larsen.
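
The voting scheme named in the title can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's construction: the hypothesis class (one-dimensional thresholds), the helper names `fit_threshold` and `majority_of_three`, and the choice to train each classifier on a disjoint third of the sample are all illustrative.

```python
from typing import Callable, List, Tuple

Sample = List[Tuple[float, int]]
Classifier = Callable[[float], int]

def fit_threshold(sample: Sample) -> Classifier:
    """Consistent learner for thresholds: predict 1 iff x >= smallest positive example."""
    positives = [x for x, y in sample if y == 1]
    t = min(positives) if positives else float("inf")
    return lambda x: 1 if x >= t else 0

def majority_of_three(h1: Classifier, h2: Classifier, h3: Classifier) -> Classifier:
    """Predict the label that at least two of the three classifiers agree on."""
    return lambda x: 1 if h1(x) + h2(x) + h3(x) >= 2 else 0

# Realizable toy data: the true label is 1 iff x >= 0.5.
data: Sample = [(i / 30, 1 if i / 30 >= 0.5 else 0) for i in range(30)]

# Assumed splitting rule: three disjoint, interleaved thirds of the sample.
parts = [data[0::3], data[1::3], data[2::3]]
vote = majority_of_three(*(fit_threshold(p) for p in parts))

print(vote(0.1), vote(0.9))
```

Each base classifier is consistent with its own third of the data, and the vote corrects any point on which only one of the three errs.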

Valid Inference with Synthetic Data via Task Exchangeability

Lezhi Tan, Tijana Zrnic — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.13629v1 Announce Type: cross Abstract: There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.

Work-Efficient Query Evaluation in Constant Time with PRAMs

Jens Keppeler, Thomas Schwentick, Christopher Spinrath — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2301.08178v5 Announce Type: replace Abstract: The article studies query evaluation in parallel constant time in the CRCW PRAM model. While it is well-known that all relational algebra queries can be evaluated in constant time on an appropriate CRCW PRAM model, this article is interested in the efficiency of evaluation algorithms, that is, in the number of processors or, asymptotically equivalent, in the work. Naive evaluation in the parallel setting results in huge (polynomial) bounds on the work of such algorithms and in presentations of the result sets that can be extremely scattered in memory. The article discusses some obstacles for constant-time PRAM query evaluation. It presents algorithms for relational operators and explores three settings, in which efficient sequential query evaluation algorithms exist: acyclic queries, semijoin algebra queries, and join queries -- the latter in the worst-case optimal framework. Under mild assumptions -- that data values are numbers of polynomial size in the size of the database or that the relations of the database are suitably sorted -- constant-time algorithms are presented that are weakly work-efficient in the sense that work $\mathcal{O}(T^{1+\varepsilon})$ can be achieved, for every $\varepsilon>0$, compared to the time $T$ of an optimal sequential algorithm. Important tools are the algorithms for approximate prefix sums and compaction from Goldberg and Zwick (1995).
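The link between prefix sums and compaction mentioned above can be sketched as follows. This is a sequential version with exact prefix sums, written only to show the idea; it is not the constant-time approximate algorithm of Goldberg and Zwick, and the name `compact` and the use of `None` for empty cells are assumptions for this example.

```python
from itertools import accumulate

def compact(cells):
    """Move all non-empty cells to the front of a new array, preserving order.

    On a PRAM each occupied cell would be written by its own processor;
    here the writes happen one after another.
    """
    flags = [0 if c is None else 1 for c in cells]
    inclusive = list(accumulate(flags))  # inclusive[i] = occupied cells in cells[0..i]
    out = [None] * (inclusive[-1] if inclusive else 0)
    for i, c in enumerate(cells):
        if flags[i]:
            out[inclusive[i] - 1] = c  # the prefix sum gives the target position
    return out

scattered = [None, "a", None, "b", "c", None]
dense = compact(scattered)
```

Because each occupied cell finds its target index from the prefix sum alone, the writes do not depend on one another, which is what makes the step parallelizable.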

Cybersecurity Threat Hunting and Vulnerability Analysis Using a Neo4j Graph Database of Open Source Intelligence

Elijah Pelofske, Lorie M. Liebrock, Vincent Urias — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2301.12013v3 Announce Type: replace Abstract: Open source intelligence is a powerful tool for cybersecurity analysts to gather information both for analysis of discovered vulnerabilities and for detecting novel cybersecurity threats and exploits. Here, we present a Neo4j graph database formed by shared connections (shared sub-string matches) between open source intelligence text including blogs, cybersecurity bulletins, news sites, antivirus scans, social media posts (such as Reddit and Twitter), and threat reports. These connections consist of possible indicators of compromise (IP addresses, domains, hashes, email addresses, phone numbers), information on known exploits and techniques (CVEs and MITRE ATT&CK Technique IDs), and potential sources of information on cybersecurity exploits such as Twitter usernames. The construction of the database of potential IOCs is detailed. Examples of utilizing the graph database for querying connections between known malicious IOCs and open source intelligence documents, including threat reports, are shown. We show that this type of relationship querying can allow for more effective use of open source intelligence for threat hunting, malware family clustering, and vulnerability analysis. We show four specific examples of interesting connections found in the graph database: the connections to a known exploited CVE, a known malicious IP address, a malware hash signature, and a portable executable shared resource file.
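The shared sub-string idea can be sketched in a few lines. The example below uses plain Python dictionaries rather than Neo4j, covers only two indicator types, and runs on three made-up documents; the regular expressions and the names `extract_indicators` and `shared_connections` are assumptions for illustration, not the authors' pipeline.

```python
import re
from collections import defaultdict
from itertools import combinations

# Simplified patterns (assumptions): they do not validate octet ranges or CVE years.
PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
}

def extract_indicators(text):
    """Return the set of (kind, value) indicators found in one document."""
    found = set()
    for kind, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            found.add((kind, match))
    return found

def shared_connections(docs):
    """Map each pair of document ids to the indicators they have in common."""
    by_indicator = defaultdict(set)
    for doc_id, text in docs.items():
        for indicator in extract_indicators(text):
            by_indicator[indicator].add(doc_id)
    edges = defaultdict(set)
    for indicator, ids in by_indicator.items():
        for a, b in combinations(sorted(ids), 2):
            edges[(a, b)].add(indicator)
    return dict(edges)

# Made-up documents; 192.0.2.10 lies in a range reserved for documentation.
docs = {
    "blog": "Scanning from 192.0.2.10 targeted CVE-2021-44228.",
    "bulletin": "Patch CVE-2021-44228 immediately.",
    "tweet": "Seeing traffic from 192.0.2.10 again.",
}
edges = shared_connections(docs)
```

In a graph database the documents and indicators would typically become nodes and each match an edge, so the same question turns into a path query instead of a dictionary lookup.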

On Approximating the Dynamic Response of Synchronous Generators via Operator Learning: A Step Towards Building Deep Operator-based Power Grid Simulators

Christian Moya, Amirhossein Mollaali, Guang Lin, Meng Yue — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2301.12538v2 Announce Type: replace Abstract: This paper develops an Operator Learning framework for approximating the dynamic response of synchronous generators. The framework can be used to (i) build a neural network-based generator model that interacts with a power grid simulator or (ii) shadow the true generator's transient response. First, we develop a data-driven Deep Operator Network (DeepONet) to approximate the infinite-dimensional solution operator of the generators. Then, we design a numerical scheme based on DeepONet that simulates the generator's response over a given time horizon. The proposed scheme recursively employs the trained DeepONet to simulate the response for a given multi-dimensional input that describes the interaction between the generator and the power grid. In addition, we design a residual DeepONet numerical scheme that can incorporate information from existing mathematical models. We accompany this residual DeepONet scheme with an estimate for the prediction's cumulative error. Finally, we build a data aggregation (DAgger) strategy that allows fine-tuning of DeepONets using aggregated training data that the DeepONets will likely encounter during interactive simulations with other grid components. As a proof of concept, we demonstrate that the proposed frameworks can effectively approximate the transient model of a synchronous generator.

On Pitfalls of $\textit{RemOve-And-Retrain}$: Data Processing Inequality Perspective

Junhwa Song, Keumgang Cha, Junghoon Seo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2304.13836v5 Announce Type: replace Abstract: The RemOve-And-Retrain (ROAR) benchmark is widely used to evaluate feature attribution methods, yet its validity remains underexplored from an information-theoretic perspective. We show that model- and data-agnostic post-processing of attribution maps (transformations that, by the data processing inequality, cannot add information about the decision function) can often improve ROAR scores. This means that an improved ROAR ranking is not, by itself, evidence that an attribution map carries more information about the model. We trace this failure mode to a bias toward spatially blurry masks. Experiments on CIFAR-10, SVHN, and CUB-200 show a consistent association between blurriness and ROAR performance, a pattern that also appears in the ROAD variant. We provide guidelines for more cautious removal-based benchmarking, with implications for validating mechanistic understanding of neural network internals.

ResidualPlanner+: a scalable matrix mechanism for marginals and beyond

Guanlin He, Yingtai Xiao, Levent Toksoz, Zeyu Ding, Danfeng Zhang, Daniel Kifer — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2305.08175v5 Announce Type: replace Abstract: Noisy marginals are a common form of confidentiality-protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy mechanisms that provide unbiased noisy answers to linear queries (such as marginals) are known as matrix mechanisms. We propose ResidualPlanner and ResidualPlanner+, two highly scalable matrix mechanisms. ResidualPlanner is both optimal and scalable for answering marginal queries with Gaussian noise, while ResidualPlanner+ provides support for more general workloads, such as combinations of marginals and range queries or prefix-sum queries. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large-scale settings in seconds, even when the previous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore, ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets). ResidualPlanner+ provides support for more complex workloads that combine marginal and range/prefix-sum queries (e.g., a marginal on race, a range query on age, and a combined race/age tabulation that answers age range queries for each race). It even supports custom user-defined workloads on different attributes. With this added flexibility, ResidualPlanner+ is not necessarily optimal; however, it is still extremely scalable and outperforms the prior state of the art (HDMM) on prefix-sum queries in terms of both accuracy and speed.

A Proof-theoretic Semantics for Intuitionistic Linear Logic

Yll Buzoku — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2402.01982v5 Announce Type: replace Abstract: The approach taken by Gheorghiu, Gu and Pym in their paper on giving a base-extension semantics for Intuitionistic Multiplicative Linear Logic is an interesting adaptation of the work of Sandqvist for IPL to the substructural setting. What is particularly interesting is how naturally the move to the substructural setting provided a semantics for the multiplicative fragment of intuitionistic linear logic. Whilst Gheorghiu, Gu and Pym ultimately used their foundations to provide a semantics for bunched implication logic, this raises the question: what of the rest of intuitionistic linear logic? In this paper, I present just such a semantics. This is of particular interest because the logic has as a connective the bang, a modal connective. Capturing the inferentialist content of formulae marked with this connective is particularly challenging, and a discussion is dedicated to this at the end of the paper.

Assembling ensembling: An adventure in approaches across disciplines

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2405.02599v3 Announce Type: replace Abstract: Model ensembling, or ensemble modeling, is a term that arises across numerous disciplines, yet what is meant by it can vary drastically. The very meaning of 'ensemble' - a collection together - conjures different ideas even within disciplines when approaching phenomena. For example, one might think of a set of descriptions of a phenomenon in the world, perhaps a time series or a snapshot of multivariate space, and perhaps that set consists of data-independent descriptions, or perhaps it is quite intentionally fit *to* data, or is even a suite of data sets with a common theme or intention. Recently, ensemble models have appeared widely across applications, for disease forecasting, environmental suitability modeling, and more. In this piece, we present a typology of the scope of potential perspectives across disciplines to disambiguate terms, concepts, and processes associated with 'ensembles' and 'ensembling'. We do not provide an exhaustive review, nor do we recommend that all disciplines adopt a common suite of terms; instead, we focus on facilitating communication, awareness, identification of gaps, and adoption of tools to avoid independent efforts to reinvent the wheel across disciplines. To anchor our discussion, we provide a Shiny App to contain the typology, with a living collection, or compendium, of example publications about ensembles.

Geometry of Lightning Self-Attention: Identifiability and Dimension

Nathan W. Henry, Giovanni Luca Marchetti, Kathlén Kohn — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2408.17221v3 Announce Type: replace Abstract: We consider function spaces defined by self-attention networks without normalization, and theoretically analyze their geometry. Since these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by providing a description of the generic fibers of the parametrization for an arbitrary number of layers and, as a consequence, compute the dimension of the function space. Additionally, for a single-layer model, we characterize the singular and boundary points. Finally, we formulate a conjectural extension of our results to normalized self-attention networks, prove it for a single layer, and numerically verify it in the deep case.

Simply-typed constant-domain modal lambda calculus I: distanced beta reduction and combinatory logic

Sean Walsh — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2410.17463v4 Announce Type: replace Abstract: A system $\boldsymbol\lambda_{\theta}$ is developed that combines modal logic and simply-typed lambda calculus, and that generalizes the system studied by Montague and Gallin. Whereas Montague and Gallin worked with Church's simple theory of types, the system $\boldsymbol\lambda_{\theta}$ is developed in the typed base theory most commonly used today, namely the simply-typed lambda calculus. Further, the system $\boldsymbol\lambda_{\theta}$ is controlled by a parameter $\theta$ which allows more options for state types and state variables than are present in Montague and Gallin. A main goal of the paper is to establish some basic metatheory of $\boldsymbol\lambda_{\theta}$: (i) an Andrews-like characterization of its models in terms of combinatory logic is given, and this combinatory logic involves a $\mathsf{BCKW}$-like basis rather than an $\mathsf{SKI}$-like basis, and (ii) semantic conservation and expressibility results relating $\boldsymbol\lambda_{\theta}$ to the maximal system $\boldsymbol\lambda_{\omega}$ are proven. Similar results are proven for the relation between $\boldsymbol\lambda_{\omega}$ and $\boldsymbol\lambda$, the corresponding ordinary simply-typed lambda calculus. This answers a question of Zimmermann in the semantics of the simply typed setting. In a companion paper this is extended to Church's simple theory of types. We further develop a partial correspondence between a pure combinatory logic centered on the $\mathsf{BCKW}$-like basis and the weak deductive system for $\boldsymbol\lambda_{\omega}$ wherein $\beta$-reduction is not allowed under a lambda abstract, and we use this to show partial deductive conservation between the maximal system $\boldsymbol\lambda_{\omega}$ and the intermediary systems $\boldsymbol\lambda_{\theta}$.

Competition and Diversity in Generative AI

Manish Raghavan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2412.08610v3 Announce Type: replace Abstract: Recent evidence, both in the lab and in the wild, suggests that the use of generative artificial intelligence reduces the diversity of content produced. The use of the same or similar AI models appears to lead to more homogeneous behavior. Our work begins with the observation that there is a force pushing in the opposite direction: competition. When producers compete with one another (e.g., for customers or attention), they are incentivized to create novel or unique content. We explore the impact competition has on both content diversity and overall social welfare. Through a formal game-theoretic model, we show that competitive markets select for diverse AI models, mitigating monoculture. We further show that a generative AI model that performs well in isolation (i.e., according to a benchmark) may fail to provide value in a competitive market. Our results highlight the importance of evaluating generative AI models across the breadth of their output distributions, particularly when they will be deployed in competitive environments. We validate our results empirically by using language models to play Scattergories, a word game in which players are rewarded for answers that are both correct and unique. Overall, our results suggest that homogenization due to generative AI is unlikely to persist in competitive markets, and instead, competition in downstream markets may drive diversification in AI model development.
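The incentive described above, where an answer is rewarded only if it is both correct and unique, can be made concrete with a small scoring function. The rule below (one point for a valid answer that no other player gave, zero otherwise), the validity check, and the player names are assumptions for illustration; the paper's exact experimental scoring may differ.

```python
from collections import Counter

def score_round(answers, is_valid):
    """Score one round: a player earns 1 point only for a valid, unique answer."""
    normalized = {player: a.strip().lower() for player, a in answers.items()}
    counts = Counter(normalized.values())
    return {
        player: int(is_valid(answer) and counts[answer] == 1)
        for player, answer in normalized.items()
    }

# Hypothetical round: name a fruit that starts with the letter "a".
starts_with_a = lambda answer: answer.startswith("a")
answers = {
    "model_1": "Apple",
    "model_2": "apple",    # same answer as model_1, so neither scores
    "model_3": "Apricot",  # valid and unique
    "model_4": "Banana",   # unique but invalid
}
scores = score_round(answers, starts_with_a)
```

Under this rule two players that always give the same most likely answer score nothing, even though each answer is correct in isolation.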

Learning Robot Safety from Sparse Human Feedback using Conformal Prediction

Aaron O. Feldman, Joseph A. Vincent, Maximilian Adang, JunEn Low, Mac Schwager — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2501.04823v2 Announce Type: replace Abstract: Ensuring robot safety can be challenging; user-defined constraints can miss edge cases, policies can become unsafe even when trained from safe data, and safety can be subjective. Thus, we learn about robot safety by showing policy trajectories to a human who flags unsafe behavior. From this binary feedback, we use the statistical method of conformal prediction to identify a region of states, potentially in learned latent space, guaranteed to contain a user-specified fraction of future policy errors. Our method is sample-efficient, as it builds on nearest neighbor classification and avoids withholding data as is common with conformal prediction. By alerting if the robot reaches the suspected unsafe region, we obtain a warning system that mimics the human's safety preferences with guaranteed miss rate. From video labeling, our system can detect when a quadcopter visuomotor policy will fail to steer through a designated gate. We present an approach for policy improvement by avoiding the suspected unsafe region. With it we improve a model predictive controller's safety, as shown in experimental testing with 30 quadcopter flights across 6 navigation tasks. Code and videos are provided.

Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes

Davide Barbieri, Matteo Bonforte, Peio Ibarrondo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2501.08425v3 Announce Type: replace Abstract: In this paper we analyze the behaviour of stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights via the minimization of non-convex loss functions. Since the pioneering work of E, Li and Tai (2017), the underlying structure of such processes can be understood via parabolic PDEs of Fokker-Planck type, which are at the core of our analysis. Although Fokker-Planck equations have a long history and an extensive literature, almost nothing is known when the potential is non-convex or when the diffusion matrix is degenerate, and this is the main difficulty that we face in our analysis. We identify two different regimes: in the initial phase of SGD, the loss function drives the weights to concentrate around the nearest local minimum. We refer to this phase as the drift regime and we provide quantitative estimates on this concentration phenomenon. Next, we introduce the diffusion regime, where stochastic fluctuations help the learning process to escape suboptimal local minima. We analyze the Mean Exit Time (MET) and prove upper and lower bounds on the MET. Finally, we address the asymptotic convergence of SGD for a non-convex cost function and a degenerate diffusion matrix, a setting that rules out the standard approaches and requires new techniques. For this purpose, we exploit two different methods: duality and entropy methods. We provide new results about the dynamics and effectiveness of SGD, offering a deep connection between stochastic optimization and PDE theory, and some answers and insights to basic questions in Machine Learning processes: How long does SGD take to escape from a bad minimum? Do neural network parameters converge using SGD? How do parameters evolve in the first stage of training with SGD?

Fourier Multi-Component and Multi-Layer Neural Networks: Unlocking High-Frequency Potential

Shijun Zhang, Hongkai Zhao, Yimin Zhong, Haomin Zhou — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2502.18959v3 Announce Type: replace Abstract: The architecture of a neural network and the choice of its activation function are both fundamental to its performance. Equally important is ensuring that these two elements are well matched, as their alignment is key to effective representation and learning. In this paper, we introduce the Fourier Multi-Component and Multi-Layer Neural Network (FMMNN), a model that combines sine-type activations with the multi-component and multi-layer structure of MMNNs. In an FMMNN, each component is represented as a trainable linear combination of fixed random sine-type basis functions, while multi-layer composition generates more complex and adaptive high-frequency features. We establish that FMMNNs retain exponential expressive power for function approximation even under a low-rank architectural structure. We also analyze the optimization landscape of FMMNNs and find it to be substantially more favorable than that of standard fully connected neural networks, especially for high-frequency targets. In addition, we propose a scaled random initialization method for the first-layer weights in FMMNNs, which accelerates training and improves final performance when sufficient samples are available. Extensive numerical experiments support our theoretical insights, showing that FMMNNs achieve strong accuracy and favorable convergence behavior on oscillatory function-approximation benchmarks.

WildIFEval: Instruction Following in the Wild

Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2503.06573v3 Announce Type: replace Abstract: Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval - a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have substantial room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing interesting patterns of model constraint-following behavior. We release our dataset to promote further research on instruction-following under complex, realistic conditions.

Data-Driven Soft Robot Control via Adiabatic Spectral Submanifolds

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2503.10919v3 Announce Type: replace Abstract: The mechanical complexity of soft robots creates significant challenges for their model-based control. Specifically, linear data-driven models have struggled to control soft robots on complex, spatially extended paths that explore regions with significant nonlinear behavior. To account for these nonlinearities, we develop here a model-predictive control strategy based on the recent theory of adiabatic spectral submanifolds (aSSMs). This theory is applicable because the internal vibrations of heavily overdamped robots decay at a speed that is much faster than the desired speed of the robot along its intended path. In that case, low-dimensional attracting invariant manifolds (aSSMs) emanate from the path and carry the dominant dynamics of the robot. Aided by this recent theory, we devise an aSSM-based model-predictive control scheme purely from data. We demonstrate the effectiveness of our data-driven model in tracking dynamic trajectories across diverse tasks. We validate on high-fidelity, high-dimensional finite-element models of a soft trunk robot and Cosserat-rod-based elastic soft arms, with additional experiments confirming robust performance even in the presence of experimental noise. Notably, we find that five- or six-dimensional aSSM-reduced models outperform other data-driven modeling methods in tracking performance by a factor of up to 10 across all closed-loop control tasks.

Oncomorphic neural agent populations for resource-limited sequential learning

Philip Greulich, Michael Levin, Rosalia Moreddu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2503.12743v3 Announce Type: replace Abstract: Distributed artificial intelligence (AI) often operates under sequential task exposure, uneven compute, and decentralized coordination. Here, we present a cancer-inspired, or oncomorphic, multi-agent framework in which simulated neural agents can replicate, mutate their neural network architecture, migrate across task environments, undergo ecological turnover, and recruit learning/ecological resources from a finite shared reserve. We evaluate the framework in controlled synthetic nonlinear classification environments in which each agent trains only on its local task, allowing population ecology rather than centralized optimization to determine which neural network architectures persist. For various initial conditions, we find that stronger selection increased the endpoint local accuracy of surviving agent populations. Architecture mutation played a state-dependent role: diverse initial populations performed best at low mutation, whereas clonal large-architecture populations benefited from mutation-generated variation. Selection also increased end-of-run multi-task competence, measured by evaluating surviving agents on all environments without additional training. Recruitment and elevated baseline replication reshaped demographic support while prediction quality remained within a narrow band, consistent with redistribution of finite learning resources. Time-resolved entropy and dominance analyses revealed concentration toward successful architectures, while finite training cycles kept agents in a non-asymptotic learning regime. These results provide proof-of-concept mechanistic evidence that oncomorphic population dynamics may offer a route to decentralized adaptation in engineering applications under bounded local resources.

Radar-Guided Polynomial Fitting for Metric Depth Estimation

Patrick Rim, Hyoungseob Park, Vadim Ezhov, Jeffrey Moon, Alex Wong — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2503.17182v5 Announce Type: replace Abstract: We propose POLAR, a novel radar-guided depth estimation method that introduces polynomial fitting to efficiently transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a fundamental insight: although MDE models often infer reasonable local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale and shift (affine) transformation insufficient given three or more of these regions. To address this limitation, we use polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust predictions non-uniformly across depth ranges. In this way, POLAR generalizes beyond affine transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces local monotonicity via first-derivative regularization. POLAR achieves state-of-the-art performance across three datasets, outperforming existing methods by an average of 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in terms of latency and computational cost.
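The claim that a single scale and shift becomes insufficient once three or more regions are misaligned can be checked on a toy example. The sketch below interpolates a quadratic through three hand-picked (scaleless prediction, metric depth) anchors; the anchor values and function names are made up, and it does not reproduce POLAR, which predicts polynomial coefficients from radar data and regularizes for monotonicity.

```python
def affine_through(p0, p1):
    """Scale-and-shift map that passes exactly through two anchors."""
    (x0, y0), (x1, y1) = p0, p1
    scale = (y1 - y0) / (x1 - x0)
    shift = y0 - scale * x0
    return lambda x: scale * x + shift

def lagrange_poly(points):
    """Polynomial of degree len(points) - 1 passing through all anchors."""
    def evaluate(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return evaluate

# Hypothetical anchors: (scaleless prediction, metric depth in metres) for three regions.
anchors = [(0.2, 2.0), (0.5, 8.0), (0.9, 30.0)]

affine = affine_through(anchors[0], anchors[2])  # fits the first and last region only
poly = lagrange_poly(anchors)                    # quadratic through all three

affine_error = abs(affine(0.5) - 8.0)            # middle region is left misaligned
poly_error = max(abs(poly(x) - y) for x, y in anchors)
```

With these anchors the affine map places the middle region several metres off, while the quadratic matches all three; a higher-degree fit can bend where a line cannot, which is the extra freedom the abstract refers to.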

Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2504.21561v5 Announce Type: replace Abstract: Multimodal agents, which integrate a controller (e.g., a vision language model) with external tools, have demonstrated remarkable capabilities in tackling complex multimodal tasks. Existing approaches for training these agents, both supervised fine-tuning and reinforcement learning, depend on extensive human-annotated task-answer pairs and tool trajectories. However, for complex multimodal tasks, such annotations are prohibitively expensive or impractical to obtain. In this paper, we propose an iterative tool usage exploration method for multimodal agents without any pre-collected data, namely SPORT, via step-wise preference optimization to refine the trajectories of tool usage. Our method enables multimodal agents to autonomously discover effective tool usage strategies through self-exploration and optimization, eliminating the bottleneck of human annotation. SPORT has four iterative components: task synthesis, step sampling, step verification, and preference tuning. We first synthesize multimodal tasks using language models. Then, we introduce a novel trajectory exploration scheme, where step sampling and step verification are executed alternately to solve synthesized tasks. In step sampling, the agent tries different tools and obtains corresponding results. In step verification, we employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller for tool usage through preference tuning, producing a SPORT agent. By interacting with real environments, the SPORT agent gradually evolves into a more refined and capable system. Evaluation on the GTA and GAIA benchmarks shows that the SPORT agent achieves 6.41% and 3.64% improvements, respectively, underscoring the generalization and effectiveness introduced by our method. The project page is https://SPORT-Agents.github.io.

Visual enhancement and 3D representation for underwater scenes: a review

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2505.01869v2 Announce Type: replace Abstract: Underwater visual enhancement (UVE) and underwater 3D reconstruction pose significant challenges in computer vision and AI-based tasks due to complex imaging conditions in aquatic environments. Despite the development of numerous enhancement algorithms, a comprehensive and systematic review covering both UVE and underwater 3D reconstruction remains absent. To advance research in these areas, we present an in-depth review from multiple perspectives. First, we introduce the fundamental physical models, highlighting the peculiarities that challenge conventional techniques. We survey advanced methods for visual enhancement and 3D reconstruction specifically designed for underwater scenarios. The paper assesses various approaches from non-learning methods to advanced data-driven techniques, including Neural Radiance Fields and 3D Gaussian Splatting, discussing their effectiveness in handling underwater distortions. We then conduct both quantitative and qualitative evaluations of state-of-the-art UVE and underwater 3D reconstruction algorithms across multiple benchmark datasets. Finally, we highlight key research directions for future advancements in underwater vision.

Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2505.04021v3 Announce Type: replace Abstract: Inference providers must maintain availability for many LLMs, including low-volume but essential models, making resource efficiency increasingly important as token prices fall. Analysis of production traces reveals a dynamic bursty-group pattern in which sets of models become active together and shift over time; existing space- and time-sharing approaches lack principled mechanisms to adapt to this variability, forcing trade-offs between SLO adherence and efficiency. We observe that elastic memory allocation can unify spatial and temporal sharing. Based on this insight, we have developed Prism, a memory-centric LLM co-serving framework that applies memory ballooning to reclaim memory across models and support both forms of sharing under a single scheme. Prism's balloon driver, referred to as kvcached, has been open-sourced at https://github.com/ovg-project/kvcached, and deployed in production environments across 10K+ GPUs.

Learning on a Razor's Edge: Identifiability and Singularity of Polynomial Neural Networks

Vahid Shahverdi, Giovanni Luca Marchetti, Kathlén Kohn — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2505.11846v3 Announce Type: replace Abstract: We study function spaces parametrized by neural networks, referred to as neuromanifolds. Specifically, we focus on deep Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) with an activation function that is a sufficiently generic polynomial. First, we address the identifiability problem, showing that, for almost all functions in the neuromanifold of an MLP, there exist only finitely many parameter choices yielding that function. For CNNs, the parametrization is generically one-to-one. As a consequence, we compute the dimension of the neuromanifold. Second, we describe singular points of neuromanifolds. We characterize singularities completely for CNNs, and partially for MLPs. In both cases, they arise from sparse subnetworks. For MLPs, we prove that these singularities often correspond to critical points of the mean-squared error loss, which does not hold for CNNs. This provides a geometric explanation of the sparsity bias of MLPs. All of our results leverage tools from algebraic geometry.

Lightweight and Interpretable Transformer via Mixed Graph Algorithm Unrolling for Traffic Forecast

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2505.13102v4 Announce Type: replace Abstract: Unlike conventional "black-box" transformers with classical self-attention mechanisms, we build a lightweight and interpretable transformer-like neural net by unrolling a mixed-graph-based optimization algorithm to forecast traffic with spatial and temporal dimensions. We construct two graphs: an undirected graph $\mathcal{G}^u$ capturing spatial correlations across geography, and a directed graph $\mathcal{G}^d$ capturing sequential relationships over time. We predict future samples of signal $\mathbf{x}$, assuming it is "smooth" with respect to both $\mathcal{G}^u$ and $\mathcal{G}^d$, where we design new $\ell_2$ and $\ell_1$-norm variational terms to quantify and promote signal smoothness (low-frequency reconstruction) on a directed graph. We design an iterative algorithm based on the alternating direction method of multipliers (ADMM), and unroll it into a feed-forward network for data-driven parameter learning. We periodically insert graph learning modules for $\mathcal{G}^u$ and $\mathcal{G}^d$ that play the role of self-attention. Experiments show that our unrolled networks achieve traffic forecast performance competitive with state-of-the-art prediction schemes, while reducing parameter counts drastically.
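As a rough illustration of what "smooth with respect to a graph" means, the sketch below evaluates the standard Laplacian quadratic form $\mathbf{x}^\top L \mathbf{x} = \sum_{(i,j)} w_{ij}(x_i - x_j)^2$ on an undirected graph. It covers only this textbook undirected case; the new directed-graph $\ell_2$ and $\ell_1$ terms mentioned in the abstract are not reproduced here, and the function name and edge-list format are assumptions made for the example.

```python
def laplacian_quadratic_form(x, weighted_edges):
    """Return x^T L x for an undirected graph given as (i, j, weight) triples.

    Each undirected edge must be listed once. Small values mean the signal
    varies little across strongly connected nodes, i.e. it is "smooth".
    """
    return sum(w * (x[i] - x[j]) ** 2 for i, j, w in weighted_edges)


# Hypothetical 3-node path graph 0 - 1 - 2 with unit weights.
path_edges = [(0, 1, 1.0), (1, 2, 1.0)]

smooth_signal = [1.0, 1.0, 1.0]    # constant signal: no variation across edges
rough_signal = [1.0, -1.0, 1.0]    # alternating signal: maximal variation

smooth_score = laplacian_quadratic_form(smooth_signal, path_edges)
rough_score = laplacian_quadratic_form(rough_signal, path_edges)
```

A smoothness prior of this kind penalizes `rough_signal` far more heavily than `smooth_signal`, which is the behavior a graph-regularized forecasting objective relies on.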

Convergence analysis of GMRES applied to Helmholtz problems near resonances

Victorita Dolean, Pierre Marchand, Axel Modave, Timothée Raynaud — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2505.16345v2 Announce Type: replace Abstract: The finite element solution of Helmholtz problems near resonant or quasi-resonant frequencies poses significant challenges, as iterative solvers typically suffer from severely degraded convergence. We analyze the convergence behavior of GMRES applied to linear systems arising from such configurations. Theoretical convergence estimates are derived based on harmonic Ritz values, highlighting their proximity to small eigenvalues as a key determining factor. We further examine deflation strategies and their interplay with preconditioning techniques, using the Complex Shifted Laplacian preconditioner as a case study. Numerical experiments on resonant and quasi-resonant test cases validate the theoretical framework and demonstrate the effectiveness of deflation strategies. This study provides new insights and practical guidance for analyzing and improving iterative solvers for time-harmonic problems near resonances.

ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2505.20076v4 Announce Type: replace Abstract: Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation, and are often tied to a particular level of granularity along the local-to-global spectrum. This leads to explanations that lack a unified view and may miss key interactions. We present ExPLAIND, a theoretically grounded, unified framework that integrates model components, data, and training trajectory while supporting explanations across granularities. We generalize recent work on gradient path kernels, reformulating models trained by AdamW as kernel machines. From the resulting kernel feature maps, we derive novel parameter-wise and step-wise influence scores. We empirically validate the resulting decomposition of model behavior in several settings and apply ExPLAIND to two case studies. Our findings on a Transformer exhibiting Grokking support previously proposed learning phases, while refining the final phase as one in which outer layers align around a representation pipeline learned after memorization. For EuroLLM pretraining, ExPLAIND reveals a two-phase dynamic, with the first characterized by outer-layer MLP learning and the second by increased relative influence of intermediate attention layers. These results establish ExPLAIND as a unified framework for interpreting model behavior and training dynamics.

LLM-ODDR: A Large Language Model Framework for Joint Order Dispatching and Driver Repositioning

Tengfei Lyu, Siyuan Feng, Hao Liu, Hai Yang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2505.22695v2 Announce Type: replace Abstract: Ride-hailing platforms face significant challenges in optimizing order dispatching and driver repositioning operations in dynamic urban environments. Traditional approaches based on combinatorial optimization, rule-based heuristics, and reinforcement learning often overlook driver income fairness, interpretability, and adaptability to real-world dynamics. To address these gaps, we propose LLM-ODDR, a novel framework leveraging Large Language Models (LLMs) for joint Order Dispatching and Driver Repositioning (ODDR) in ride-hailing services. The LLM-ODDR framework comprises three key components: (1) Multi-objective-guided Order Value Refinement, which evaluates orders by considering multiple objectives to determine their overall value; (2) Fairness-aware Order Dispatching, which balances platform revenue with driver income fairness; and (3) Spatiotemporal Demand-Aware Driver Repositioning, which optimizes idle vehicle placement based on historical patterns and projected supply. We also develop JointDR-GPT, a fine-tuned model optimized for ODDR tasks with domain knowledge. Extensive experiments on real-world datasets from Manhattan taxi operations demonstrate that our framework significantly outperforms traditional methods in terms of effectiveness, adaptability to anomalous conditions, and decision interpretability. To our knowledge, this is the first exploration of LLMs as decision-making agents in ride-hailing ODDR tasks, establishing foundational insights for integrating advanced language models within intelligent transportation systems. While the current framework incurs higher computational costs than traditional methods, we show that parallel decomposition and model distillation can reduce latency to production-viable levels for deployment.

RAGPPI: RAG Benchmark for Protein-Protein Interactions in Drug Discovery

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2505.23823v2 Announce Type: replace Abstract: Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as the type of QA and the source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that incorporates expert labeling characteristics, average fact-abstract similarity (F1), and low-similarity fact counts (F2), enabling the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.

ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

Hosu Lee, Junho Kim, Hyunjun Kim, Yong Man Ro — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2506.01274v2 Announce Type: replace Abstract: Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet their video understanding ability remains constrained by suboptimal frame selection strategies, despite the rapid development of video-specialized LMMs. Prior works attempted to solve this with static heuristics or external retrieval modules that feed frame-level information, but these approaches often fail to capture visual cues grounded in the given user queries, conflating raw visual dynamics with true semantic relevance. In this paper, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), the first framework to integrate online policy-gradient reinforcement learning into frame-level optimization for video-LLMs. ReFoCUS aims to learn a frame selection policy, leveraging reward signals derived from reference models to capture their underlying scoring behavior over frame combinations that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive and query-conditional selection architecture that ensures contextual consistency while reducing complexity. Our policy learning removes the need for explicit frame-level supervision, as it implicitly discovers optimal and semantically consistent frame compositions. ReFoCUS consistently improves reasoning accuracy across multiple video QA benchmarks, demonstrating the advantage of aligning frame selection with model-internal utility.

Deterministic-Allocation and Anonymous Joint Advertising in E-commerce Platforms

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2506.02435v4 Announce Type: replace Abstract: With the advancement of machine learning, an increasing number of studies are employing automated mechanism design (AMD) methods for optimal auction design. However, all previous AMD architectures designed to generate optimal mechanisms that satisfy near dominant strategy incentive compatibility (DSIC) fail to achieve deterministic allocation, and some also lack anonymity, thereby impacting the efficiency and fairness of advertising allocation. This has resulted in a notable discrepancy between the previous AMD architectures for generating near-DSIC optimal mechanisms and the demands of real-world advertising scenarios. In this paper, we prove that in all online advertising scenarios, when all ad slots must be allocated, previous non-deterministic allocation AMD methods lead to the non-existence of feasible solutions in the vast majority of cases, resulting in a gap between the rounded solution and the optimal solution. Furthermore, we propose JTransNet, a transformer-based neural network architecture, designed for optimal deterministic-allocation and anonymous joint auction design. Although the deterministic allocation module in JTransNet is designed for the latest joint auction scenarios, it can be applied to other non-deterministic AMD architectures with minor modifications. Additionally, our offline and online data experiments demonstrate that, in joint auction scenarios, JTransNet significantly outperforms the considered baselines in terms of platform revenue.

The Parametrised Complexity of Counting Small Sub-Hypergraphs

Marco Bressan, Julian Brinkmann, Holger Dell, Marc Roth, Philip Wellnitz — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2506.14081v3 Announce Type: replace Abstract: Subgraph counting is a fundamental and well-studied problem whose computational complexity is well understood. Quite surprisingly, the hypergraph version of subgraph counting has been almost ignored. In this work, we address this gap by investigating the most basic sub-hypergraph counting problem: given a (small) hypergraph $H$ and a (large) hypergraph $G$, compute the number of sub-hypergraphs of $G$ isomorphic to $H$. Formally, for a family $\mathcal{H}$ of hypergraphs, let #Sub($\mathcal{H}$) be the restriction of the problem to $H \in \mathcal{H}$; the induced variant #IndSub($\mathcal{H}$) is defined analogously. Our main contribution is a complete classification of the complexity of these problems. Assuming the Exponential Time Hypothesis, we prove that #Sub($\mathcal{H}$) is fixed-parameter tractable if and only if $\mathcal{H}$ has bounded fractional co-independent edge-cover number, a novel graph parameter we introduce. Moreover, #IndSub($\mathcal{H}$) is fixed-parameter tractable if and only if $\mathcal{H}$ has bounded fractional edge-cover number. Both results subsume pre-existing results for graphs as special cases. We also show that the fixed-parameter tractable cases of #Sub($\mathcal{H}$) and #IndSub($\mathcal{H}$) are unlikely to be in polynomial time, unless respectively #P = P and Graph Isomorphism $\in$ P. This shows a separation with the special case of graphs, where the fixed-parameter tractable cases are known to actually be in polynomial time.
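To make the counting problem concrete, here is a minimal brute-force sketch of #Sub for a single pattern $H$ and host $G$. It is not the algorithm from the paper: it enumerates all injective vertex maps and so runs in roughly $|V(G)|^{|V(H)|}$ time. It assumes a sub-hypergraph is a subset of vertices plus a subset of hyperedges (the non-induced variant), that hyperedges are sets without repetition, and it uses the standard identity that the number of copies equals the number of embeddings divided by the number of automorphisms of $H$. All function names are hypothetical.

```python
from itertools import permutations


def count_embeddings(h_vertices, h_edges, g_vertices, g_edges):
    """Count injective maps V(H) -> V(G) that send every hyperedge of H to a hyperedge of G."""
    g_edge_set = {frozenset(e) for e in g_edges}
    h_vertices = list(h_vertices)
    count = 0
    for image in permutations(list(g_vertices), len(h_vertices)):
        phi = dict(zip(h_vertices, image))
        if all(frozenset(phi[v] for v in e) in g_edge_set for e in h_edges):
            count += 1
    return count


def count_sub_hypergraphs(h_vertices, h_edges, g_vertices, g_edges):
    """Number of sub-hypergraphs of G isomorphic to H: embeddings divided by |Aut(H)|."""
    embeddings = count_embeddings(h_vertices, h_edges, g_vertices, g_edges)
    automorphisms = count_embeddings(h_vertices, h_edges, h_vertices, h_edges)
    return embeddings // automorphisms


# Pattern: a single hyperedge on three vertices.
H_V, H_E = [0, 1, 2], [(0, 1, 2)]
# Host: two overlapping 3-uniform hyperedges on four vertices.
G_V, G_E = [0, 1, 2, 3], [(0, 1, 2), (1, 2, 3)]

copies = count_sub_hypergraphs(H_V, H_E, G_V, G_E)
```

The exhaustive search is exactly the kind of behavior a fixed-parameter tractable algorithm avoids, which is why a structural parameter of $\mathcal{H}$ decides tractability.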

CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing

Dinh-Khoi Vo, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2506.18438v3 Announce Type: replace Abstract: Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects' shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate the interference with the non-desired modified regions during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. CPAM can be seamlessly integrated with multiple diffusion backbones, including SD1.5, SD2.1, and SDXL, demonstrating strong generalization across different model architectures. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques. The source code and data will be publicly released at the project page: https://vdkhoi20.github.io/CPAM

ShowFlow: From Robust Single Concept to Condition-Free Multi-Concept Generation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2506.18493v3 Announce Type: replace Abstract: Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt, without additional conditions like layout boxes or semantic masks, often leads to identity loss and concept omission. In this paper, we introduce ShowFlow, a comprehensive framework designed to tackle these challenges. We propose ShowFlow-S for single-concept image generation, and ShowFlow-M for handling multiple concepts. ShowFlow-S introduces a KronA-WED adapter, which integrates a Kronecker adapter with weight and embedding decomposition, together with a novel Semantic-Aware Attention Regularization (SAR) training objective, to enhance single-concept generation. Building on this foundation, ShowFlow-M directly reuses robust models learned by ShowFlow-S to support multi-concept generation without extra conditions, incorporating Subject-Adaptive Matching Attention (SAMA) and Layout Consistency guidance as plug-and-play modules. Extensive experiments and user studies validate ShowFlow's effectiveness, highlighting its potential in real-world applications like advertising and virtual dressing. Our source code will be publicly available at: https://htrvu.github.io/showflow.

Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation

Jiho Choi, Sang Jun Lee — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2506.21855v3 Announce Type: replace Abstract: In this paper, we propose Periodic-MAE, a self-supervised framework for learning generalizable spatio-temporal representations of periodic physiological signals from unlabeled facial videos. The proposed method leverages a masked autoencoder (MAE), which learns high-dimensional facial representations by reconstructing masked video tokens without relying on remote photoplethysmography (rPPG) specific supervision. To explicitly align representation learning with the characteristics of rPPG, we introduce a periodicity-aware frame masking strategy based on video resampling, enabling the encoder to learn representations that capture quasi-periodic temporal patterns relevant to pulse signal estimation. In addition, physiological bandlimit constraints are integrated into the MAE pre-training framework, exploiting the sparsity of pulse signals in the frequency domain to guide the learned representations toward physiologically meaningful patterns. After pre-training, the learned representations are transferred to downstream rPPG estimation, where the encoder serves as a generic feature extractor for recovering pulse-related signals from facial videos. We conduct extensive experiments on four benchmark datasets, including PURE, UBFC-rPPG, MMPD, and V4V. Moreover, we evaluate the proposed approach on a real-world rPPG dataset collected under unconstrained lighting conditions and subject motion. Experimental results demonstrate that Periodic-MAE consistently improves rPPG estimation performance, particularly in challenging cross-dataset and real-world evaluation settings. Our code is available at https://github.com/ziiho08/Periodic-MAE.

Memory as a Service (MaaS): Purpose-Bound Memory Mediation for Cooperative Agents

Haichang Li — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2506.22815v2 Announce Type: replace Abstract: Agentic programming is code-centered, while its useful memory context extends beyond code. A programming agent may draw on memory from test, review, build, and release agents; design, product, security, operations, and compliance agents; meeting, finance, calendar, and workflow agents; personal agents; and agents acting for other people. These memories can help agents optimize, debug, test, and evaluate software, while carrying different owners, purposes, recipients, and disclosure boundaries. We propose Memory as a Service (MaaS) as purpose-bound memory mediation: each invocation is evaluated by owner, requester, recipient, task, and declared purpose, and the mediator chooses whether to withhold, abstract, or reveal each candidate item. We formalize this by separating cooperative utility, disclosure leakage, and purpose-bound authorization, then ground the position with diagnostic stress tests on MAGPIE. Relevance-based retrieval reaches AUROC $0.570$ and leaks $53.0\%$ of private items; contextual-integrity prompting reduces leakage by $21.8$ percentage points while leaving $32.6\%$ residual leakage; and $4.5\%$ of private items contain explicit safe-hint abstractions. These probes motivate memory governance as a separate design problem for cooperative programming agents.
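The withhold / abstract / reveal decision can be pictured with a small sketch. Everything below is a hypothetical toy policy written for illustration, not the formalization from the paper: the `MemoryItem` and `Invocation` fields, the `safe_hint` attribute, and the rule ordering in `mediate` are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    WITHHOLD = "withhold"
    ABSTRACT = "abstract"
    REVEAL = "reveal"


@dataclass(frozen=True)
class MemoryItem:
    owner: str
    content: str
    allowed_purposes: frozenset      # purposes the owner has authorized
    allowed_recipients: frozenset    # recipients who may see the raw content
    safe_hint: Optional[str] = None  # optional abstraction that is safe to share


@dataclass(frozen=True)
class Invocation:
    requester: str
    recipient: str
    task: str
    purpose: str


def mediate(item, inv):
    """Toy purpose-bound policy: returns (decision, disclosed_text_or_None)."""
    # An undeclared or unauthorized purpose never sees the item.
    if inv.purpose not in item.allowed_purposes:
        return Decision.WITHHOLD, None
    # Authorized purpose and authorized recipient: reveal the raw content.
    if inv.recipient == item.owner or inv.recipient in item.allowed_recipients:
        return Decision.REVEAL, item.content
    # Authorized purpose but unauthorized recipient: share only a safe abstraction.
    if item.safe_hint is not None:
        return Decision.ABSTRACT, item.safe_hint
    return Decision.WITHHOLD, None


item = MemoryItem(
    owner="alice",
    content="Prod DB password rotated on the release branch",
    allowed_purposes=frozenset({"debugging"}),
    allowed_recipients=frozenset({"build-agent"}),
    safe_hint="A credential change may affect the release branch",
)
decision, text = mediate(item, Invocation("code-agent", "review-agent", "fix CI", "debugging"))
```

In this toy setup the review agent receives only the safe hint, while the same request with an unrelated purpose would be withheld entirely.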

How Reliable are Fairness Audits with Unreliable Data?

Yash Vardhan Tomar — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2506.23033v3 Announce Type: replace Abstract: Fairness audits are a key component of responsible machine-learning deployment. Yet, audit-recommendation reliability under incomplete protected-label access is still poorly understood. In this work, we focused on protected-label missingness in fairness mitigation audits. We introduced a seed-calibrated stress test to separate missingness effects from seed-to-seed movement already present under complete labels. Across ACS/Folktables tasks, missingness settings that retain some protected labels usually do not move selected mitigation methods beyond a complete-label seed-to-seed baseline. At $0\%$ protected-label access, candidates collapse to an empirical-risk-minimization baseline and deterministic tie-breaking rather than revealing a broad missingness effect. We also found that threshold optimization can turn fairness gains on a single protected axis into intersectional harm above a seed baseline, and this threshold-optimizer finding persists under random-forest validation. Overall, our results highlight that protected-label missingness should be reported with seed-null calibration, candidate-set context, and intersectional consequences before it is treated as evidence of audit fragility.

New algorithms for girth and cycle detection

Liam Roditty, Plia Trabelsi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2507.02061v3 Announce Type: replace Abstract: Let $G=(V,E)$ be an unweighted undirected graph with $n$ vertices and $m$ edges. Let $g$ be the girth of $G$, that is, the length of a shortest cycle in $G$. We present a randomized algorithm with a running time of $\tilde{O}\big(\ell \cdot n^{1 + \frac{1}{\ell - \varepsilon}}\big)$ that returns a cycle of length at most $2\ell \left\lceil \frac{g}{2} \right\rceil - 2 \left\lfloor \varepsilon \left\lceil \frac{g}{2} \right\rceil \right\rfloor$, where $\ell \geq 2$ is an integer and $\varepsilon \in [0,1]$, for every graph with $g = \mathrm{polylog}(n)$. Our algorithm generalizes an algorithm of Kadria et al. [SODA'22] that computes a cycle of length at most $4\left\lceil \frac{g}{2} \right\rceil - 2\left\lfloor \varepsilon \left\lceil \frac{g}{2} \right\rceil \right\rfloor$ in $\tilde{O}\big(n^{1 + \frac{1}{2 - \varepsilon}}\big)$ time. Kadria et al. also presented an algorithm that finds a cycle of length at most $2\ell \left\lceil \frac{g}{2} \right\rceil$ in $\tilde{O}\big(n^{1 + \frac{1}{\ell}}\big)$ time, where $\ell$ must be an integer. Our algorithm generalizes this algorithm as well, by replacing the integer parameter $\ell$ in the running time exponent with a real-valued parameter $\ell - \varepsilon$, thereby offering greater flexibility in parameter selection and enabling a broader spectrum of combinations between running times and cycle lengths. We also show that for sparse graphs a better tradeoff is possible, by presenting an $\tilde{O}(\ell\cdot m^{1+1/(\ell-\varepsilon)})$ time randomized algorithm that returns a cycle of length at most $2\ell(\lfloor \frac{g-1}{2}\rfloor) - 2(\lfloor \varepsilon \lfloor \frac{g-1}{2}\rfloor \rfloor+1)$, where $\ell\geq 3$ is an integer and $\varepsilon\in [0,1)$, for every graph with $g=\mathrm{polylog}(n)$. To obtain our algorithms we develop several techniques and introduce a formal definition of hybrid cycle detection algorithms. [...]
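For reference, the exact girth $g$ that these approximation bounds are stated against can be computed with the classical approach of running a breadth-first search from every vertex, in $O(nm)$ time. The sketch below shows that baseline only; it is not the randomized algorithm from the paper, and it assumes a simple graph (no self-loops or parallel edges) with vertices numbered from 0.

```python
from collections import deque


def girth(n, edges):
    """Exact girth of a simple undirected graph; returns float('inf') if acyclic."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    best = float("inf")
    for source in range(n):
        dist = [-1] * n
        parent = [-1] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] == -1:
                    # Tree edge: first time v is reached from this source.
                    dist[v] = dist[u] + 1
                    parent[v] = u
                    queue.append(v)
                elif parent[u] != v:
                    # Non-tree edge: closes a walk of this length containing a cycle.
                    best = min(best, dist[u] + dist[v] + 1)
    return best


# A 5-cycle with one chord (0, 2): the chord creates a triangle 0-1-2.
chorded_cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
g = girth(5, chorded_cycle)
```

Each candidate length found is an upper bound on the girth, and the search started from any vertex on a shortest cycle attains it exactly, so the minimum over all sources is the girth.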

PlaceRep: Geospatial Place Representation Learning from Large-Scale Point-of-Interest Data

Mohammad Hashemi, Hossein Amiri, Andreas Zufle — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2507.02921v4 Announce Type: replace Abstract: Learning effective representations of urban environments requires capturing spatial structure beyond fixed administrative boundaries. Existing geospatial representation learning approaches typically aggregate Points of Interest (POIs) into pre-defined administrative regions such as census units or ZIP code areas, assigning a single embedding to each region. However, POIs often form semantically meaningful groups that extend across, within, or beyond these boundaries, defining places that better reflect human activity and urban function. To address this limitation, we propose PlaceRep, a geospatial representation learning method that constructs place-level representations by clustering spatially and semantically related POIs. PlaceRep summarizes large-scale POI graphs from U.S. Foursquare data to produce general-purpose urban region embeddings while automatically identifying places across multiple spatial scales. By eliminating model pre-training, PlaceRep provides a scalable and efficient solution for multi-granular geospatial analysis. Experiments on two downstream tasks, population density estimation and housing price prediction, show that PlaceRep outperforms most state-of-the-art graph-based geospatial representation learning methods and achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation of PlaceRep is available at https://github.com/mohammadhashemii/PlaceRep.

Single vs. Multiple Branches in DeepONet and S-DeepONet: Network Architecture Follows Coupling in Multiphysics Systems

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2507.03660v2 Announce Type: replace Abstract: Real-time prediction of complex physical systems requires surrogate models that learn from data while representing strong multiphysics coupling. Deep Operator Networks have shown success in single-physics problems, yet their effectiveness in capturing nonlinear interactions in coupled systems (such as thermo-mechanical or electro-thermal coupling) remains underexplored. Here we pose a practical question: should the architecture of a neural operator reflect the strength of physical coupling it aims to model? We compare single-branch and multi-branch designs, in both feedforward and sequential recurrent forms, across three representative systems: a reaction-diffusion problem with heterogeneous sources, a nonlinear thermo-electrical problem with temperature-dependent conductivity and Joule heating, and a viscoplastic thermo-mechanical model of steel solidification. Single-branch networks consistently outperform multi-branch variants in tightly coupled regimes by encouraging shared latent representations, whereas multi-branch designs remain favorable for decoupled or single-physics tasks. Once trained, these surrogates deliver full-field predictions up to $1.8 \times 10^4$ times faster than physics-based solvers.

Meta-Learning Transformers to Improve In-Context Generalization

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2507.05019v2 Announce Type: replace Abstract: In-context learning enables transformer models to generalize to new tasks based solely on input prompts, without any need for weight updates. However, existing training paradigms typically rely on large, unstructured datasets that are costly to store, difficult to evaluate for quality and balance, and pose privacy and ethical concerns due to the inclusion of sensitive information. Motivated by these limitations and risks, we propose an alternative training strategy where we leverage a collection of multiple, small-scale, and domain-specific datasets. We empirically demonstrate that the increased quality and diversity of such data improve the generalization abilities of in-context learners beyond their training domain, while achieving comparable performance with models trained on a single large-scale dataset. We investigate this paradigm by leveraging meta-learning to train an in-context learner on the Meta-Album collection under several settings. Firstly, we show the performance in a controlled environment, where the test domain is completely excluded from the training knowledge. Secondly, we explore the robustness of these models to forgetting in a continual scenario where the information is accessible for a limited time. Finally, we explore the more challenging unsupervised scenario. Our findings demonstrate that transformers still generalize for in-context prediction when trained on a curated dataset collection while offering advantages in modularity and replaceability.

LISTEN: Lightweight Industrial Sound-representable Transformer for Edge Notification

Changheon Han, Yun Seok Kang, Yuseop Sim, Hyung Wook Park, Martin Byung-Guk Jun — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2507.07879v3 Announce Type: replace Abstract: Deep learning-based machine listening is broadening the scope of industrial acoustic analysis, yet its widespread implementation on live shop floors is hindered by the reliance on large, task-specific annotated datasets for every new task. While emerging general-purpose sound foundation models aim to alleviate data dependency, they reveal critical dilemmas in practice. General-purpose sound foundation models are computationally expensive and fail in industrial scenarios characterized by tonal harmonics, broadband noise, and transient fault events, making instant, on-site deployment impractical. These challenges combined mean that a practical, end-to-end system for deploying a sound foundation model on a live shop floor has remained elusive. To address this challenge, this study introduces LISTEN (Lightweight Industrial Sound-representable Transformer for Edge Notification), the first lightweight foundation model specialized for industrial sound. Through Knowledge Distillation (KD) from the large-scale teacher model IMPACT (Industrial Machine Perception via Acoustic Cognitive Transformer), we construct LISTEN optimized for resource-constrained edge environments. By freezing the backbone and training only a shallow head on minimal target-process data, rather than performing full fine-tuning or retraining, LISTEN achieves nearly identical performance to IMPACT across diverse manufacturing processes. This study further demonstrates a complete system for real-time machine monitoring, encompassing data acquisition with Industrial Internet of Things (IIoT) devices, rapid model adaptation using minimal annotated data, and real-time monitoring on a low-cost edge device. By validating the entire system on a live CNC machine, this work establishes the first feasible end-to-end system for deploying a lightweight industrial sound foundation model in an active industrial environment.

Reconstructing Template-Memorized Images from Natural Prompts

Sol Yarkoni, Mahmood Sharif, Roi Livni — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2507.07947v4 Announce Type: replace Abstract: Recent advances in generative models, such as diffusion models, have raised concerns related to privacy, copyright infringement, and data stewardship. To better understand and control these risks, prior work has introduced techniques and attacks that reconstruct images, or parts of images, from training data. While these results demonstrate that training data can be recovered, existing methods often rely on high computational resources, partial access to the training set, or carefully engineered prompts. In this work, we present a new attack that requires low resources, assumes little to no access to the training data, and identifies seemingly benign prompts that can lead to potentially risky image reconstruction. We further show that such reconstructions may occur unintentionally, even for users without specialized knowledge. For example, we observe that for one existing model, the prompt "blue Unisex T-Shirt" generates the face of a real individual. Moreover, by combining the identified vulnerabilities with real-world prompt data, we discover prompts that reproduce memorized visual elements. Our approach builds on insights from prior work and leverages domain knowledge to expose a fundamental vulnerability arising from the use of scraped e-commerce data, where templated layouts and images are closely tied to pattern-like textual prompts. The code for our attack is publicly available at https://github.com/TheSolY/lr-tmi.

One Token to Fool LLM-as-a-Judge

Yulai Zhao, Haolin Liu, Dian Yu, Sunyuan Kung, Meijia Chen, Haitao Mi, Dong Yu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2507.08794v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings like Reinforcement Learning with Verifiable Rewards (RLVR). However, we uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking. We find that superficial inputs, which we term "master keys", such as non-word symbols (e.g., ":" or ".") or generic reasoning openers (e.g., "Thought process:" or "Let's solve this problem step by step."), can consistently elicit false positive rewards without any substantive reasoning. Our systematic evaluation demonstrates this is a widespread failure affecting a diverse range of models, including leading proprietary systems such as GPT-o1 and Claude-4. These results challenge the assumed robustness of LLM judges and pose a significant threat to their reliability. To address this, we propose a simple yet effective data augmentation strategy using truncated model outputs as adversarial negative examples. The resulting Master Reward Models (Master-RMs) demonstrate state-of-the-art robustness against these "master key" attacks while maintaining high performance in standard evaluation settings. We supplement these findings with a comprehensive analysis of the vulnerability across model scales, prompt variations, and common inference-time strategies, offering insights to guide future research on robust LLM evaluation. We release our robust, general-domain reward models and the synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
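The augmentation idea, using truncated model outputs as adversarial negatives, might look roughly like the sketch below. The abstract does not say how outputs are truncated, so cutting at the first sentence is an assumption here, as are the record layout (`question`, `reference`, `response`, `label`) and the helper names.

```python
import re


def first_sentence(text):
    """Keep only the opening sentence (or first line) of a model response."""
    # Split once at whitespace that follows ., !, ? or :, or at the first newline.
    parts = re.split(r"(?<=[.!?:])\s+|\n", text.strip(), maxsplit=1)
    return parts[0]


def make_truncated_negative(example):
    """Build an adversarial negative: same question and reference, but the
    response is cut down to a content-free opener and labeled incorrect (0)."""
    return {
        "question": example["question"],
        "reference": example["reference"],
        "response": first_sentence(example["response"]),
        "label": 0,
    }


# Hypothetical positive training record for a generative reward model.
positive = {
    "question": "What is 2 + 2?",
    "reference": "4",
    "response": "Let's solve this problem step by step. Adding 2 and 2 gives 4. The answer is 4.",
    "label": 1,
}
negative = make_truncated_negative(positive)
```

Training the judge on such pairs teaches it that a reasoning opener alone carries no evidence of a correct answer, which is the failure mode the "master key" inputs exploit.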

Emergence of Hierarchical Emotion Organization in Large Language Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2507.10599v2 Announce Type: replace Abstract: As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels, i.e., a psychological framework that argues emotions organize hierarchically, we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.

Towards Tight Robust Coresets for $k$-Medians Clustering

Lingxiao Huang, Zhenyu Jiang, Yi Li, Xuan Wu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2507.11260v2 Announce Type: replace Abstract: This paper considers coresets for the robust $k$-medians problem with $m$ outliers, and new constructions in various metric spaces are obtained. Specifically, for metric spaces with a bounded VC or doubling dimension $d$, the coreset size is $O(m) + \tilde{O}(kd\varepsilon^{-2})$, which is optimal up to logarithmic factors. For Euclidean spaces, the coreset size is $O(m\varepsilon^{-1}) + \tilde{O}(\min\{k^{4/3}\varepsilon^{-2}, k\varepsilon^{-3}\})$, improving upon a recent result by Jiang and Lou (ICALP 2025). These results also extend to robust $(k,z)$-clustering, yielding, for VC and doubling dimension, a coreset size of $O(m) + \tilde{O}(kd\varepsilon^{-2z})$ with the optimal linear dependence on $m$. This extended result improves upon the earlier work of Huang et al. (SODA 2025). The techniques introduce novel dataset decompositions, enabling chaining arguments to be applied jointly across multiple components.

From Benchmarks to Skills: Low-Rank Factors for LLM Evaluation

Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2507.20208v2 Announce Type: replace Abstract: Current evaluations of large language models (LLMs) rely heavily on a growing collection of benchmarks and on aggregate benchmark scores, yet it remains unclear what this comparison actually captures, and what these scores reveal about models' underlying capabilities. Here, we propose a new paradigm for LLM evaluation, by asking whether benchmark performance reflects many independent abilities, or rather relies on a small number of shared dimensions. To answer this, we apply Factor Analysis (FA) to a massive performance matrix of LLMs versus benchmarks $(60\times44)$, revealing an intrinsically low-rank structure of that matrix. That is, a small number of latent factors captures most of the structure in the full task space. This low-rank geometry reveals substantial redundancy across existing tasks and explains why many benchmarks appear to be measuring overlapping abilities. We further show that these latent factors correspond to coherent, skill-like dimensions of LLM behavior. Leveraging this latent skill-space, we deliver three practical tools for LLM evaluation and downstream users: (i) identifying redundant tasks, (ii) profiling new models using a small subset of tasks, and (iii) selecting models aligned with desired skill profiles. Our method provides a solid alternative to the de facto standard of a single aggregate score, and establishes an interpretable and practical framework for understanding and benchmarking LLM core capabilities.
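
To make "intrinsically low-rank" concrete, the sketch below measures how much of the variance in a models-by-benchmarks score matrix is captured by a single latent direction. It is a stand-in under stated assumptions: it uses a PCA-style power iteration rather than the Factor Analysis applied in the paper, and the $60\times44$ matrix is synthetic (one hidden skill per model plus noise), not the paper's data.

```python
import random

def top_component_share(matrix, iters=200):
    """Fraction of total variance captured by the leading principal component."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Center each benchmark (column) so variance is measured around its mean.
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    x = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    # Gram matrix over benchmarks (n_cols x n_cols).
    gram = [[sum(x[i][a] * x[i][b] for i in range(n_rows)) for b in range(n_cols)]
            for a in range(n_cols)]
    total = sum(gram[a][a] for a in range(n_cols))
    # Power iteration for the largest eigenvalue of the Gram matrix.
    v = [1.0] * n_cols
    for _ in range(iters):
        w = [sum(gram[a][b] * v[b] for b in range(n_cols)) for a in range(n_cols)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    top = sum(v[a] * sum(gram[a][b] * v[b] for b in range(n_cols))
              for a in range(n_cols))
    return top / total

random.seed(0)
# Synthetic scores: one hidden skill per model, one loading per benchmark, small noise.
skill = [random.gauss(0, 1) for _ in range(60)]
loading = [random.uniform(0.5, 1.5) for _ in range(44)]
scores = [[s * l + random.gauss(0, 0.1) for l in loading] for s in skill]
share = top_component_share(scores)
```

When `share` is close to 1, most benchmarks are re-measuring the same underlying dimension, which is the redundancy the abstract describes.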

From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

Honglin He, Yukai Ma, Brad Squicciarini, Wayne Wu, Bolei Zhou — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2507.22028v2 Announce Type: replace Abstract: Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, which are trained solely on offline data, often lack the capacity to reason about the consequences of their actions or adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) learning framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pretraining on offline videos and post-training through reinforcement learning. It maintains the model's generalizability acquired from large-scale real-world videos while enhancing its interactivity through reinforcement learning in simulation environments. Specifically, we introduce two innovations: (1) an Anchor-Guided Distribution Matching strategy for offline pretraining, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and (2) a Residual-Attention Module for reinforcement learning, which obtains reactive behaviors from simulation environments without erasing the model's pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models.

Modality-Aware Feature Matching in Visual and Vision-Language Applications: A Comprehensive Survey

Weide Liu, Wei Zhou, Jun Liu, Ping Hu, Jun Cheng, Jungong Han, Weisi Lin — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2507.22791v2 Announce Type: replace Abstract: Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, exploring traditional handcrafted methods and emphasizing contemporary deep learning approaches across various modalities, including RGB images, depth images, 3D point clouds, LiDAR scans, medical images, and vision-language interactions. Traditional methods, leveraging detectors like Harris corners and descriptors such as SIFT and ORB, demonstrate robustness under moderate intra-modality variations but struggle with significant modality gaps. Contemporary deep learning-based methods, exemplified by detector-free strategies like CNN-based SuperPoint and transformer-based LoFTR, substantially improve robustness and adaptability across modalities. We highlight modality-aware advancements, such as geometric and depth-specific descriptors for depth images, sparse and dense learning methods for 3D point clouds, attention-enhanced neural networks for LiDAR scans, and specialized solutions like the MIND descriptor for complex medical image matching. Cross-modal applications, particularly in medical image registration and vision-language tasks, underscore the evolution of feature matching to handle increasingly diverse data interactions.

Authorship Attribution in Multilingual Machine-Generated Texts

Lucio La Cava, Dominik Macko, Róbert Móro, Ivan Srba, Andrea Tagarelli — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2508.01656v2 Announce Type: replace Abstract: As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content becomes increasingly difficult. While early efforts in MGT detection have focused on binary classification, the growing landscape and diversity of LLMs require a more fine-grained yet challenging authorship attribution (AA), i.e., being able to identify the precise generator (LLM or human) behind a text. However, AA currently remains confined to monolingual settings, with English the most investigated language, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to human or multiple LLM generators across diverse languages. Focusing on 18 languages -- covering multiple families and writing scripts -- and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods in terms of their cross-lingual transferability, and the impact of generators on attribution performance. Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly in transferring across diverse language families, underscoring the complexity of multilingual AA and the need for more robust approaches to better match real-world scenarios.

The KG-ER Conceptual Schema Language

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2508.02548v3 Announce Type: replace Abstract: We propose KG-ER, a conceptual schema language for knowledge graphs that describes the structure of knowledge graphs independently of their representation (relational databases, property graphs, RDF) while helping to capture the semantics of the information stored in a knowledge graph.

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

Md Raisul Kibria, Sébastien Lafond, Janan Arslan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2508.04427v2 Announce Type: replace Abstract: Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that most studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. To address these gaps, we not only synthesize findings from the surveyed works but also incorporate a complementary analysis that integrates recent and emerging advances driving multimodal explainability. Based on these insights, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.

Retrieval-Augmented Foundation Models for Water Level Prediction in the Everglades

Rahuul Rangaraj, Jimeng Shi, Rajendra Paudel, Giri Narasimhan, Yanzhao Wu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2508.04888v2 Announce Type: replace Abstract: Accurate water level forecasting in the Everglades is essential for flood mitigation, drought management, water resource planning, and biodiversity conservation. While recent time-series foundation models have shown strong performance on generic tasks (represented in their pre-training), their effectiveness in domain-specific applications remains insufficiently understood. In this work, we curate a domain-specific dataset for water-level forecasting in the Everglades and observe that the performance of current state-of-the-art models remains limited. To address this gap, we leverage a retrieval-augmented mechanism that retrieves analogous multivariate hydrological episodes from an external archive of historical observations to enrich the input context of those pre-trained models. We study two retrieval strategies, statistical similarity-based retrieval and mutual information-based retrieval, and analyze how incorporating retrieved historical contexts affects predictive performance. Extensive experiments show that retrieval augmentation consistently improves long-horizon water level forecasts and yields disproportionately larger gains during extreme events, which is particularly critical for environmental decision-making. Our study provides empirical evidence that analog-based retrieval can benefit pretrained time-series foundation models in environmental science, offering practical insights into their strengths, limitations, and failure modes when applied to hydrological forecasting in the Everglades. Although evaluated in the Everglades, the proposed framework is general and can be applied to other hydrological systems given time series data. The code and data have been made publicly available at https://github.com/rahuul2992000/WaterRAF.
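
The retrieval step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it handles a single variable rather than multivariate episodes, and it assumes Euclidean distance between z-normalized windows as the similarity statistic, which the abstract does not specify. The function names and the toy series are hypothetical.

```python
def znorm(window):
    """Standardize a window to zero mean and unit variance (constant windows map to zeros)."""
    mean = sum(window) / len(window)
    std = (sum((v - mean) ** 2 for v in window) / len(window)) ** 0.5
    return [(v - mean) / std if std > 0 else 0.0 for v in window]

def retrieve_analogs(query, archive, k=1):
    """Return start indices of the k archive windows most similar in shape to the query."""
    m = len(query)
    q = znorm(query)
    scored = []
    for start in range(len(archive) - m + 1):
        cand = znorm(archive[start:start + m])
        dist = sum((a - b) ** 2 for a, b in zip(q, cand)) ** 0.5
        scored.append((dist, start))
    scored.sort()
    return [start for _, start in scored[:k]]

# Toy historical water levels and a recent window with a similar rise and recession.
archive = [1.0, 1.1, 1.0, 1.2, 1.8, 2.6, 3.1, 2.4, 1.5, 1.1, 1.0, 1.0]
query = [2.0, 2.6, 3.4, 3.9, 3.2]
best = retrieve_analogs(query, archive, k=1)
# Enriched input context: the retrieved analog episode followed by the recent window.
context = archive[best[0]:best[0] + len(query)] + query
```

Normalizing each window lets the search match episodes by shape even when the absolute water level differs, as it does between the query and its analog here.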

Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2508.12681v3 Announce Type: replace Abstract: Dynamic control of soft continuum robots (SCRs) holds great potential for expanding their applications, but remains a challenging problem due to the high computational demands of accurate dynamic models. While data-driven approaches like Koopman-operator-based methods have been proposed, they typically lack adaptability and cannot reconstruct the full robot shape, limiting their applicability. This work introduces a real-time-capable nonlinear model-predictive control (MPC) framework for SCRs based on a domain-decoupled physics-informed neural network (DD-PINN) with adaptable bending stiffness. The DD-PINN serves as a surrogate for the dynamic Cosserat rod model with a speed-up factor of up to 44,000. It is also used within an unscented Kalman filter for estimating the model states and bending compliance from end-effector position measurements. We implement a nonlinear evolutionary MPC running at 70 Hz on the GPU. In simulation, it demonstrates accurate tracking of dynamic trajectories and setpoint control with end-effector position errors below 3 mm (2.3\% of the actuator's length). In real-world experiments, the controller achieves similar accuracy and accelerations up to 3.55 m/s$^2$.

The Urysohn Machine: A Metric-Topological Model of Computation

Xin Li — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2508.14143v2 Announce Type: replace Abstract: We introduce the Urysohn Machine, an effective model of classification-oriented computation in which metric separation, frontier structure, and contraction are explicit parts of the computational state. Its basic object is a Urysohn Triple: a support region, a target partition, and a separating classifier stored in a reusable Metric Library. The topological foundation is a constructive Urysohn Realization theorem for finite simplicial settings. It builds separators from dyadic ladders of nested polyhedral regions and equips their frontiers with a chain-level calculus: frontiers are cycles, and shells between levels have boundaries given by differences of frontiers. This construction yields two related complexity measures: decision-boundary width, the geometric measure of a single classifier's boundary, and Urysohn width, the total frontier mass represented by a library or realization. We prove an Amortized Separation Theorem showing that approximating a boundary of a given width to a given accuracy requires a number of simple basis triples proportional to boundary width and inversely proportional to resolution, under explicit boundary-footprint assumptions. We also introduce a contrastive separation operator whose graph-cut functional consistently estimates decision-boundary width from sampled metric data, while its Laplacian spectrum certifies class-component structure and conductance. Finally, we analyze the dynamic Urysohn ladder and prove four guarantees: separability under quotient collapse, stability of committed frontiers, bounded capacity under contraction, and scalability with quotient distance. Together, these results give a metric-topological account of classification complexity, amortized inference, and compositional reuse that preserves classical computability while exposing geometric structure hidden by purely symbolic descriptions.

Enhancing Energy and Spectral Efficiency in IoT-Cellular Networks via Active SIM-Equipped LEO Satellites

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2508.17149v2 Announce Type: replace Abstract: This paper investigates a low Earth orbit (LEO) satellite communication system enhanced by an active stacked intelligent metasurface (ASIM), mounted on the backplate of the satellite solar panels to efficiently utilize limited onboard space and reduce the main satellite power amplifier requirements. The system serves multiple ground users via rate-splitting multiple access (RSMA) and IoT devices through a symbiotic radio network. Multi-layer sequential processing in the ASIM improves effective channel gains and suppresses inter-user interference, outperforming active RIS and beyond-diagonal RIS designs. Three optimization approaches are evaluated: block coordinate descent with successive convex approximation (BCD-SCA), model-assisted multi-agent constraint soft actor-critic (MA-CSAC), and multi-constraint proximal policy optimization (MCPPO). Simulation results show that BCD-SCA converges fast and stably in convex scenarios without learning, MCPPO achieves rapid initial convergence with moderate stability, and MA-CSAC attains the highest long-term spectral and energy efficiency in large-scale networks. Energy-spectral efficiency trade-offs are analyzed for different ASIM elements, satellite antennas, and transmit power. Overall, the study demonstrates that integrating multi-layer ASIM with suitable optimization algorithms offers a scalable, energy-efficient, and high-performance solution for next-generation LEO satellite communications.

DiffCoord: Differentiable Coordination for Distributed Multi-Agent Trajectory Optimization

Bingheng Wang, Yichao Gao, Tianchen Sun, Shanker Ajay, Lin Zhao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.01630v3 Announce Type: replace Abstract: Integrating the Alternating Direction Method of Multipliers (ADMM) with Differential Dynamic Programming (DDP) provides a scalable framework for distributed multi-agent trajectory optimization. In practice, ADMM is typically truncated for computational efficiency, tightly coupling parameters that would otherwise separately govern coordination quality and task performance. In this paper, we propose Differentiable Coordination (DiffCoord), a unified framework that jointly meta-learns these coupled parameters for the truncated ADMM-DDP pipeline. These parameters are generated by agent-wise neural networks for task adaptation, and the same networks are shared among isomorphic agents to enable scalability to varying agent counts. We achieve efficient meta-learning by differentiating the ADMM-DDP pipeline end-to-end. Notably, this yields an auxiliary ADMM-LQR distributed gradient solver that computes and coordinates meta-gradients with respect to these parameters. This solver inherits the computational structure of the pipeline, enabling reuse of key computation results and efficient parallelization over agents and along trajectory horizons. We validate DiffCoord through numerical and physical experiments on a cooperative aerial transport system, where it reconfigures quadrotor formations for safe 6-DoF load manipulation in tight spaces. It adapts robustly to varying team sizes and load dynamics, while reducing per-agent gradient computation time by up to 70% compared with state-of-the-art trajectory-gradient methods.
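
The point about truncation can be seen in a toy setting. The sketch below runs scaled-form consensus ADMM on scalar quadratic objectives, with no DDP and no learning, so it is only an illustration of why the penalty parameter matters once the iteration count is capped; the function name and the problem are assumptions, not part of DiffCoord.

```python
def truncated_consensus_admm(targets, rho, iters):
    """Run `iters` steps of scaled-form consensus ADMM for
    min sum_i 0.5 * (x_i - a_i)^2  subject to  x_i = z for all i."""
    n = len(targets)
    z = 0.0
    u = [0.0] * n
    x = [0.0] * n
    for _ in range(iters):
        # Local step: each agent solves its own small problem in closed form.
        x = [(a + rho * (z - ui)) / (1 + rho) for a, ui in zip(targets, u)]
        # Coordination step: average the agents' proposals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z, x

targets = [1.0, 2.0, 6.0]  # each agent's preferred value; the optimum is their mean, 3.0
z_long, _ = truncated_consensus_admm(targets, rho=1.0, iters=60)
z_short_a, _ = truncated_consensus_admm(targets, rho=1.0, iters=3)
z_short_b, _ = truncated_consensus_admm(targets, rho=0.1, iters=3)
```

With many iterations the choice of `rho` hardly affects the answer, but with only three iterations the two settings land at visibly different consensus values. That sensitivity is what motivates tuning or learning such parameters for a truncated pipeline.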

Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.03340v4 Announce Type: replace Abstract: Bifurcation phenomena in nonlinear dynamical systems often lead to multiple coexisting stable solutions, particularly in the presence of symmetry breaking. Deterministic machine learning models are unable to capture this multiplicity, averaging over solutions and failing to represent lower-symmetry outcomes. In this work, we formalize the use of generative AI, specifically flow matching, as a principled way to model the full probability distribution over bifurcation outcomes. Our approach builds on existing techniques by combining flow matching with equivariant architectures and an optimal-transport-based coupling mechanism. We generalize equivariant flow matching to a symmetric coupling strategy that aligns predicted and target outputs under group actions, allowing accurate learning in equivariant settings. We validate our approach on a range of systems, from simple conceptual systems to physical problems such as buckling beams and the Allen--Cahn equation. The results demonstrate that the approach accurately captures multimodal distributions and symmetry-breaking bifurcations. Moreover, our results demonstrate that flow matching significantly outperforms non-probabilistic and variational methods. This offers a principled and scalable solution for modeling multistability in high-dimensional systems.

GetNetUPAM: Ecologically Informed Nested Cross-Validation and Noise-Robust Attention for Marine Bioacoustic Monitoring

Nicholas R. Rasmussen, Rodrigue Rizk, Longwei Wang, KC Santosh — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.04682v2 Announce Type: replace Abstract: Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions and evaluation protocols that expose deployment-relevant failure modes, gaps largely unaddressed in current UPAM practice. Intrinsic noise, variable propagation, and mixed biological and anthropogenic sources induce distribution shifts that conventional models and single-split evaluations obscure, inflating performance and masking instability. We introduce GetNetUPAM, a hierarchical nested cross-validation framework that uses the nested stage to quantify model stability rather than tune for inflated hold-out scores. By partitioning data into site-year blocks, GetNetUPAM preserves ecological heterogeneity and forces each outer fold to represent a distinct environmental regime, preventing overfitting to localized noise or sensor artifacts. Inner stratified folds measure generalization across the full UPAM signal distribution, enforcing strict separation between model development and the outer held-out deployment condition. Using GetNetUPAM, we evaluate the Adaptive Resolution Pooling and Attention Network (ARPA-N), a CNN architecture for irregular spectrogram dimensions. ARPA-N integrates CBAM spatial attention as a learned noise suppressor, producing attention maps that localize true call structure and avoid the global, non-biological cues exploited by standard CNNs on long-window data. Under GetNetUPAM, ARPA-N generalizes robustly across diverse environmental regimes. In the zero-training support Balleny Islands region, it reduces false positives per hour by over an order of magnitude (approximately 10x) at fixed 90 percent recall, yielding consistently improved metrics across folds. These advances provide a reproducible benchmark and move UPAM toward scalable, deployment-reliable ecological monitoring.
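
One way to read the site-year blocking described above is sketched below. It is a minimal interpretation, not the GetNetUPAM code: the record fields `site`, `year`, and `label`, the two-fold inner split, and both function names are hypothetical.

```python
from collections import defaultdict

def site_year_outer_folds(records):
    """Yield (held_out_block, dev_indices, test_indices), one outer fold per site-year block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[(rec["site"], rec["year"])].append(idx)
    for block in sorted(blocks):
        test_idx = blocks[block]
        dev_idx = [i for b, idxs in blocks.items() if b != block for i in idxs]
        yield block, sorted(dev_idx), test_idx

def stratified_inner_folds(indices, labels, n_folds=2):
    """Split development indices into n_folds, dealing each class out round-robin."""
    folds = [[] for _ in range(n_folds)]
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    for cls in sorted(by_class):
        for pos, i in enumerate(by_class[cls]):
            folds[pos % n_folds].append(i)
    return folds

# Hypothetical clip metadata: recording site, year, and call present (1) or absent (0).
records = [
    {"site": "A", "year": 2019, "label": 1}, {"site": "A", "year": 2019, "label": 0},
    {"site": "A", "year": 2020, "label": 1}, {"site": "A", "year": 2020, "label": 0},
    {"site": "B", "year": 2019, "label": 1}, {"site": "B", "year": 2019, "label": 0},
]
labels = [r["label"] for r in records]
outer = list(site_year_outer_folds(records))
inner = stratified_inner_folds(outer[0][1], labels)
```

Each outer fold holds out an entire site-year block, so no clip from the held-out environment reaches model development, while the inner folds keep the class balance within the development data.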

PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design

Andy Xu, Rohan Desai, Larry Wang, Ethan Ritz, Gabriel Hope — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.07150v4 Announce Type: replace Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising approach to improve correctness in LLMs. However, in many scientific problems, the objective is not necessarily to produce the correct answer, but instead to produce a diverse array of candidates that satisfy a set of constraints. We study this challenge in the context of materials generation. To this end, we introduce PLaID++, an LLM post-trained for stable and property-guided crystal generation. We find that performance hinges on our crystallographic representation and reward formulation. First, we introduce a compact, symmetry-informed Wyckoff text representation which improves computational efficiency and encourages generalization from physical priors. Second, we demonstrate that temperature scaling acts as an entropy regularizer which counteracts mode collapse and encourages exploration. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a $\sim$50\% greater rate than prior methods and conditionally generates structures with desired space group properties. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.

Towards Efficient and Secure Cloud-Assisted Autonomous Systems: A Review of Architectures, Algorithms, Security, and Deployment Challenges

Yasir Ali, Tayyab Manzoor, Huan Yang, Asif Ali, Yuanqing Xia — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.09299v2 Announce Type: replace Abstract: Networked Control Systems (NCSs) have been instrumental in realizing fully connected and responsive intelligent environments within the context of real-time virtual control and management. However, traditional NCSs face considerable challenges in handling the vast amounts of data generated by large-scale control applications, particularly in terms of data acquisition, storage, and computational processing. To address these challenges, the emergence of cloud computing and advancements in control theory have empowered the new paradigm known as Cloud Control Systems (CCSs). Recently, CCSs have received substantial attention from industry for their potential capabilities, such as large-scale data management, complex computations, and data-centric optimized decisions. This study presents an extensive review of recent progress in CCSs, spanning studies published between 2012 and 2025. Specifically, the focus is on providing a taxonomy of the current findings in CCS research, encompassing various perspectives, such as its efficient implementations in industrial automation, security and privacy considerations, and cloud-based control techniques. Each category is examined in depth through selected state-of-the-art analyses of different approaches and contrasting methodologies. Furthermore, we discuss future directions aimed at designing more efficient and practical CCSs. The insights gained from this study can help researchers, practitioners, and decision-makers design and deploy CCSs effectively in their domains.

GLIDE: A Coordinated Aerial-Ground Framework for Search and Rescue in Unknown Environments

Seth Farrell, Chenghao Li, Henrik I. Christensen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.14210v4 Announce Type: replace Abstract: We present a cooperative aerial-ground search-and-rescue (SAR) framework that pairs two unmanned aerial vehicles (UAVs) with an unmanned ground vehicle (UGV) to achieve rapid victim localization and obstacle-aware navigation in unknown environments. We dub this framework Guided Long-horizon Integrated Drone Escort (GLIDE), highlighting the UGV's reliance on UAV guidance for long-horizon planning. In our framework, a goal-searching UAV executes real-time onboard victim detection and georeferencing to nominate goals for the ground platform, while a terrain-scouting UAV flies ahead of the UGV's planned route to provide mid-level traversability updates. The UGV fuses aerial cues with local sensing to perform time-efficient A* planning and continuous replanning as information arrives. Additionally, we present a hardware demonstration (using a GEM e6 golf cart as the UGV and two X500 UAVs) to evaluate end-to-end SAR mission performance and include simulation ablations to assess the planning stack in isolation from detection. Empirical results demonstrate that explicit role separation across UAVs, coupled with terrain scouting and guided planning, improves reach time and navigation safety in time-critical SAR missions.
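
The plan-then-replan loop mentioned in this abstract can be sketched with textbook A* on a small occupancy grid. This is not the GLIDE planning stack: the grid, the 4-connected motion model, and the "scout report" that marks cells as blocked are all assumptions made for illustration.

```python
import heapq

def astar(grid, start, goal):
    """4-connected A* on a grid of 0 (free) / 1 (blocked) cells; returns a path or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance, admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > cost[cell]:
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                new_g = g + 1
                if new_g < cost.get(nxt, float("inf")):
                    cost[nxt] = new_g
                    came_from[nxt] = cell
                    heapq.heappush(frontier, (new_g + h(nxt), new_g, nxt))
    return None

grid = [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]
start, goal = (0, 0), (2, 2)
first_plan = astar(grid, start, goal)

# Hypothetical scout report: two cells in the middle row are not traversable.
for r, c in [(1, 0), (1, 1)]:
    grid[r][c] = 1
replanned = astar(grid, start, goal)
```

After the update only one corridor remains, so the replanned route is forced along the top row and down the right column.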

Strategic Analysis of Just-In-Time Liquidity Provision in Concentrated Liquidity Market Makers

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.16157v2 Announce Type: replace Abstract: Liquidity providers (LPs) are essential figures in the operation of automated market makers (AMMs); in exchange for transaction fees, LPs lend the liquidity that allows AMMs to operate. While many prior works have studied the incentive structures of LPs in general, we currently lack a principled understanding of a special class of LPs known as Just-In-Time (JIT) LPs. These are strategic agents who momentarily supply liquidity for a single swap, in an attempt to extract disproportionately high fees relative to the remaining passive LPs. This paper provides the first formal, transaction-level model of JIT liquidity provision for a widespread class of AMMs known as Concentrated Liquidity Market Makers (CLMMs), as seen in Uniswap V3, for instance. We characterize the landscape of price impact and fee allocation in these systems, formulate and analyze a non-linear optimization problem faced by JIT LPs, and prove the existence of an optimal strategy. By fitting our optimal solution for JIT LPs to real-world CLMMs, we observe that in liquidity pools (particularly those with risky assets), there is a significant gap between observed and optimal JIT behavior. Existing JIT LPs often fail to account for price impact; doing so, we estimate they could increase earnings by up to 69% on average over small time windows. We also show that JIT liquidity, when deployed strategically, can improve market efficiency by reducing slippage for traders, albeit at the cost of eroding average passive LP profits by up to 44% per trade.
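
The fee and slippage effects discussed in this abstract can be illustrated with a deliberately simplified sketch. It is not the paper's model: it assumes a single active price range with no tick crossing, pro-rata fee sharing among in-range liquidity, and the standard concentrated-liquidity relation between input amount and square-root price. The function names and all numbers are hypothetical, and the JIT LP's inventory risk is ignored.

```python
def swap_token1_in(sqrt_price, liquidity, amount_in, fee_rate):
    """Swap token1 for token0 inside a single price range (no tick crossing assumed)."""
    fee = amount_in * fee_rate
    net_in = amount_in - fee
    # Within a range, the change in sqrt(P) equals the net token1 input divided by L.
    new_sqrt_price = sqrt_price + net_in / liquidity
    # Token0 paid out equals L times the change in 1/sqrt(P).
    amount_out = liquidity * (1 / sqrt_price - 1 / new_sqrt_price)
    return new_sqrt_price, amount_out, fee

def jit_outcome(sqrt_price, passive_liq, jit_liq, amount_in, fee_rate=0.003):
    """Fee split and execution quality when a JIT LP adds liquidity for one swap."""
    total = passive_liq + jit_liq
    new_sqrt_price, amount_out, fee = swap_token1_in(sqrt_price, total, amount_in, fee_rate)
    return {
        "jit_fee": fee * jit_liq / total,          # pro-rata share of the swap fee
        "passive_fee": fee * passive_liq / total,
        "amount_out": amount_out,
        "price_after": new_sqrt_price ** 2,
    }

without_jit = jit_outcome(1.0, passive_liq=1_000.0, jit_liq=0.0, amount_in=100.0)
with_jit = jit_outcome(1.0, passive_liq=1_000.0, jit_liq=9_000.0, amount_in=100.0)
```

In this toy case the JIT position takes most of the fee, the passive LPs' share shrinks, and the trader receives more output because the deeper pool moves the price less. These are the two opposing effects the abstract quantifies.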

Structuring The Future: Diffusion LLM Speculative Decoding via Calibrated Draft Graphs

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.18085v4 Announce Type: replace Abstract: Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token-generation rates. To unlock this potential, we present Spiffy, a speculative decoding algorithm to accelerate dLLM inference while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to dLLMs. Spiffy performs auto-speculation to eliminate the overheads of an independent draft model, structuring draft states in the form of a novel directed draft graph to take advantage of the bidirectional, blockwise nature of dLLM generation. These draft graphs are calibrated offline to maximize acceptance rates and are dynamically pruned during inference for improved computational efficiency. We present a detailed formulation of Spiffy and demonstrate its ability to accelerate LLaDA, Dream, and SDAR models in combination with KV caching and threshold-based dynamic unmasking leading to up to $8.6\times$ reduction in model inferences and $6.3\times$ acceleration in token rate.

Metriplectic Conditional Flow Matching for Dissipative Dynamics

Ali Baheri, Lars Lindemann — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.19526v2 Announce Type: replace Abstract: Metriplectic conditional flow matching (MCFM) learns dissipative dynamics without violating first principles. Neural surrogates often inject energy and destabilize long-horizon rollouts; MCFM instead builds the conservative-dissipative split into both the vector field and a structure-preserving sampler. MCFM trains via conditional flow matching on short transitions, avoiding long rollout adjoints. At inference time, a Strang-prox scheme alternates a symplectic update with a proximal metric step, ensuring discrete energy decay; an optional projection enforces strict decay when a trusted energy is available. We provide continuous- and discrete-time guarantees linking this parameterization and sampler to conservation, monotonic dissipation, and stable rollouts. On a controlled mechanical benchmark, MCFM yields phase portraits closer to ground truth and markedly fewer energy-increase and positive energy rate events than an equally expressive unconstrained neural flow, while matching terminal distributional fit.
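
To make the alternation of a symplectic update with a proximal step more concrete, here is a minimal sketch on a damped harmonic oscillator. It is not the authors' implementation: the quadratic energy, the damping coefficient `gamma`, the use of leapfrog as the symplectic substep, and all function names are assumptions chosen for illustration.

```python
def energy(q, p):
    """Assumed energy H(q, p) = (q^2 + p^2) / 2 of a unit harmonic oscillator."""
    return 0.5 * (q * q + p * p)


def leapfrog(q, p, h):
    """Symplectic (kick-drift-kick) update for the conservative part dq/dt = p, dp/dt = -q."""
    p = p - 0.5 * h * q
    q = q + h * p
    p = p - 0.5 * h * q
    return q, p


def prox_dissipation(p, h, gamma):
    """Proximal (implicit Euler) step for the dissipative part dp/dt = -gamma * p.

    Solves argmin_x (x - p)^2 / (2 h) + gamma * x^2 / 2, which gives x = p / (1 + h * gamma),
    so the momentum can only shrink in magnitude.
    """
    return p / (1.0 + h * gamma)


def strang_prox_step(q, p, h, gamma):
    """Strang splitting: half dissipative step, full symplectic step, half dissipative step."""
    p = prox_dissipation(p, 0.5 * h, gamma)
    q, p = leapfrog(q, p, h)
    p = prox_dissipation(p, 0.5 * h, gamma)
    return q, p


def rollout(q, p, h=0.05, gamma=0.5, steps=200):
    """Return the energy after each step of the splitting scheme."""
    energies = [energy(q, p)]
    for _ in range(steps):
        q, p = strang_prox_step(q, p, h, gamma)
        energies.append(energy(q, p))
    return energies


energies = rollout(1.0, 0.0)
print(f"initial energy {energies[0]:.4f}, final energy {energies[-1]:.6f}")
```

In this toy setting the leapfrog substep conserves a nearby modified energy rather than H itself, so the sketch only demonstrates overall decay over the rollout; the strict discrete decay guarantees claimed in the abstract rely on the paper's own construction.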

Skeleton Sparsification and Densification Scale-Spaces

Julia Gierke, Pascal Peter — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.21398v2 Announce Type: replace Abstract: The Hamilton-Jacobi skeleton, also known as the medial axis, is a powerful shape descriptor that represents binary objects in terms of the centres of maximal inscribed discs. Despite its broad applicability, the medial axis suffers from sensitivity to noise: Minor boundary variations can lead to disproportionately large and undesirable expansions of the skeleton. Classical pruning methods mitigate this shortcoming by systematically removing extraneous skeletal branches. This sequential simplification of skeletons resembles the principle of sparsification scale-spaces that embed images into a family of reconstructions from increasingly sparse pixel representations. We combine both worlds by introducing skeletonisation scale-spaces: They leverage sparsification of the medial axis to achieve hierarchical simplification of shapes. Unlike conventional pruning, our framework inherently satisfies key scale-space properties such as hierarchical architecture, controllable simplification, and equivariance to geometric transformations. We provide a rigorous theoretical foundation in both continuous and discrete formulations and extend the concept further with densification. By growing the skeleton successively instead of shrinking it, we allow inverse progression from coarse to fine scales. Densification scale-spaces can even reach beyond the original skeleton to produce overcomplete shape representations with relevancy for practical applications. Through proof-of-concept experiments, we demonstrate the effectiveness of our framework for practical tasks including robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.

C-QUERI: Congressional Questions, Exchanges, and Responses in Institutions Dataset

Manjari Rudra, Daniel Magleby, Sujoy Sikdar — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.21548v2 Announce Type: replace Abstract: Questions in political interviews and hearings serve strategic purposes beyond information gathering, including advancing partisan narratives and shaping public perceptions. However, these strategic aspects remain understudied due to the lack of large-scale datasets for studying such discourse. Congressional hearings provide an especially rich and tractable site for studying political questioning: Interactions are structured by formal rules, witnesses are obliged to respond, and members with different political affiliations are guaranteed opportunities to ask questions, enabling comparisons of behaviors across the political spectrum. We develop a pipeline to extract question-answer pairs from unstructured hearing transcripts and construct a novel dataset of committee hearings from the 108th--117th Congress. Our analysis reveals systematic differences in questioning strategies across parties by showing that the party affiliation of questioners can be predicted from their questions alone. Our dataset and methods not only advance the study of congressional politics, but also provide a general framework for analyzing question-answering across interview-like settings.

BrainPro: Towards Large-scale Brain State-aware EEG Representation Learning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.22050v2 Announce Type: replace Abstract: Electroencephalography (EEG) reflects underlying brain states, whose activities are distributed across brain regions and manifest as spatial patterns on the scalp. Learning these spatially structured, state-related patterns requires consistent spatial representations across datasets. However, existing EEG foundation models are typically based on self-attention, which does not preserve location-specific information and struggles to align signals recorded with different channel configurations. Moreover, brain states contain both shared and state-specific regional activity, suggesting that learning neurophysiologically plausible, state-aware representations can complement the shared representations targeted by current models and improve downstream decoding. To address these limitations, we propose BrainPro, a large EEG model that combines a retrieval-based spatial learning mechanism for cross-layout spatial alignment with a brain state-decoupling module that learns both shared and state-specific representations through parallel encoders and region-aware reconstruction. Pre-trained on a large EEG corpus, BrainPro achieves state-of-the-art performance across nine public BCI datasets spanning emotion, motor, speech, stress, mental disease, and attention tasks. Analyses of spatial filters, channel-drop robustness, and encoder contributions further validate the effectiveness of its spatial alignment and state-aware pathways. These results show that BrainPro achieves improved interpretability of learned spatial patterns and produces representations that benefit diverse EEG decoding tasks.

Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2509.25787v5 Announce Type: replace Abstract: Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks. Furthermore, the framework demonstrates significant flexibility, allowing it to be stacked with pre-trained IQA models to bolster generalization on unseen datasets. Codes and checkpoints will be available at https://github.com/bytedance/EvoQuality.
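
The pairwise majority voting step described above can be sketched in a few lines. This is an illustrative sketch only, not the EvoQuality code: the vote encoding (1 means the first image of the pair was judged higher quality, 0 means the second), the tie handling, and the function name `pairwise_majority` are assumptions, and the sampled VLM outputs are replaced by hard-coded votes.

```python
def pairwise_majority(votes):
    """Turn repeated pairwise judgments into pseudo-labels.

    votes maps an (image_a, image_b) pair to a list of 0/1 outcomes, where 1 means
    image_a was judged better in that sample. Returns, for each pair with a clear
    majority, a tuple (label, agreement), where agreement is the majority fraction.
    """
    pseudo_labels = {}
    for pair, outcomes in votes.items():
        wins_first = sum(outcomes)
        wins_second = len(outcomes) - wins_first
        if wins_first == wins_second:
            # No consensus (including an empty vote list); skipping is an assumption.
            continue
        label = 1 if wins_first > wins_second else 0
        agreement = max(wins_first, wins_second) / len(outcomes)
        pseudo_labels[pair] = (label, agreement)
    return pseudo_labels


# Hypothetical votes standing in for repeated samples from the model itself.
sampled_votes = {
    ("img_a", "img_b"): [1, 1, 0, 1, 1],
    ("img_a", "img_c"): [0, 1, 0, 0],
    ("img_b", "img_c"): [1, 0],
}
print(pairwise_majority(sampled_votes))
```

The agreement fraction is kept alongside each label because a downstream reward could weight confident pairs more heavily; how the paper actually converts pseudo-rankings into its fidelity reward is not specified in the abstract.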

Coarse scrambling for Sobol' and Niederreiter sequences

Kosuke Suzuki — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2510.02111v2 Announce Type: replace Abstract: We introduce coarse scrambling, a novel randomization for digital sequences that permutes blocks of digits in a mixed-radix representation. This construction is designed to preserve the powerful $(0,\mathbb{e},d)$-sequence property of the underlying points. For sufficiently smooth integrands, we prove that this method achieves the canonical $O(n^{-3+\epsilon})$ variance decay rate, matching that of standard Owen's scrambling. Crucially, we show that its maximal gain coefficient grows only logarithmically with dimension, $O(\log d)$, thus providing theoretical robustness against the curse of dimensionality affecting scrambled Sobol' sequences. Numerical experiments validate these findings and illustrate a practical trade-off: while Owen's scrambling is superior for integrands sensitive to low-dimensional projections, coarse scrambling is competitive for functions with low effective truncation dimension.

Unraveling Syntax: Language Modeling and the Substructure of Grammars

Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2510.02524v3 Announce Type: replace Abstract: While language models achieve impressive results, their learning dynamics are far from understood. Many domains of interest -- such as natural language syntax, coding languages, arithmetic -- are captured by context-free grammars (CFGs). In this work, we extend prior work on neural language modeling of CFGs in a novel direction: how language modeling behaves with respect to CFG substructure, namely subgrammars. We define subgrammars, and prove a set of fundamental theorems connecting language modeling and subgrammars. We show that language modeling loss recurses linearly over its top-level subgrammars; applied recursively, the loss decomposes into losses for "irreducible" subgrammars. Under additional assumptions, and empirically, parametrized models learn subgrammars in parallel, unlike children who first master simple substructures. We find that subgrammar pretraining can improve final performance, but only for tiny models relative to the grammar, while alignment analyses show that pretraining consistently leads to internal representations that better reflect the grammar's substructure.

GAE: Unleashing Physical Potential of VLM with Generalizable Action Expert

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2510.03896v2 Announce Type: replace Abstract: Vision-language models demonstrate strong reasoning and planning abilities, yet grounding these predictions into precise robot actions remains a central challenge. Existing Vision-Language-Action methods typically entangle reasoning and action generation, leading to limited generalization. We propose Generalizable Action Expert (GAE), a task-agnostic model that converts sparse geometric plans into dense robot actions. Our approach introduces a sparse geometric interface: the VLM predicts sparse 3D waypoints representing high-level intention, while GAE maps these waypoints together with real-time point cloud observations to continuous action trajectories. GAE is pretrained on a large-scale pointcloud-trajectory dataset comprising 150k trajectories from both simulation and real-world robots. To further improve efficiency and generalization, we introduce an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme that decouples learning action dynamics from geometry grounding. After pretraining, GAE is frozen and reused across downstream tasks, requiring only lightweight fine-tuning of the VLM to produce the sparse interface. Experiments show that our method achieves strong performance and generalization across diverse visual domains, camera viewpoints, and natural language instructions.

Active Semantic Perception

Huayi Tang, Pratik Chaudhari — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2510.05430v2 Announce Type: replace Abstract: We develop an approach for active semantic perception, which refers to using the semantics of the scene for tasks such as exploration. We build a compact, multi-layer scene graph that can represent large, complex indoor environments at various levels of abstraction, e.g., nodes corresponding to rooms, objects, walls, windows etc., as well as fine-grained details of their geometry. We develop a procedure based on large language models (LLMs) to sample new plausible scene graphs of unobserved regions that are consistent with partial observations of the scene. We develop a procedure to compute the information gain of a potential waypoint upon this scene graph to enable sophisticated spatial reasoning: for example, of the two doors that lead out of the living room, one probably leads to the kitchen and the other to the bedroom. We evaluate our approach in realistic 3D indoor apartments in simulation and also on a Unitree Go 2 robot in the real world. Qualitative and quantitative analysis shows that our approach can pin down high-level and low-level semantic information in the environment quickly and more accurately than existing approaches.

Typestate via Revocable Capabilities

Songlin Jia, Craig Liu, Siyuan He, Haotian Deng, Yuyan Bao, Tiark Rompf — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2510.08889v2 Announce Type: replace Abstract: Managing stateful resources safely and expressively is a longstanding challenge in programming languages, especially in the presence of aliasing. For example, scope-based constructs like Java's synchronized blocks offer ease of reasoning, but they restrict expressiveness and parallelism. Conversely, imperative, flow-sensitive approaches enable fine-grained control, but they require sophisticated typestate analyses and often burden programmers with explicit state tracking. In this work, we present a novel approach that unifies the ease of scoped reasoning with the expressiveness of imperative typestate management. Our design extends traditional flow-insensitive capability mechanisms to a flow-sensitive setting. In particular, we decouple capability lifetimes from lexical scopes, allowing functions to receive, revoke, or return capabilities in a flow-sensitive manner, building on existing mechanisms for the safety and ergonomics of scoped capability programming. We implement our approach as an extension to the Scala 3 compiler, leveraging path-dependent types and implicit resolution to enable concise, statically safe, and expressive typestate programming. Our prototype generically supports a wide range of patterns, including file operations, advanced locking protocols, DOM construction, and session types, showing that expressive and safe typestate management can be achieved with minimal extensions to an existing language with capability support.

Toward General Digraph Contrastive Learning: A Dual Spatial Perspective

Zhengyu Wu, Daohan Su, Yang Zhang, Xunkai Li, Rong-Hua Li, Guoren Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2510.16311v2 Announce Type: replace Abstract: Graph Contrastive Learning (GCL) has emerged as a powerful tool for extracting consistent representations from graphs, independent of labeled information. However, existing methods predominantly focus on undirected graphs, disregarding the pivotal directional information that is fundamental and indispensable in real-world networks (e.g., social networks and recommendations). In this paper, we introduce S2-DiGCL, a novel framework that emphasizes spatial insights from complex and real domain perspectives for directed graph (digraph) contrastive learning. From the complex-domain perspective, S2-DiGCL introduces personalized perturbations into the magnetic Laplacian to adaptively modulate edge phases and directional semantics. From the real-domain perspective, it employs a path-based subgraph augmentation strategy to capture fine-grained local asymmetries and topological dependencies. By jointly leveraging these two complementary spatial views, S2-DiGCL constructs high-quality positive and negative samples, leading to more general and robust digraph contrastive learning. Extensive experiments on 7 real-world digraph datasets demonstrate the superiority of our approach, achieving SOTA performance with 4.41% improvement in node classification and 4.34% in link prediction under both supervised and unsupervised settings.

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2510.16380v2 Announce Type: replace Abstract: As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, to cover cases of AI advising humans on moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models

Emily Chang, Niyati Bafna — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2510.16928v3 Announce Type: replace Abstract: Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.

Classport: Designing Runtime Dependency Introspection for Java

Serena Cofano, Daniel Williams, Aman Sharma, Martin Monperrus — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2510.20340v4 Announce Type: replace Abstract: Runtime introspection of dependencies, i.e., the ability to observe which dependencies are currently used during program execution, is fundamental for Software Supply Chain security. Yet, Java has no support for it. We solve this problem with Classport, a blueprint and system that embeds dependency information into Java class files, enabling the retrieval of dependency information at runtime. We evaluate Classport on six real-world projects, demonstrating the feasibility in identifying dependencies at runtime.

A mathematical study of the excess growth rate

Steven Campbell, Ting-Kam Leonard Wong — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2510.25740v2 Announce Type: replace Abstract: The excess growth rate, defined as the gap in Jensen's inequality for the logarithm, is a fundamental functional in portfolio theory. In this paper, we present a mathematical study motivated by information theory. We begin by establishing its properties and showing that it has rich connections with information theoretic concepts such as the Helmholtz free energy, L. Campbell's measure of average code length and large deviations. Our main results consist of three axiomatic characterization theorems of the excess growth rate, in terms of (i) the relative entropy, (ii) the gap in Jensen's inequality, and (iii) the logarithmic divergence that generalizes the Bregman divergence. Furthermore, we study maximization of the excess growth rate and compare it with the growth optimal portfolio. Our results not only provide theoretical justifications of the significance of the excess growth rate, but also establish new connections between information theory and quantitative finance.
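
Since the abstract defines the excess growth rate as the gap in Jensen's inequality for the logarithm, a small numerical sketch may help. The notation is an assumption here: `weights` are portfolio weights summing to one and `gross_returns` are positive per-asset gross returns, giving log(sum_i w_i r_i) - sum_i w_i log(r_i). The function name is hypothetical.

```python
import math


def excess_growth_rate(weights, gross_returns):
    """Jensen gap for the logarithm: log(sum_i w_i * r_i) - sum_i w_i * log(r_i).

    weights must be non-negative and sum to 1; gross_returns must be positive.
    The result is non-negative because the logarithm is concave.
    """
    if len(weights) != len(gross_returns):
        raise ValueError("weights and gross_returns must have the same length")
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    if min(gross_returns) <= 0:
        raise ValueError("gross returns must be positive")
    log_of_mean = math.log(sum(w * r for w, r in zip(weights, gross_returns)))
    mean_of_log = sum(w * math.log(r) for w, r in zip(weights, gross_returns))
    return log_of_mean - mean_of_log


# One asset doubles and the other halves: the equal-weight portfolio grows by 25%
# while the weighted average log return of the assets is zero.
print(excess_growth_rate([0.5, 0.5], [2.0, 0.5]))
# Identical returns leave no Jensen gap.
print(excess_growth_rate([0.5, 0.5], [1.1, 1.1]))
```

The gap vanishes when all assets move together and grows with the dispersion of returns across assets, which is one way to read it as a reward for diversification.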

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.02627v4 Announce Type: replace Abstract: We introduce DecompSR, decomposed spatial reasoning, a large benchmark dataset (over 5m datapoints) and generation framework designed to analyse compositional spatial reasoning ability. The generation of DecompSR allows users to independently vary several aspects of compositionality, namely: productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). DecompSR is built procedurally in a manner which makes it correct by construction, which is independently verified using a symbolic solver to guarantee the correctness of the dataset. DecompSR is comprehensively benchmarked across a host of Large Language Models (LLMs), where we show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks, whereas they are more robust to linguistic variation. DecompSR provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Claudio Giusti, Luca Guarnera, Sebastiano Battiato — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.04260v3 Announce Type: replace Abstract: The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates Closed-set classification with a density-based Open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet.

DWM-RO: Decentralized World Models with Reasoning Offloading for SWIPT-enabled Satellite-Terrestrial HetNets

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.05972v3 Announce Type: replace Abstract: Wireless networks are undergoing a paradigm shift toward massive connectivity with energy-efficient operation, driving the integration of satellite-terrestrial architectures with simultaneous wireless information and power transfer (SWIPT). Optimizing transmit beamforming and power splitting in such systems faces formidable challenges, e.g., time-varying channels and multi-tier interference, which create a complex decision landscape where conventional model-free multi-agent reinforcement learning (MARL) suffers from sample inefficiency due to rarely-encountered state transitions and poor coordination as decentralized agents act independently. This paper proposes the Decentralized World Model with Reasoning Offloading (DWM-RO) framework to address these fundamental limitations. Specifically, each agent employs a world model to learn compact predictive representations of environment dynamics, enabling imagination-based policy training that dramatically reduces required environment interactions. An uncertainty-aware offloading gate monitors local interference levels and model reconstruction errors to trigger selective edge coordination. When activated, a lightweight latent decorrelation mechanism at the edge refines agents' strategic representations, guiding them toward orthogonal actions that minimize resource conflicts. Extensive simulations demonstrate that DWM-RO converges 5 times faster than state-of-the-art baselines while achieving 34.7% higher spectral efficiency and reducing constraint violations by 40%. In dense network scenarios with 10 users, DWM-RO maintains violation rates below 20% while baselines exceed 70%, validating superior robustness.

Miniature Testbed for Validating Multi-Agent Cooperative Autonomous Driving

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.11022v2 Announce Type: replace Abstract: Cooperative autonomous driving, which extends vehicle autonomy by enabling real-time collaboration between vehicles and smart roadside infrastructure, remains a challenging yet essential problem. However, none of the existing testbeds employ smart infrastructure equipped with sensing, edge computing, and communication capabilities. To address this gap, we design and implement a 1:15-scale miniature testbed, CIVAT, for validating cooperative autonomous driving, consisting of a scaled urban map, autonomous vehicles with onboard sensors, and smart infrastructure. The proposed testbed integrates V2V and V2I communication with the publish-subscribe pattern through a shared Wi-Fi and ROS2 framework, enabling information exchange between vehicles and infrastructure to realize cooperative driving functionality. As a case study, we validate the system through infrastructure-based perception and intersection management experiments.

The modified Physics-Informed Hybrid Parallel Kolmogorov--Arnold and Multilayer Perceptron Architecture with domain decomposition

Qiumei Huang, Xu Wang, Yu Zhao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.11228v3 Announce Type: replace Abstract: In this work, we propose a modified Hybrid Parallel Kolmogorov--Arnold Network and Multilayer Perceptron Physics-Informed Neural Network to overcome the high-frequency and multiscale challenges inherent in Physics-Informed Neural Networks. This proposed model features a trainable weighting parameter to optimize the convex combination of outputs from the Kolmogorov--Arnold Network and the Multilayer Perceptron, thus maximizing the networks' capabilities to capture different frequency components. Furthermore, we adopt an overlapping domain decomposition technique to decompose complex problems into subproblems, which alleviates the challenge of global optimization. Benchmark results demonstrate that our method reduces training costs and improves computational efficiency compared with manual hyperparameter tuning in solving high-frequency multiscale problems.
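
The trainable convex combination of the two branch outputs can be illustrated with a short sketch. This is a guess at one possible parameterization, not the paper's: the raw parameter `theta`, the sigmoid squashing that keeps the weight in (0, 1), and the function names are all assumptions, and the branch outputs are plain numbers rather than network evaluations.

```python
import math


def mixing_weight(theta):
    """Map an unconstrained trainable parameter to a weight in (0, 1) via a sigmoid."""
    return 1.0 / (1.0 + math.exp(-theta))


def combine(theta, kan_output, mlp_output):
    """Convex combination alpha * KAN + (1 - alpha) * MLP, applied elementwise."""
    alpha = mixing_weight(theta)
    return [alpha * k + (1.0 - alpha) * m for k, m in zip(kan_output, mlp_output)]


# Hypothetical branch outputs at three collocation points.
kan_out = [1.0, 2.0, 3.0]
mlp_out = [3.0, 2.0, 1.0]
print(combine(0.0, kan_out, mlp_out))   # theta = 0 weights both branches equally
print(combine(4.0, kan_out, mlp_out))   # larger theta leans toward the KAN branch
```

Because the weight is differentiable in `theta`, an optimizer can adjust it together with the network parameters instead of it being hand-tuned, which appears to be the comparison the abstract draws with manual hyperparameter tuning.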

Discretization, Uniform-in-Time Estimations and Approximation of Invariant Measures for Nonlinear Stochastic Differential Equations with Non-Uniform Dissipativity

Shan Huang, Xiaoyue Li — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.12124v2 Announce Type: replace Abstract: The approximation of invariant measures for nonlinear ergodic stochastic differential equations (SDEs) is a central problem in scientific computing, with important applications in stochastic sampling, physics, and ecology. We first propose an easily applicable explicit Truncated Euler-Maruyama (TEM) scheme and prove its numerical ergodicity in the $L^p$-Wasserstein distance ($p\geqslant 1$). Furthermore, by combining truncation techniques with the coupling method, we establish a uniform-in-time $1/2$-order convergence rate in moments for the TEM scheme. Additionally, leveraging the exponential ergodicity of both the numerical and exact solutions, we derive a $1/2$-order convergence rate for the invariant measures of the TEM scheme and the exact solution in the $L^1$-Wasserstein distance. Finally, two numerical experiments are conducted to validate our theoretical results.
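
For readers unfamiliar with truncated Euler-Maruyama methods, the sketch below shows the general idea on a scalar SDE with superlinear drift, dX = (X - X^3) dt + sigma dW. It does not reproduce the scheme analysed in the paper: the test equation, the fixed truncation `radius` (in the literature the radius typically depends on the step size), and all names are assumptions for illustration.

```python
import math
import random


def truncate(x, radius):
    """Clip the state to [-radius, radius] before it is fed to the drift."""
    return max(-radius, min(radius, x))


def drift(x):
    """Superlinear, dissipative drift of the assumed test equation."""
    return x - x ** 3


def truncated_em_path(x0, dt, steps, sigma=0.5, radius=3.0, seed=0):
    """Explicit Euler-Maruyama path with the drift evaluated at the truncated state."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over one step
        x = x + drift(truncate(x, radius)) * dt + sigma * dw
        path.append(x)
    return path


# With the noise switched off and a large initial value, plain explicit Euler would
# overshoot violently on the cubic term; the truncated drift stays bounded and the
# path settles at the stable equilibrium x = 1.
path = truncated_em_path(100.0, dt=0.01, steps=2000, sigma=0.0)
print(path[-1])
```

Keeping the scheme explicit is the practical appeal: each step costs one drift evaluation, and the truncation is what prevents the superlinear term from destabilizing the iteration.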

Can Small GenAI Language Models Rival Large Language Models in Understanding Application Behavior?

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.12576v2 Announce Type: replace Abstract: Generative AI (GenAI) models, particularly large language models (LLMs), have transformed multiple domains, including natural language processing, software analysis, and code understanding. Their ability to analyze and generate code has enabled applications such as source code summarization, behavior analysis, and malware detection. In this study, we systematically evaluate the capabilities of both small and large GenAI language models in understanding application behavior, with a particular focus on malware detection as a representative task. While larger models generally achieve higher overall accuracy, our experiments show that small GenAI models maintain competitive precision and recall, offering substantial advantages in computational efficiency, faster inference, and deployment in resource-constrained environments. We provide a detailed comparison across metrics such as accuracy, precision, recall, and F1-score, highlighting each model's strengths, limitations, and operational feasibility. Our findings demonstrate that small GenAI models can effectively complement large ones, providing a practical balance between performance and resource efficiency in real-world application behavior analysis.

Examining the Usage of Generative AI Models in Student Learning Activities for Software Programming

Rufeng Chen, Shuaishuai Jiang, Jiyun Shen, AJung Moon, Lili Wei — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.13271v2 Announce Type: replace Abstract: The rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing research has primarily focused on GenAI's ability to complete educational tasks and its impact on student performance, often overlooking its effects on knowledge gains. In this study, we investigate how GenAI assistance compares to conventional online resources in supporting knowledge gains across different proficiency levels. We conducted a controlled user experiment with 24 undergraduate students of two different levels of programming experience (beginner, intermediate) to examine how students interact with ChatGPT while solving programming tasks. We analyzed task performance, conceptual understanding, and interaction behaviors. Our findings reveal that generating complete solutions with GenAI significantly improves task performance, especially for beginners, but does not consistently result in knowledge gains. Importantly, usage strategies differ by experience: beginners tend to rely heavily on GenAI toward task completion often without knowledge gain in the process, while intermediates adopt more selective approaches. We find that both over-reliance and minimal use result in weaker knowledge gains overall. Based on our results, we call on students and educators to adopt GenAI as a learning rather than a problem solving tool. Our study highlights the urgent need for guidance when integrating GenAI into programming education to foster deeper understanding.

nlKrylov: A Unified Framework for Nonlinear GCR-type Krylov Subspace Methods

Tom Werner, Ning Wan, Agnieszka Miedlar — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.14713v2 Announce Type: replace Abstract: In this paper, we introduce a unified framework for nonlinear Krylov subspace methods (\textit{nlKrylov}) to solve systems of nonlinear equations. Building on classical GCR-like/type linear Krylov solvers such as GMRESR, we generalize these approaches to nonlinear problems via nested algorithmic structures. We present rigorous convergence results for problems, relying on relaxed assumptions that avoid the need for exact line searches. The framework is further extended to matrix-valued root finding problems using global nonlinear Krylov approaches. Extensive numerical experiments validate the theoretical insights and demonstrate the robustness and efficiency of our proposed algorithms.

Shallow neural network yields regularization for ill-posed inverse problems

Lan Wang, Qiao Zhu, Bangti Jin, Ye Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.16171v2 Announce Type: replace Abstract: In this paper, we develop a regularization theory for neural network approximations of general ill-posed operator equations with noisy data. Within the framework of iterative regularization, we introduce two expanding neural network methods (ENNs) under different a priori assumptions on the exact solution. Instead of prescribing a fixed architecture, ENNs adaptively select the number of neurons through an a posteriori stopping rule, so that the selected network size serves as a regularization parameter balancing approximation accuracy and stability with respect to data noise. We prove the regularization properties of the proposed ENNs and establish quantitative relationships between the selected network size and the noise level. Within the framework of variational regularization, we propose a neural network-based Tikhonov scheme and derive both convergence and convergence-rate results under mild assumptions. The resulting estimates account for the noise level, the network size, and the underlying smoothness expressed through general variational source conditions, thereby allowing greater flexibility than existing results. Numerical experiments demonstrate the effectiveness and robustness of the proposed algorithms. In particular, they show that, for highly noisy data, relatively small network architectures can already produce stable reconstructions, whereas excessively large architectures may degrade stability due to overfitting.

QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

Adam Lilja, Ji Lan, Junsheng Fu, Lars Hammarstrand — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.17221v2 Announce Type: replace Abstract: Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. https://research.zenseact.com/publications/queryocc/
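
To illustrate what a contractive representation can look like, here is a minimal sketch. The abstract does not give the authors' formula; the function below is an assumption borrowed from the well-known mip-NeRF 360 scene contraction, and the name `contract` is hypothetical. Points inside the unit ball are left untouched (near-field detail is preserved), while points farther away are smoothly squashed into a shell of radius below 2.

```python
import math

def contract(point):
    """Identity inside the unit ball; squash farther points to radius < 2.

    Assumed form (mip-NeRF 360 style), not necessarily the QueryOcc mapping.
    """
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= 1.0:
        return tuple(point)
    # New radius is 2 - 1/norm, direction is unchanged.
    scale = (2.0 - 1.0 / norm) / norm
    return tuple(c * scale for c in point)

near = contract((0.5, 0.0, 0.0))     # unchanged
far = contract((1000.0, 0.0, 0.0))   # radius just under 2
```

Because the contracted space is bounded, a fixed-size representation can cover arbitrarily distant regions, which is one way to keep memory constant.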

Learning Visually Interpretable Oscillator Networks for Soft Continuum Robots from Video

Henrik Krauss, Johann Licher, Naoya Takeishi, Annika Raatz, Takehisa Yairi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.18322v4 Announce Type: replace Abstract: Learning soft continuum robot (SCR) dynamics from video offers flexibility but existing methods lack interpretability or rely on prior assumptions. Model-based approaches require prior knowledge and manual design. We bridge this gap by introducing: (1) The Attention Broadcast Decoder (ABCD), a plug-and-play module for autoencoder-based latent dynamics learning that generates pixel-accurate attention maps localizing each latent dimension's contribution while filtering static backgrounds, enabling visual interpretability via spatially grounded latents and on-image overlays. (2) Visual Oscillator Networks (VONs), a 2D latent oscillator network coupled to ABCD attention maps for on-image visualization of learned masses, coupling stiffness, and forces, thereby enabling mechanical interpretability. We validate our approach on single- and double-segment SCRs, demonstrating that ABCD-based models significantly improve multi-step prediction accuracy with 5.8x error reduction for Koopman operators and 3.5x for oscillator networks on a two-segment robot. VONs autonomously discover a chain structure of oscillators. This fully data-driven approach yields compact, mechanically interpretable models with potential relevance for future control applications.

Navigating Gigapixel Pathology Images with Large Multimodal Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.19652v2 Announce Type: replace Abstract: Recent advances in large multimodal models have allowed for the development of interactive chat models that can converse and reason about pathology whole-slide images (WSIs). However, existing slide-level chat systems are often highly specialized, typically compressing WSIs into fixed slide-level embeddings or relying on multi-component pipelines, which can lose multi-scale detail and limit generalizability beyond the target task. We present GIANT (Gigapixel Image Agent for Navigating Tissue), a simple, training-free approach that lets general-purpose multimodal models navigate WSIs on their own, iteratively selecting multi-magnification crops and aggregating evidence over time. To evaluate generalizability in WSI question answering and to promote reproducibility, we introduce MultiPathQA, a benchmark suite spanning five clinical challenges and 934 questions over 868 unique WSIs. This includes a new set of 128 pathologist-authored multiple-choice questions designed to mirror real diagnostic search and multi-scale reasoning. Using GPT-5, GIANT outperforms models specialized for pathology question answering, achieving state-of-the-art performance on four out of five benchmarks.

Design Criteria for SGD Preconditioners: Local Conditioning, Noise Floors, and Basin Stability

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.19716v2 Announce Type: replace Abstract: Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmetric positive definite matrix $\mathbf{M}$, deriving bounds in which both the convergence rate and the stochastic noise floor are governed by $\mathbf{M}$-dependent quantities: the rate through an effective condition number in the $\mathbf{M}$-metric, and the floor through the product of that condition number and the preconditioned noise level. For nonconvex objectives, we establish a preconditioner-dependent basin-stability guarantee: when smoothness and basin size are measured in the $\mathbf{M}$-norm, the probability that the iterates remain in a well-behaved local region admits an explicit lower bound. This perspective is particularly relevant in Scientific Machine Learning (SciML), where achieving small training loss under stochastic updates is closely tied to physical fidelity, numerical stability, and constraint satisfaction. The framework applies to both diagonal/adaptive and curvature-aware preconditioners and yields a simple design principle: choose $\mathbf{M}$ to improve local conditioning while attenuating noise. Experiments on a quadratic diagnostic and three SciML benchmarks validate the predicted rate-floor behavior.
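
The effect of the preconditioner on conditioning can be seen on a toy quadratic. The sketch below is not the paper's experiment: the diagonal curvature, the step sizes, and the function name `preconditioned_sgd` are illustrative assumptions. It runs the update $x \leftarrow x - \eta\,\mathbf{M}^{-1}(\mathbf{A}x + \text{noise})$ with a diagonal $\mathbf{M}$ on $f(x) = \tfrac12 x^\top \mathbf{A} x$.

```python
import random

def preconditioned_sgd(curvature, precond, lr, sigma, steps, seed=0):
    """Run x <- x - lr * M^{-1} (A x + noise) for diagonal A and M.

    Returns the final loss 0.5 * sum(a_i * x_i^2), starting from x = (1, ..., 1).
    """
    rng = random.Random(seed)
    x = [1.0 for _ in curvature]
    for _ in range(steps):
        x = [xi - lr * (a * xi + sigma * rng.gauss(0.0, 1.0)) / m
             for xi, a, m in zip(x, curvature, precond)]
    return 0.5 * sum(a * xi * xi for a, xi in zip(curvature, x))

A = [1.0, 100.0]  # condition number 100 in the Euclidean metric
plain = preconditioned_sgd(A, [1.0, 1.0], lr=0.01, sigma=0.0, steps=200)
matched = preconditioned_sgd(A, A, lr=0.5, sigma=0.0, steps=200)
```

With $\mathbf{M} = \mathbf{A}$ the effective condition number is 1, so a much larger step is stable and the noiseless run converges far faster than plain SGD. Setting `sigma > 0` shows the other half of the trade-off: the iterates settle at a noise floor that depends on both the step size and the preconditioned noise.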

DiskChunGS: Large-Scale 3D Gaussian SLAM Through Chunk-Based Memory Management

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.23030v2 Announce Type: replace Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated impressive results for novel view synthesis with real-time rendering capabilities. However, integrating 3DGS with SLAM systems faces a fundamental scalability limitation: methods are constrained by GPU memory capacity, restricting reconstruction to small-scale environments. We present DiskChunGS, a scalable 3DGS SLAM system that overcomes this bottleneck through an out-of-core approach that partitions scenes into spatial chunks and maintains only active regions in GPU memory while storing inactive areas on disk. Our architecture integrates seamlessly with existing SLAM frameworks for pose estimation and loop closure, enabling globally consistent reconstruction at scale. We validate DiskChunGS on indoor scenes (Replica, TUM-RGBD), urban driving scenarios (KITTI), and resource-constrained Nvidia Jetson platforms. Our method uniquely completes all 11 KITTI sequences without memory failures while achieving superior visual quality, demonstrating that algorithmic innovation can overcome the memory constraints that have limited previous 3DGS SLAM methods.
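
The out-of-core idea can be sketched in a few lines. Everything here is a hypothetical stand-in rather than the DiskChunGS implementation: the class name `ChunkStore`, the grid-cell chunk key, and the least-recently-used eviction policy are assumptions, and plain dictionaries play the roles of GPU memory and disk.

```python
from collections import OrderedDict

class ChunkStore:
    """Keep at most `capacity` spatial chunks active; spill the rest to 'disk'."""

    def __init__(self, chunk_size, capacity):
        self.chunk_size = chunk_size
        self.capacity = capacity
        self.active = OrderedDict()  # stand-in for GPU memory, oldest first
        self.disk = {}               # stand-in for on-disk storage

    def key(self, position):
        # Map a 3D position to the integer index of its grid cell.
        return tuple(int(c // self.chunk_size) for c in position)

    def load(self, position):
        k = self.key(position)
        if k in self.active:
            self.active.move_to_end(k)  # mark as most recently used
        else:
            self.active[k] = self.disk.pop(k, [])
            while len(self.active) > self.capacity:
                old_key, old_chunk = self.active.popitem(last=False)
                self.disk[old_key] = old_chunk  # evict least recently used
        return self.active[k]

    def add(self, position, item):
        self.load(position).append(item)

store = ChunkStore(chunk_size=10.0, capacity=2)
store.add((1.0, 2.0, 3.0), "a")
store.add((15.0, 0.0, 0.0), "b")
store.add((25.0, 0.0, 0.0), "c")  # evicts chunk (0, 0, 0)
```

A real system would choose active chunks from the camera frustum rather than from access recency, but the memory bound works the same way: the active set never exceeds `capacity`.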

Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores

Nikhil Rout, Blaise Tine — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.00053v3 Announce Type: replace Abstract: Efficient mixed-precision MMA operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source Tensor Core implementations rely on discrete arithmetic unit designs, leading to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a configurable mixed-precision fused dot product unit integrating both floating-point and integer arithmetic pipelines within a unified architecture, implemented as part of the open-source RISC-V-based Vortex GPGPU's Tensor Core Unit extension. It supports low-precision multiplication in TF32/FP16/BF16/FP8/BF8/INT8/INT4 with higher-precision FP32/INT32 accumulation, native Microscaling (MX) support, and sparse lane clock-gating for dynamic power reduction, while matching NVIDIA Tensor Core numerical accuracy. Ten-Four achieves 4-cycle latency at 300 MHz Fmax on the Xilinx U55C FPGA, delivering 130.368 GFLOPS peak throughput per Tensor Core and 2.7x-7.9x speedup over equivalent Berkeley HardFloat and FPnew based implementations at less than 60% the area cost. ASIC synthesis in 7nm FinFET achieves 2.771 TFLOPS/W peak efficiency at 1.58 GHz Fmax.

Reasoning about concurrent loops and recursion with rely-guarantee rules

Ian J. Hayes, Larissa A. Meinicke, Cliff B. Jones — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.06242v2 Announce Type: replace Abstract: The objective of this paper is to present general, mechanically verified, refinement rules for reasoning about recursive programs and while loops in the context of concurrency. We make use of the rely-guarantee approach to concurrency that facilitates reasoning about interference from concurrent threads in a compositional manner. Recursive programs can be defined as fixed points over a lattice of commands and hence we develop laws for reasoning about fixed points. Loops can be defined in terms of fixed points and hence the laws for recursion can be applied to develop laws for loops. Unlike many approaches to concurrency, we do not assume that expression evaluation is atomic.

Accurate Models of NVIDIA Tensor Cores

Faizan A. Khattak, Mantas Mikaitis — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.07004v4 Announce Type: replace Abstract: Matrix multiplication is a fundamental operation in both training of neural networks and inference. To accelerate matrix multiplication, Graphics Processing Units (GPUs) implement it in hardware. Due to the increased throughput over software-based matrix multiplication, the multipliers are increasingly used outside of AI, to accelerate various applications in scientific computing. However, matrix multipliers targeted at AI are at present not compliant with IEEE 754 floating-point arithmetic behaviour, with different vendors offering different numerical features. This leads to non-reproducible results across different generations of GPU architectures, at the matrix multiply-accumulate instruction level. To study numerical characteristics of matrix multipliers - such as rounding behaviour, accumulator width, normalization points, extra carry bits, and others - test vectors are typically constructed. Yet, these vectors may or may not distinguish between different hardware models, and due to limited hardware availability, their reliability across many different platforms remains largely untested. We present software models for emulating the inner product behavior of low- and mixed-precision matrix multipliers in the V100, A100, H100 and B200 data center GPUs in most supported input formats of interest to mixed-precision algorithm developers: 8-, 16-, and 19-bit floating point. These matrix multiplier models are first approximated by determining the numerical features via test vectors designed to trigger outputs sensitive to bit level differences in the implementation, followed by semi-exhaustive comparison (randomised input vectors of $10^7$ values) between the models and the actual GPU matrix multipliers - this process is repeated until the model is bit accurate.

Measurement Plasticity: Sensor-Level Adaptation for Vision-Language Models

Boyeong Im, Wooseok Lee, Yoojin Kwon, Hyung-Sin Kim — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.12571v3 Announce Type: replace Abstract: We propose Multi-View Physical-prompt (MVP) for Test-Time Adaptation (TTA), a forward-only framework that moves TTA from tokens to photons by treating the camera exposure triangle (i.e., ISO, shutter speed, and aperture) as physical prompts. At inference, MVP acquires selected multiple physical views using a source-affinity score, evaluates digitally augmented variants of each retained view and filters the lowest-entropy predictions, and aggregates predictions with hard voting. This selection-then-vote design is simple, calibration-friendly, and requires no gradients or model modifications. On ImageNet-ES and ImageNet-ES-Diverse, MVP outperforms digital-only TTA on both Auto-Exposure and a combination with conventional sensor control. MVP remains effective under reduced parameter candidates that lower capture latency, demonstrating its practicality.
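
The selection-then-vote step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it reads "filters the lowest-entropy predictions" as keeping the `keep` most confident predictions, and the function names and the example probabilities are made up.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_then_vote(predictions, keep):
    """Keep the `keep` lowest-entropy predictions, then majority-vote their argmax."""
    confident = sorted(predictions, key=entropy)[:keep]
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in confident)
    return votes.most_common(1)[0][0]

# One softmax output per (physical view, digital augmentation) pair.
views = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.34, 0.33, 0.33],
    [0.10, 0.85, 0.05],
    [0.30, 0.40, 0.30],
]
label = select_then_vote(views, keep=3)
```

Only forward passes are needed: the model is never updated, which matches the forward-only, gradient-free design described in the abstract.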

Adaptable Segmentation Pipeline for Diverse Brain Tumors with Radiomic-Guided Subtyping and Lesion-Wise Model Ensemble

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.14648v2 Announce Type: replace Abstract: Robust and generalizable segmentation of brain tumors on multi-parametric magnetic resonance imaging (MRI) remains difficult because tumor types differ widely. The BraTS 2025 Lighthouse Challenge benchmarks segmentation methods on diverse high-quality datasets of adult and pediatric tumors: multi-consortium international pediatric brain tumor segmentation (PED), preoperative meningioma tumor segmentation (MEN), meningioma radiotherapy segmentation (MEN-RT), and segmentation of pre- and post-treatment brain metastases (MET). We present a flexible, modular, and adaptable pipeline that improves segmentation performance by selecting and combining state-of-the-art models and applying tumor- and lesion-specific processing before and after training. Radiomic features extracted from MRI help detect tumor subtype, ensuring more balanced training. Custom lesion-level performance metrics determine the influence of each model in the ensemble and optimize post-processing that further refines the predictions, enabling the workflow to tailor every step to each case. On the BraTS testing sets, our pipeline achieved performance comparable to top-ranked algorithms across multiple challenges. These findings confirm that custom lesion-aware processing and model selection yield robust segmentations without locking the method to a specific network architecture. Our method has the potential for quantitative tumor measurement in clinical practice, supporting diagnosis and prognosis.

Improving Pre-trained Adult Glioma Segmentation Models Using only Post-processing Techniques

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.14937v2 Announce Type: replace Abstract: Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, median survival is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pre-trained models developed for various types of tumors. We demonstrated the techniques in multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9% for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.

HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

Yi Zhou, Haohao Qu, Yunqing Liu, Shanru Lin, Le Song, Wenqi Fan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.15133v3 Announce Type: replace Abstract: Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a continuous-valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive results demonstrate that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks. Furthermore, our method can perform on par with state-of-the-art multimodal pLMs, despite being developed under limited computational resources (i.e., less than one-tenth the budget for modality extension fine-tuning). It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.15134v2 Announce Type: replace Abstract: A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks. The quality of features is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear to what extent common featurization methods such as sparse autoencoders (SAEs) and probes disentangle one concept from another. We propose a multi-concept evaluation setting using concepts including sentiment, domain, voice, and tense. We evaluate how well featurizers produce disentangled representations of each concept, observing that features are typically sensitive to only one concept, but also that concepts are distributed across many features. Then, we steer these features, measuring whether each concept is independently manipulable, and whether features interact. Even in idealized settings, steering a feature often affects many concepts, despite a near absence of interaction effects. These results suggest that correlational metrics are insufficient to establish steering selectivity, and that demonstrating that two features operate in separate spaces is insufficient to claim that they will be selective for one concept. These results underscore the importance of multi-concept evaluations in interpretability research.

Structured Visualization Design Knowledge for Grounding Generative Reasoning and Situated Feedback

Péter Ferenc Gyarmati, Dominik Moritz, Torsten Möller, Laura Koesten — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.20306v3 Announce Type: replace Abstract: Automated visualization design navigates a tension between symbolic systems and generative models. Constraint solvers enforce structural and perceptual validity, but the rules they require are difficult to author and too rigid to capture situated design knowledge. Large language models require no formal rules and can reason about contextual nuance, but they prioritize popular conventions over empirically grounded best practices. We address this tension by proposing a cataloging scheme that structures visualization design knowledge as natural-language guidelines with semantically typed metadata. This allows experts to author knowledge that machines can query. An expert study ($N=18$) indicates that practitioners routinely adapt heuristics to situational factors such as audience and communicative intent. To capture this reasoning, guideline sections specify not only advice but also the contexts where it applies, exceptions that invalidate it, and the sources from which it derives. We demonstrate the scheme's expressiveness by cataloging 744 guidelines drawn from cognitive science, accessibility standards, data journalism, and research on rhetorical aspects of visual communication. We embed guideline sections in a vector space, opening the knowledge itself to structural analysis. This reveals conflicting advice across sources and transferable principles between domains. Rather than replacing constraint-based tools, our scheme provides what they lack: situated guidance that generative systems can retrieve to ground their reasoning, users can verify against cited sources, and experts can author as knowledge evolves.

The State of the SBOM Tool Ecosystems: A Comparative Analysis of SPDX and CycloneDX

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.21781v2 Announce Type: replace Abstract: Software Bills of Materials (SBOMs) improve software release transparency by documenting components and dependencies, but their practical value depends on the tools that generate, analyze, and manage them. This paper compares the tool ecosystems of the two dominant SBOM formats: SPDX and CycloneDX. We analyze 108 open-source and 62 proprietary SBOM tools, compare ecosystem-level health metrics across 470 SPDX and 171 CycloneDX tools, examine 36,990 issue reports from open-source tools, and study the top 250 open-source projects using each format. Our results show that CycloneDX-using projects often exhibit stronger developer engagement and selected project health indicators, while SPDX benefits from a larger, more mature tool ecosystem and broader industry adoption. These findings highlight the complementary strengths of both ecosystems and identify opportunities for improving SBOM tooling across formats.

Men and Women Survivors in Science: A Comprehensive Analysis

Marek Kwiek, Lukasz Szymula — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.22140v2 Announce Type: replace Abstract: We followed scientists who started publishing in 2000 and who continued publishing until 2020-2023 (N = 41,424). These survivors in science authored 2 million articles (N = 2,089,097) with more than 70 million cited references (N = 73,118,395) and worked in 38 OECD countries. Using a raw Scopus dataset, we examined gender disparities in publishing intensity, international collaboration, journal selection, productivity, citations, team formation, and publishing breaks in 16 STEMM and social science disciplines. Several author-level metrics were computed. Our data show a gender productivity gap for both lifetime scholarly output and annual journal prestige-normalized productivity. Surprisingly, in the context of extant literature, the data do not show a gender international collaboration gap, a gender journal selection gap, a gender citation gap, or a gender team formation gap. Men were on average 23% more productive than women cumulatively in 2000-2023 and 19% more productive in the last 5 years studied (2019-2023). Men and women published in equally prestigious journals, received the same number of citations (field-normalized), and worked in equally sized teams. In all, 80% of scientists in STEMM disciplines and 70% in the social sciences had published every year. Our data indicate interesting disciplinary differences in gender disparities.

Cluster Aggregated GAN (CAG): A Cluster-Based Hybrid Model for Appliance Pattern Generation

Zikun Guo, Adeyinka P. Adedigba, Rammohan Mallipeddi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.22287v3 Announce Type: replace Abstract: Synthetic appliance data are essential for developing non-intrusive load monitoring algorithms and enabling privacy preserving energy research, yet the scarcity of labeled datasets remains a significant barrier. Recent GAN-based methods have demonstrated the feasibility of synthesizing load patterns, but most existing approaches treat all devices uniformly within a single model, neglecting the behavioral differences between intermittent and continuous appliances and resulting in unstable training and limited output fidelity. To address these limitations, we propose the Cluster Aggregated GAN framework, a hybrid generative approach that routes each appliance to a specialized branch based on its behavioral characteristics. For intermittent appliances, a clustering module groups similar activation patterns and allocates dedicated generators for each cluster, ensuring that both common and rare operational modes receive adequate modeling capacity. Continuous appliances follow a separate branch that employs an LSTM-based generator to capture gradual temporal evolution while maintaining training stability through sequence compression. Extensive experiments on the UVIC smart plug dataset demonstrate that the proposed framework consistently outperforms baseline methods across metrics measuring realism, diversity, and training stability, and that integrating clustering as an active generative component substantially improves both interpretability and scalability. These findings establish the proposed framework as an effective approach for synthetic load generation in non-intrusive load monitoring research.

HiGR: Industrial-Scale Hierarchical Generative Slate Recommendation Framework in Tencent

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.24787v5 Announce Type: replace Abstract: Slate recommendation, which presents users with a ranked item list in a single display, is ubiquitous across mainstream online platforms. While recent generative recommendation methods have shown strong potential in modeling item sequences with semantic IDs (SIDs), directly applying them to industrial-scale slate recommendation faces a fundamental disconnect: entangled SID spaces confound high-level list planning, fine-grained autoregressive decoding over long sequences limits semantic planning efficiency, and token-level objectives misalign with holistic slate quality. In this paper, we propose HiGR, an industrial-scale hierarchical generative framework for slate recommendation that bridges this disconnect through a co-designed pipeline. First, HiGR learns structured SIDs via a Prefix-Contrastive Residual Quantized VAE (PCRQ-VAE). By enforcing high-level prefixes to capture shared semantics, PCRQ-VAE creates a controllable discrete space that acts as a prerequisite for efficient planning. Leveraging this structured space, our Hierarchical Slate Decoder (HSD) shifts autoregressive modeling from entangled token-level decoding to coarse-grained preference embeddings. This design significantly reduces inference latency while allowing explicit global slate structure planning. Finally, this stable planning space enables an ORPO-based listwise alignment mechanism to optimize triple-objective implicit feedback: ranking fidelity, genuine user interest, and diversity. Extensive offline experiments show that HiGR outperforms state-of-the-art baselines by over 10% in offline recommendation quality while achieving a $5\times$ inference speedup. Online A/B tests on Tencent platforms further improve watch time by 1.22% and video plays by 1.73%. HiGR has been deployed on multiple Tencent platform surfaces, serving hundreds of millions of users and proving its industrial-scale applicability.

Geometric and Quantum Kernel Methods for Predicting Skeletal Muscle Outcomes in Chronic Obstructive Pulmonary Disease

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.00921v3 Announce Type: replace Abstract: Chronic obstructive pulmonary disease (COPD) affects hundreds of millions of people worldwide, and skeletal-muscle dysfunction is clinically important. Quantum machine learning is increasingly explored for biomedical prediction, but its value in small biomarker cohorts requires benchmarking against strong classical baselines. We analysed a cigarette-smoke COPD cohort of 213 animals with blood and bronchoalveolar-lavage biomarkers to predict tibialis anterior muscle weight, muscle quality, and force. We developed a kernel-geometric quantum hybrid method in which synthetic symmetric positive definite (SPD) references are mapped through a reproducing kernel Hilbert space, compressed using train-only random projection, normalised, and supplied to low-dimensional quantum regression circuits. We benchmarked this approach against classical ridge/kernel models, SPD relational representations, and quantum-kernel regression (QKR). All methods were evaluated using condition-stratified repeated cross-validation. The largest numerical improvement was observed for muscle weight, where the proposed method had the numerically lowest mean root mean squared error (RMSE), approximately 1.8% below the best classical comparator; paired fold-level testing did not establish statistically significant superiority after Holm adjustment, but the endpoint is biologically meaningful. The method also had the numerically lowest mean RMSE for muscle quality. For force, biomarker-only Ridge performed best, suggesting a more linear endpoint structure.

FedBiCross: Personalized One-Shot Federated Learning on Medical Images

Yuexuan Xia, Yinghao Zhang, Yalin Liu, Hong-Ning Dai, Yong Xia — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.01901v4 Announce Type: replace Abstract: Data-free knowledge distillation-based one-shot federated learning (OSFL) trains a model in a single communication round without sharing raw data, making OSFL attractive for privacy-sensitive medical applications. However, existing methods aggregate predictions from all clients to form a global teacher. Under non-IID data, conflicting predictions dilute each other during averaging, yielding less informative soft labels that weaken distillation. We propose FedBiCross, a personalized OSFL framework with three stages: (1) clustering clients by model output similarity to form coherent sub-ensembles, (2) bi-level cross-cluster optimization that learns adaptive weights to selectively leverage beneficial cross-cluster knowledge while suppressing negative transfer, and (3) personalized distillation for client-specific adaptation. Experiments on four medical image datasets demonstrate that FedBiCross consistently outperforms state-of-the-art baselines across different non-IID degrees.
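
The dilution effect described in the abstract can be illustrated with a toy example. The sketch below is not the FedBiCross method; it only shows, under assumed and made-up client predictions, that averaging two confident but conflicting softmax outputs produces a higher-entropy (less informative) soft label. The names `client_a`, `client_b`, and `teacher` are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def average(preds):
    """Uniformly average a list of class-probability vectors."""
    n = len(preds)
    return [sum(col) / n for col in zip(*preds)]


# Made-up predictions from two clients trained on different (non-IID) data.
client_a = [0.90, 0.05, 0.05]  # confident in class 0
client_b = [0.05, 0.90, 0.05]  # confident in class 1

# A global teacher formed by plain averaging (the baseline the abstract criticizes).
teacher = average([client_a, client_b])

print("teacher soft label:", teacher)
print("entropy client_a:", entropy(client_a))
print("entropy teacher: ", entropy(teacher))
```

The averaged label splits its mass between the two conflicting classes, so its entropy exceeds that of either client. Grouping clients with similar outputs before averaging, as the abstract's first stage does, is one way to avoid mixing such conflicting predictions.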

Why Commodity WiFi Sensors Fail at Multi-Person Gait Identification: A Systematic Analysis Using ESP32

Oliver Custance, Saad Khan, Simon Parkinson — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.02177v2 Announce Type: replace Abstract: WiFi Channel State Information (CSI) has shown promise for single-person gait identification, raising interest in its use for contactless biometrics, continuous authentication, and passive identification. However, the feasibility of multi-person identification on low-cost commodity devices remains unclear. A critical question is whether weak multi-person performance is primarily an algorithmic limitation, or whether it reflects a more fundamental sensing ceiling on commodity WiFi hardware. We address this question through a systematic empirical study using commodity ESP32 WiFi sensors. We evaluated six different signal separation methods--FastICA, SOBI, PCA-ICA, NMF, Wavelet, and Tensor decomposition--across seven scenarios spanning 1-10 people in both controlled and realistic indoor environments. To investigate beyond classification accuracy, we introduce three diagnostic metrics: intra-subject variability (ISV), inter-subject distinguishability (ISD), and performance degradation rate (PDR). In all methods, performance remains moderate (39%-56% accuracy), with limited evidence that algorithmic choice alone solves the problem. The best-performing method, NMF, reaches 56% accuracy, while all methods exhibit extremely high feature-space overlap (97%-99%), unstable within-subject representations, and marked environmental sensitivity. These findings suggest that, under commodity ESP32 CSI constraints, dense multi-person gait identification is limited more by sensing quality and spatial diversity than by the chosen separation algorithm. Our results have direct implications for security and privacy: they call into question the practicality of commodity WiFi CSI as a robust multi-user biometric primitive for authentication, while also placing important bounds on the passive identification capabilities achievable with low-cost off-the-shelf WiFi hardware.

Decentralized Autoregressive Generation

Stepan Maschan, Haoxuan Qu, Jun Liu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.03184v3 Announce Type: replace Abstract: The decentralization of autoregressive generation has attracted considerable attention in recent years as a solution to scaling bottlenecks. However, despite promising empirical results, this paradigm currently lacks rigorous theoretical justification. In this work, we formally establish the theoretical equivalence between decentralized and centralized training. To achieve this, we adapt the Discrete Flow Matching framework for autoregressive generation, leveraging its inherent properties to demonstrate that global models naturally decompose into independent experts. Finally, we conduct extensive experiments across diverse multimodal benchmarks, empirically validating that decentralized training maintains competitive parity with standard centralized architectures.

CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Shu Su, Yuheng Jia — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.04885v2 Announce Type: replace Abstract: As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose CuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic-aware routing, CuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that CuMA achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that CuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.

When Smaller Wins: Dual-Stage Distillation and Pareto-Guided Compression of Liquid Neural Networks for Edge Battery Prognostics

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.06227v3 Announce Type: replace Abstract: Battery management systems increasingly require accurate battery health prognostics under strict on-device constraints. This paper presents DLNet, a practical framework with dual-stage distillation of liquid neural networks that turns a high-capacity model into compact and edge-deployable models for battery health prediction. DLNet first applies Euler discretization to reformulate liquid dynamics for embedded compatibility. It then performs dual-stage knowledge distillation to transfer the teacher model's temporal behavior and recover it after further compression. Pareto-guided selection under joint error-cost objectives retains student models that balance accuracy and efficiency. We evaluate DLNet on a widely used dataset and validate real-device feasibility on an Arduino Nano 33 BLE Sense using int8 deployment. The final deployed student achieves a low error of 0.0066 when predicting battery health over the next 100 cycles, which is 15.4% lower than the teacher model. It reduces the model size from 616 kB to 94 kB, an 84.7% reduction, and takes 21 ms per inference on the device. These results support a practical "smaller wins" observation: a small model can match or exceed a large teacher for edge-based prognostics with proper supervision and selection. Beyond batteries, the DLNet framework can extend to other industrial analytics tasks with strict hardware constraints.
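
As a rough illustration of selection under joint error-cost objectives, the sketch below filters a set of candidate models down to its Pareto front, where lower error and lower size are both better. This is a generic Pareto filter, not the paper's procedure; the candidate names and their error and size values are invented for illustration and are not results from the paper. The helper `size_reduction_pct` only re-derives the size reduction from the two sizes quoted in the abstract.

```python
def pareto_front(candidates):
    """Keep candidates not dominated in (error, size_kb); lower is better for both."""
    front = []
    for c in candidates:
        dominated = any(
            o["error"] <= c["error"]
            and o["size_kb"] <= c["size_kb"]
            and (o["error"] < c["error"] or o["size_kb"] < c["size_kb"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front


def size_reduction_pct(before_kb, after_kb):
    """Relative size reduction in percent."""
    return 100.0 * (before_kb - after_kb) / before_kb


# Hypothetical student models (illustrative numbers only).
students = [
    {"name": "a", "error": 0.010, "size_kb": 50},
    {"name": "b", "error": 0.008, "size_kb": 90},
    {"name": "c", "error": 0.012, "size_kb": 120},  # worse than "a" on both axes
    {"name": "d", "error": 0.007, "size_kb": 300},
]

front_names = [s["name"] for s in pareto_front(students)]
print(front_names)
print(round(size_reduction_pct(616, 94), 1))  # sizes quoted in the abstract
```

A final deployment choice would still need a rule for picking one point on the front, for example a size budget imposed by the target device.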

EyeTheia: A Lightweight and Accessible Eye-Tracking Toolbox

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.06279v2 Announce Type: replace Abstract: We introduce EyeTheia, a lightweight and open deep learning pipeline for webcam-based gaze estimation, designed for browser-based experimental platforms and real-world cognitive and clinical research. EyeTheia enables real-time gaze tracking using only a standard laptop webcam, combining MediaPipe-based landmark extraction with a convolutional neural network inspired by iTracker and optional user-specific fine-tuning. We investigate two complementary strategies: adapting a model pretrained on mobile data and training the same architecture from scratch on a desktop-oriented dataset. Validation results on MPIIFaceGaze show comparable performance between both approaches prior to calibration, while lightweight user-specific fine-tuning consistently reduces gaze prediction error. We further evaluate EyeTheia in a realistic Dot-Probe task and compare it to the commercial webcam-based tracker SeeSo SDK. Results indicate strong agreement in left-right gaze allocation during stimulus presentation, despite higher temporal variability. Overall, EyeTheia provides a transparent and extensible solution for low-cost gaze tracking, suitable for scalable and reproducible experimental and clinical studies. The code, trained models, and experimental materials are publicly available.

Hellinger Multimodal Variational Autoencoders

Huyen Vo, Isabel Valera — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.06572v4 Announce Type: replace Abstract: Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha=0.5$, which corresponds to the unique symmetric member of the $\alpha$-divergence family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
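
For intuition, the sketch below applies Hölder (power-mean) pooling to discrete distributions, assuming the commonly used form in which the pooled distribution is proportional to $(\sum_i w_i\, p_i^{\alpha})^{1/\alpha}$. With $\alpha = 0.5$ this is the normalized square of the weighted average of square-root probabilities. It does not implement the paper's moment-matching approximation or HELVAE; the function name `holder_pool` and the example distributions are hypothetical.

```python
def holder_pool(dists, weights, alpha=0.5):
    """Pool discrete distributions with a weighted power mean, then renormalize.

    dists:   list of probability vectors over the same support
    weights: nonnegative weights summing to 1 (assumed, not checked)
    """
    support = range(len(dists[0]))
    unnorm = [
        sum(w * p[k] ** alpha for w, p in zip(weights, dists)) ** (1.0 / alpha)
        for k in support
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Two hypothetical unimodal posteriors over three latent states.
p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.2, 0.7]

pooled = holder_pool([p1, p2], [0.5, 0.5])
same = holder_pool([p1, p1], [0.5, 0.5])  # pooling identical experts returns the expert

print(pooled)
print(same)
```

Unlike a product of experts, this pooled distribution keeps mass wherever any expert does; unlike a plain mixture, the result is obtained without sampling a single expert.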

The Issue with Special Issues: when Guest Editors Publish in Support of Self

Paolo Crosetto, Pablo Gómez Barreiro, Mark Austin Hanson — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.07563v2 Announce Type: replace Abstract: The recent exceptional growth in special issues has led to the largest delegation of editorial power in the history of scientific publishing. Has this power been used responsibly? We provide the first systematic analysis of endogeny, the practice of publishing articles in one's own special issue. While moderate levels of endogeny are common, excessive endogeny constitutes scientific misconduct, as it stems from a clear conflict of interest. We define special issues containing more than 33% endogeny as SI-hacked. We build a dataset of over 100,000 special issues published in 2015-2025 by five leading publishers. The large majority of guest editors engage in endogeny responsibly, if at all. Nonetheless, despite endogeny policies by publishers and indexers, SI-hacking is endemic. All journals heavily relying on special issues host SI-hacking; more than 1,000 hacked SIs are published each year, hosting tens of thousands of endogenous articles. Egregious SI-hacking is rare: editors exceed endogeny thresholds mostly to the extent that publishers allow them to. This is not good news, as it reflects a widespread normalisation of guest editor conflicts of interest. Fortunately, SI-hacking can be solved by enforcing existing common-sense policies. We provide data and analyses needed for indexers and regulators to act.
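
The 33% threshold can be written as a simple check. The sketch below assumes endogeny is measured as the share of a special issue's articles that are authored by its own guest editors; how the paper counts articles and attributes authorship is not stated in the abstract, so the function names and inputs here are hypothetical.

```python
def endogeny_share(n_endogenous, n_total):
    """Fraction of a special issue's articles authored by its guest editors."""
    if n_total <= 0:
        raise ValueError("a special issue must contain at least one article")
    return n_endogenous / n_total


def is_si_hacked(n_endogenous, n_total, threshold=0.33):
    """Flag a special issue whose endogeny share is strictly above the threshold."""
    return endogeny_share(n_endogenous, n_total) > threshold


print(is_si_hacked(4, 10))  # 40% endogeny
print(is_si_hacked(3, 10))  # 30% endogeny
```

The comparison is strict because the abstract defines SI-hacked issues as containing "more than 33%" endogeny.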

Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.09693v3 Announce Type: replace Abstract: Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for predefined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves competitive zero-shot virtual screening performance, substantially outperforms existing methods on a challenging target fishing task, and demonstrates state-of-the-art ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.

Disturbance Attenuation Regulator II: Stage Bound Finite Horizon Solution

Davide Mannini, James B. Rawlings — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.10869v2 Announce Type: replace Abstract: This paper develops a generalized finite horizon recursive solution to the discrete time stage bound disturbance attenuation regulator (StDAR) for state feedback control. This problem addresses linear dynamical systems subject to stage bound disturbances, i.e., disturbance sequences constrained independently at each time step through stagewise squared two-norm bounds. The term generalized indicates that the results accommodate arbitrary initial states. By combining game theory and dynamic programming, this work derives a recursive solution for the optimal state feedback policy. The optimal policy is nonlinear in the state and requires solving a tractable convex optimization for the Lagrange multiplier vector at each stage; the control is then explicit. For systems with constant stage bound, the problem admits a steady-state optimization expressed as a tractable linear matrix inequality (LMI) whose empirical computational cost is approximately cubic in $n$. Numerical examples illustrate the properties of the solution. This work provides a complete feedback solution to the StDAR for arbitrary initial states. Companion papers address the signal bound disturbance attenuation regulator (SiDAR): the finite horizon solution in Part I-A and convergence properties in Part I-B.

NOVA: NOise-aware Verbal Confidence CAlibration for Robust Large Language Models in RAG Systems

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.11004v3 Announce Type: replace Abstract: Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance especially when noisy contexts are retrieved. Specifically, contradictory or irrelevant evidence tends to exacerbate the model's overconfidence issue. To address this, we propose NOVA Rules (NOise-Aware Verbal Confidence CAlibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NOVA, a noise-aware calibration framework that synthesizes supervision from ~2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NOVA equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NOVA yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NOVA paves the way for both accurate and epistemically reliable LLMs.
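
For readers unfamiliar with the metric, the sketch below computes expected calibration error (ECE) in its standard equal-width-bin form: the weighted average, over confidence bins, of the gap between accuracy and mean stated confidence. The binning scheme, the number of bins, and the example inputs are assumptions; the abstract does not say which ECE variant the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences: stated confidence per answer, each in [0, 1]
    correct:     1 if the answer was right, 0 otherwise
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # put conf == 1.0 in the last bin
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical overconfident model: says 0.9 four times, is right three times.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
# Hypothetical calibrated case: fully confident and always right.
calibrated = expected_calibration_error([1.0, 1.0], [1, 1])
print(overconfident, calibrated)
```

Lower ECE means stated confidence tracks accuracy more closely, which is the sense in which the abstract reports improvements.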

Asymptotically Optimal Tests for One- and Two-Sample Problems

Arick Grootveld, Biao Chen, Venkata Gandikota — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.11727v3 Announce Type: replace Abstract: In this work, we revisit the one- and two-sample testing problems: binary hypothesis testing in which one or both distributions are unknown. For the one-sample test, we provide a more streamlined proof of the asymptotic optimality of Hoeffding's likelihood ratio test, which is equivalent to the threshold test of the relative entropy between the empirical distribution and the nominal distribution. The new proof offers an intuitive interpretation and naturally extends to the two-sample test, where we show that a similar form of Hoeffding's test, namely a threshold test of the relative entropy between the two empirical distributions, is also asymptotically optimal. A strong converse for the two-sample test is also obtained.
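
The two threshold tests described in the abstract can be sketched for a finite alphabet as follows. This is an illustrative implementation only: the threshold value, the direction of the relative entropy in the two-sample case, and the handling of zero counts (returning infinity) are assumptions not specified in the abstract, and all function names are hypothetical.

```python
import math
from collections import Counter


def empirical(samples, alphabet):
    """Empirical distribution (type) of a sample over a finite alphabet."""
    counts = Counter(samples)
    n = len(samples)
    return {a: counts[a] / n for a in alphabet}


def relative_entropy(p, q):
    """D(p || q) in nats; infinite if p puts mass where q has none."""
    total = 0.0
    for a, pa in p.items():
        if pa == 0:
            continue
        if q[a] == 0:
            return math.inf
        total += pa * math.log(pa / q[a])
    return total


def one_sample_reject(samples, nominal, threshold):
    """Reject the nominal distribution when D(empirical || nominal) > threshold."""
    return relative_entropy(empirical(samples, nominal.keys()), nominal) > threshold


def two_sample_reject(x, y, alphabet, threshold):
    """Declare the samples different when D(emp_x || emp_y) > threshold."""
    return relative_entropy(empirical(x, alphabet), empirical(y, alphabet)) > threshold


nominal = {"a": 0.5, "b": 0.5}
print(one_sample_reject("aabb", nominal, threshold=0.1))
print(one_sample_reject("aaaa", nominal, threshold=0.1))
print(two_sample_reject("aabb", "abab", "ab", threshold=0.1))
```

In practice the threshold governs the trade-off between the two error probabilities; the value 0.1 above is arbitrary.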

AfroScope: A Framework for Studying the Linguistic Landscape of Africa

Sang Yun Kwon, AbdelRahim Elmadany, Muhammad Abdul-Mageed — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.13346v3 Announce Type: replace Abstract: Language Identification (LID), the task of determining the language of a given text, is a fundamental preprocessing step that shapes the reliability of downstream NLP applications. While recent work has expanded African LID, existing systems remain limited in both language coverage and fine-grained discrimination among closely related languages and varieties. We introduce AfroScope, a unified framework for African LID that includes AfroScope-Data, a dataset covering 640 languages, and AfroScope-Models, a suite of strong LID models with broad African language coverage. To address persistent confusions among closely related languages, we propose a hierarchical classification approach that leverages AfroScope-Mirror, a specialized embedding model for targeted disambiguation, improving macro-F1 by 1.57 points on the confusable subset compared to our best base model. We further analyze cross-lingual transfer and domain effects, showing how language-family structure, script compatibility, and domain coverage shape LID performance. We position African LID as an enabling technology for large-scale measurement of Africa's linguistic landscape in digital text, and release AfroScope-Data and AfroScope-Models online.

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.13591v2 Announce Type: replace Abstract: Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., image and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities, including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 13 recent advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, MiMo-V2-Pro and GPT-5.2 lead in duration and step efficiency, respectively, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data analysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions.

Multitrace Müller Boundary Integral Equation for Electromagnetic Scattering by Composite Objects

Van Chien Le, Kristof Cools — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.13823v2 Announce Type: replace Abstract: This paper introduces a boundary integral equation for time-harmonic electromagnetic scattering by composite dielectric objects. The formulation extends the classical Müller equation to composite structures through the global multitrace method. The key ingredient enabling this extension is the use of the Stratton-Chu representation in the complementary region, also known as the extinction property, which augments the off-diagonal blocks of the interior representation operator. The resulting block system is composed entirely of second-kind operators. A Petrov-Galerkin (mixed) discretization using Rao-Wilton-Glisson trial functions and Buffa-Christiansen test functions is employed, yielding linear systems that remain well conditioned on dense meshes and at low frequencies without the need for additional stabilization. This reduces computational costs associated with matrix-vector multiplications and iterative solving. Numerical experiments demonstrate the accuracy of the method in computing field traces and derived quantities.

Epistemic Constitutionalism Or: how to avoid coherence bias

Michele Loi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.14295v4 Announce Type: replace Abstract: Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.

Data-driven Lake Water Quality Forecasting for Time Series with Missing Data using Machine Learning

Rishit Chatterjee, Tahiya Chowdhury — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.15503v2 Announce Type: replace Abstract: Volunteer-led lake monitoring yields irregular, seasonal time series with many gaps arising from ice cover, weather-related access constraints, and occasional human errors, complicating forecasting and early warning of harmful algal blooms. We study Secchi Disk Depth (SDD) forecasting on a 30-lake, data-rich subset drawn from three decades of in-situ records collected across Maine lakes. Missingness is handled via Multiple Imputation by Chained Equations (MICE), and we evaluate performance with a normalized Mean Absolute Error (nMAE) metric for cross-lake comparability. Among six candidates, ridge regression provides the best mean test performance. Using ridge regression, we then quantify the minimal sample size, showing that under a backward, recent-history protocol, the model reaches within 5% of full-history accuracy with approximately 176 training samples per lake on average. We also identify a minimal feature set, where a compact four-feature subset matches the thirteen-feature baseline within the same 5% tolerance. Bringing these results together, we introduce a joint feasibility function that identifies the minimal training history and fewest predictors sufficient to achieve the target of staying within 5% of the complete-history, full-feature baseline. In our study, meeting the 5% accuracy target required about 64 recent samples and just one predictor per lake, highlighting the practicality of targeted monitoring. Hence, our joint feasibility strategy unifies recent-history length and feature choice under a fixed accuracy target, yielding a simple, efficient rule for setting sampling effort and measurement priorities for lake researchers.

Kareus: Joint Reduction of Dynamic and Static Energy in Large Model Training

Ruofan Wu, Jae-Won Chung, Mosharaf Chowdhury — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.17654v2 Announce Type: replace Abstract: The computing demand of AI is growing at an unprecedented rate, but energy supply is not keeping pace. As a result, energy has become an expensive and contended resource that requires explicit management and optimization. Although recent works have made significant progress in large model training optimization, they focus on optimizing either dynamic or static energy consumption. We find that fine-grained kernel scheduling and frequency scaling jointly and interdependently impact both dynamic and static energy consumption. Based on this finding, we design Kareus, a training system that pushes the time-energy tradeoff frontier by optimizing both aspects. Kareus decomposes the intractable joint optimization problem into local, partition-based subproblems. It then uses a multi-pass multi-objective optimization algorithm to find execution schedules that push the time-energy tradeoff frontier. Compared to the state of the art, Kareus reduces training energy by up to 28.3% at the same training time, or reduces training time by up to 27.5% at the same energy consumption.

Beyond Speedups: Hardware-Aware Evaluation of Evolutionary Algorithms on GPUs

Xinmeng Yu, Tao Jiang, Ran Cheng, Yaochu Jin, Kay Chen Tan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.18446v3 Announce Type: replace Abstract: Evolutionary algorithms (EAs) are increasingly executed on graphics processing units (GPUs) to exploit population-level parallelism. This shift changes the resource model under which EAs are designed and evaluated. However, many GPU-based EA studies still focus mainly on implementation-level speedup after porting CPU-oriented algorithms to GPUs, providing limited insight into how algorithmic mechanisms, function-evaluation (FE) budgets, population scales, and hardware utilization jointly affect optimization behavior. In response, this paper goes beyond speedup measurement and studies the scaling behavior of EAs on GPUs from a hardware-aware evaluation perspective. We evaluate 16 representative EAs on 30 benchmark problems across CPU and GPU platforms, covering single-objective optimization, multi-objective optimization, numerical benchmarks, and neuroevolution tasks. The study leads to four findings. First, GPU acceleration is highly heterogeneous across algorithms because different evolutionary mechanisms expose different degrees of batched computation, memory regularity, and synchronization. Second, FE-budgeted evaluation remains useful for measuring sample efficiency, but it provides only a limited observation window under GPU execution; time-budgeted evaluation is therefore necessary for assessing practical time-to-solution and long-horizon search behavior. Third, GPU effectiveness depends on scaling regimes induced by problem dimension and population size, where parallelism may be underutilized, effective, or saturated. Fourth, GPU execution makes very large populations practically affordable, and several evolutionary mechanisms can convert this increased population scale into improved optimization performance. These results indicate that GPU parallelism should not be treated only as a post hoc acceleration tool, but as part of the evaluation and design assumptions of scalable EAs.

HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.19072v3 Announce Type: replace Abstract: Large language models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer from hallucinations, where the generated review comments are ungrounded in the actual code. This poses a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable methods for hallucination detection in LLM-generated code review comments without a reference. In this work, we design HalluJudge, which aims to assess the grounding of generated review comments based on context alignment. HalluJudge includes four key strategies ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian's enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge's judgments and developer preferences for the actual LLM-generated code review comments in real-world production. Our results show that the hallucination assessment in HalluJudge is cost-effective, with an F1 score of 0.85 and an average cost of $0.009. On average, 67% of the HalluJudge assessments are aligned with developer preferences for the actual LLM-generated review comments in online production. Our results suggest that HalluJudge can serve as a practical safeguard to reduce developers' exposure to hallucinated comments, fostering trust in AI-assisted code reviews.

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.19827v4 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.

From Digital to Physical: Digital Agents as Autonomous Coaches for Physical Intelligence

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.21570v2 Announce Type: replace Abstract: The field of Embodied AI is witnessing a rapid evolution toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains severely bottlenecked by a reliance on labor-intensive manual oversight, from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by LLMs' success in software automation and scientific discovery, we introduce EmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies. Spanning 32 expert-curated RL and IL tasks, our framework posits executable code as the universal interface. We move beyond static generation to assess a dynamic closed-loop workflow, where agents leverage environment feedback to iteratively draft, debug, and optimize solutions, spanning improvements from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can qualitatively surpass human-engineered baselines by 26.5% in average success rate; (2) an agentic workflow with environment feedback effectively strengthens policy development and substantially narrows the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities for pathological engineering cases, successfully resurrecting task performance from near-total failures through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the paradigm shift from labor-intensive manual tuning to scalable, autonomous engineering in the embodied AI field.

ReactEMG Stroke: Healthy-to-Stroke Few-shot Adaptation for sEMG-Based Intent Detection

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.22090v2 Announce Type: replace Abstract: Surface electromyography (sEMG) is a promising control signal for assist-as-needed hand rehabilitation after stroke, but detecting intent from paretic muscles often requires lengthy, subject-specific calibration and remains brittle to variability. We propose a healthy-to-stroke adaptation pipeline that initializes an intent detector from a model pretrained on large-scale able-bodied sEMG, then fine-tunes it for each stroke participant using only a small amount of subject-specific data. Using a newly collected dataset from three individuals with chronic stroke, we compare adaptation strategies (head-only tuning, parameter-efficient LoRA adapters, and full end-to-end fine-tuning) and evaluate on held-out test sets that include realistic distribution shifts such as within-session drift, posture changes, and armband repositioning. Across conditions, healthy-pretrained adaptation consistently improves stroke intent detection relative to both zero-shot transfer and stroke-only training under the same data budget; the best adaptation methods improve average transition accuracy from 0.42 to 0.61 and raw accuracy from 0.69 to 0.78. These results suggest that transferring a reusable healthy-domain EMG representation can reduce calibration burden while improving robustness for real-time post-stroke intent detection. Our project website, video, code, and dataset are available at: https://roamlab.github.io/reactemg-stroke/.

Language Model Circuits Are Sparse in the Neuron Basis

Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.22594v2 Announce Type: replace Abstract: The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques which decompose the neuron basis into more interpretable units of model computation, such as sparse autoencoders (SAEs). However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end gradient-based attribution pipeline for circuit tracing on the MLP neuron basis, which surfaces causally effective neurons on a variety of tasks. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city-state-capital task from (Lindsey et al., 2025), we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. mapping a city to its state), and can be steered to change the model's output. This work thus advances automated interpretability of language models without imposing additional training costs.

VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.00122v3 Announce Type: replace Abstract: In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research direction remains dense visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human annotated and evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a high quality dataset of 942 instruction based image editing samples, whose seed images encompass dense Chinese and English text documents including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a novel evaluation framework that systematically quantifies editing performance at the OCR parsing level, thereby enabling fine grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative image editing models. Human verification demonstrates a high degree of consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating the performance of image editing models on bilingual dense text visual documents.

Semantic-Aware Command and Control Transmission for Multi-UAVs

Boya Li, Xiaonan Liu, Dongzhu Liu, Dusit Niyato, Zhu Han — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.00142v2 Announce Type: replace Abstract: Uncrewed aerial vehicles (UAVs) have played an important role in the low-altitude economy and have been used in various applications. However, with the increasing number of UAVs and explosive wireless data, the existing bit-oriented communication network has approached the Shannon capacity, which cannot satisfy the quality of service (QoS) with ultra-reliable low-latency communication (URLLC) requirements for command and control (C&C) transmission in bit-oriented UAV communication networks. To address this issue, we propose a novel semantic-aware C&C transmission for multi-UAVs under limited wireless resources. Specifically, we leverage semantic similarity to measure the variation in C&C messages for each UAV over continuous transmission time intervals (TTIs) and capture the correlation of C&C messages among UAVs, enabling multicast transmission. Based on the semantic similarity and the importance of UAV commands, we design a trigger function to quantify the QoS of UAVs. Then, to maximize the long-term QoS and exploit multicast opportunities of C&C messages induced by semantic similarity, we develop a proximal policy optimization (PPO) algorithm to jointly determine the transmission mode (unicast/multicast/idle) and the allocation of limited resource blocks (RBs) between a base station (BS) and UAVs. Experimental results show that our proposed semantic-aware framework significantly increases transmission efficiency and improves effectiveness compared with bit-oriented UAV transmission.

Standardized Methods and Recommendations for Green Federated Learning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.00343v2 Announce Type: replace Abstract: Federated learning (FL) enables collaborative model training over privacy-sensitive, distributed data, but its environmental impact is difficult to compare across studies due to inconsistent measurement boundaries and heterogeneous reporting. We present a practical carbon-accounting methodology for FL CO2e tracking using NVIDIA NVFlare and CodeCarbon for explicit, phase-aware tasks (initialization, per-round training, evaluation, and idle/coordination). To capture non-compute effects, we additionally estimate communication emissions from transmitted model-update sizes under a network-configurable energy model. We validate the proposed approach on two representative workloads: CIFAR-10 image classification and retinal optic disk segmentation. In CIFAR-10, controlled client-efficiency scenarios show that system-level slowdowns and coordination effects can contribute meaningfully to carbon footprint under an otherwise fixed FL protocol, increasing total CO2e by 8.34x (medium) and 21.73x (low) relative to the high-efficiency baseline. In retinal segmentation, swapping GPU tiers (H100 vs. V100) yields a consistent 1.7x runtime gap (290 vs. 503 minutes) while producing non-uniform changes in total energy and CO2e across sites, underscoring the need for per-site and per-round reporting. Overall, our results support a standardized carbon accounting method that acts as a prerequisite for reproducible 'green' FL evaluation. Our code is available at https://github.com/Pediatric-Accelerated-Intelligence-Lab/carbon_footprint.
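
The abstract mentions estimating communication emissions from transmitted model-update sizes under a configurable energy model. One simple way such an estimate could be structured is sketched below. The function name, the linear energy-per-gigabyte model, and every parameter value are assumptions made for illustration; they are not taken from the paper, NVFlare, or CodeCarbon.

```python
def communication_co2e_grams(
    update_bytes: int,
    rounds: int,
    clients: int,
    kwh_per_gb: float,
    grid_g_per_kwh: float,
    directions: int = 2,
) -> float:
    """Estimate communication CO2e in grams under a linear energy model.

    Assumes every client sends and receives one update of `update_bytes`
    per round (directions=2), that network energy scales linearly with
    traffic (kwh_per_gb), and a single grid carbon intensity.
    """
    total_gb = update_bytes * rounds * clients * directions / 1e9
    energy_kwh = total_gb * kwh_per_gb
    return energy_kwh * grid_g_per_kwh


# Hypothetical configuration: 50 MB updates, 10 rounds, 4 clients.
estimate = communication_co2e_grams(
    update_bytes=50_000_000, rounds=10, clients=4,
    kwh_per_gb=0.05, grid_g_per_kwh=400.0,
)
print(f"{estimate:.1f} g CO2e")

# The runtime gap quoted in the abstract can be checked directly.
runtime_ratio = 503 / 290
print(f"{runtime_ratio:.2f}x")
```

Keeping the network parameters as explicit arguments mirrors the "network-configurable" wording: the same traffic volume yields different totals under different assumed network and grid conditions.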

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.00462v4 Announce Type: replace Abstract: Transforming a large language model (LLM) into a vision-language model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens encodes a large text corpus and stores contextualized token representations for each token in that corpus. Visual token representations are then compared to these contextualized representations and the top-nearest neighbor representations serve as descriptions of the visual token. We evaluate this method on 15 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations and open up new directions for analyzing the latent representations of LLMs.
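
The lookup this abstract describes, comparing a visual token representation against stored contextualized text-token representations and reading off the nearest neighbors, can be sketched in a few lines. The sketch assumes cosine similarity and uses tiny made-up 3-dimensional vectors; the `bank` entries and the function names are hypothetical and not from LatentLens itself.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_descriptions(query, bank, k=2):
    """Return the k stored contexts whose vectors are closest to `query`."""
    ranked = sorted(bank, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical bank of (token-in-context, contextualized vector) pairs.
bank = [
    ("a [dog] barked", [0.9, 0.1, 0.0]),
    ("the [cat] slept", [0.1, 0.9, 0.0]),
    ("walking the [puppy]", [0.8, 0.2, 0.1]),
]
visual_token = [1.0, 0.0, 0.0]  # made-up visual token representation
print(top_k_descriptions(visual_token, bank))
```

Because each stored vector keeps its surrounding context, the returned neighbors are phrases, not isolated vocabulary items, which matches the abstract's claim of finer-grained descriptions than individual tokens.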

LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States

Yeqin Zhang, Yunfei Wang, Jiaxuan Chen, Ke Qin, Yizheng Zhao, Cam-Tu Nguyen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.01572v2 Announce Type: replace Abstract: Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings and even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.
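
As a rough illustration of pooling value vectors "across multiple layers and token indices", the sketch below averages a selected set of per-layer, per-token value vectors into one sentence vector. Mean pooling, the nested-list layout `values[layer][token]`, and the function name are assumptions for illustration; the abstract does not specify the pooling operator or which layers and tokens are selected.

```python
def value_aggregation(values, layers, token_indices):
    """Mean-pool value vectors over the chosen layers and token positions.

    `values[layer][token]` is assumed to hold one attention value vector
    (a list of floats) per layer and token position.
    """
    picked = [values[layer][tok] for layer in layers for tok in token_indices]
    if not picked:
        raise ValueError("no layers or tokens selected")
    dim = len(picked[0])
    return [sum(vec[d] for vec in picked) / len(picked) for d in range(dim)]


# Made-up values: 2 layers, 2 tokens, 2-dimensional value vectors.
values = [
    [[1.0, 0.0], [3.0, 0.0]],   # layer 0
    [[0.0, 2.0], [0.0, 4.0]],   # layer 1
]
print(value_aggregation(values, layers=[0, 1], token_indices=[0, 1]))
```

In a real model the value vectors would come from the attention modules' value projections; extracting them depends on the model implementation and is outside this sketch.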

Low-order CR--RT equilibrated-flux certification for semilinear problems on anisotropic meshes

Hiroki Ishizaka — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.01636v2 Announce Type: replace Abstract: We develop a low-order Crouzeix--Raviart--Raviart--Thomas (CR--RT) equilibrated-flux certification workflow for finite element approximations of semilinear diffusion--reaction problems, with particular emphasis on anisotropic mesh settings. Given a computed conforming finite element state $\tilde u_h$, the certification process is reduced to three computable quantities required by a Newton--Kantorovich argument: a dual-norm residual bound, a stability constant for the Fréchet derivative, and a Lipschitz bound for the derivative in a neighborhood of $\tilde u_h$. These components yield an explicit radius $\rho>0$, ensuring that the exact solution exists locally and uniquely within the ball $B(\tilde u_h,\rho)\subset V$. The residual bound is obtained from an $H(\mathrm{div})$-conforming $\mathbb{RT}^0$ certificate flux reconstructed through a Marini-type CR--RT route. The purpose of this route is not to replace general higher-order or local mixed equilibrated reconstructions, but to provide an explicit low-order construction whose algebraic structure is transparent on anisotropic simplicial meshes. Within the certified neighborhood, we further enclose selected quantities of interest $\mathcal J(u)$; the baseline enclosure follows from the verified inclusion, while an adjoint-based correction sharpens the resulting intervals. The numerical experiments report the behavior of the computable certification quantities for monotone semilinear models, including anisotropic mesh tests. Unless interval or outward-rounded scalar post-processing is explicitly used, the reported computations should be understood as floating-point evaluations of the derived rigorous estimators.

Extending the Law of Intersegmental Coordination: Implications for Powered Prosthetic Controls

Elad Siman Tov, Nili E. Krausz — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.02181v2 Announce Type: replace Abstract: Powered prostheses are capable of providing net positive work to amputees and have advanced in the past two decades. However, reducing amputee metabolic cost of walking remains an open problem. The Law of Intersegmental Coordination (ISC) has been observed across gaits and previously implicated in energy expenditure of walking, yet it has rarely been analyzed or applied within the context of lower-limb amputee gait. This law states that the elevation angles of the thigh, shank and foot over the gait cycle covary. In this work, we developed a method to analyze intersegmental coordination for lower-limb 3D kinematic data, to simplify ISC analysis. Moreover, inspired by motor control, biomechanics and robotics literature, we used our method to extend ISC to a new law of coordination of moments. We find these Elevation Space Moments (ESM), and present results showing a moment-based coordination for able-bodied gait. We also analyzed ISC for amputee gait with powered and passive prostheses, and found that while elevation angles remained planar, the ESM lacked planar coordination. We present an ISC-driven powered prosthetic control framework, using healthy coordination as a constraint to predict the shank angles/moments to compensate for alterations due to a passive foot. We developed the ISC3d toolbox that is freely available online, which may be used to compute kinematic and kinetic ISC in 3D. This provides a means to further study the role of coordination in gait and may help address fundamental questions of the neural control of human movement.

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.04208v2 Announce Type: replace Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory, requiring no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty, while focusing on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.

Generalized Schrödinger Bridge on Graphs

Panagiotis Theodoropoulos, Juno Nam, Evangelos Theodorou, Jaemoo Choi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.04675v2 Announce Type: replace Abstract: Transportation on graphs is a fundamental challenge across many domains, where decisions must respect topological and operational constraints. Despite the need for actionable policies, existing graph-transport methods lack this expressivity. They rely on restrictive assumptions, fail to generalize across sparse topologies, and scale poorly with graph size and time horizon. To address these issues, we introduce Generalized Schrödinger Bridge on Graphs (GSBoG), a novel scalable data-driven framework for learning executable controlled continuous-time Markov chain (CTMC) policies on arbitrary graphs under state cost augmented dynamics. Notably, GSBoG learns trajectory-level policies, avoiding dense global solvers and thereby enhancing scalability. This is achieved via a likelihood optimization approach, satisfying the endpoint marginals, while simultaneously optimizing intermediate behavior under state-dependent running costs. Extensive experimentation on challenging real-world graph topologies shows that GSBoG reliably learns accurate, topology-respecting policies while optimizing application-specific intermediate state costs, highlighting its broad applicability and paving new avenues for cost-aware dynamical transport on general graphs.

Trojan Attacks on Neural Network Controllers for Robotic Systems

Farbod Younesi, Walter Lucia, Amr Youssef — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.05121v2 Announce Type: replace Abstract: Neural network controllers are increasingly deployed in robotic systems for tasks such as trajectory tracking and pose stabilization. However, their reliance on potentially untrusted training pipelines or supply chains introduces significant security vulnerabilities. This paper investigates backdoor (Trojan) attacks against neural controllers, using a differential-drive mobile robot platform as a case study. In particular, assuming that the robot's tracking controller is implemented as a neural network, we design a lightweight, parallel Trojan network that can be embedded within the controller. This malicious module remains dormant during normal operation but, upon detecting a highly specific trigger condition defined by the robot's pose and goal parameters, compromises the primary controller's wheel velocity commands, resulting in undesired and potentially unsafe robot behaviours. We provide a proof-of-concept implementation of the proposed Trojan network, which is validated through simulation under two different attack scenarios. The results confirm the effectiveness of the proposed attack and demonstrate that neural network-based robotic control systems are subject to potentially critical security threats.

Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.07106v2 Announce Type: replace Abstract: Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet extending them to jointly produce speech and 3D facial animation remains largely unexplored despite its importance for natural human-computer interaction. A key challenge is the mismatch between the discrete semantic reasoning of LLMs and the dense temporal dynamics required for 3D facial motion. We propose Expressive Omni (Ex-Omni), an open-source model that augments OLLMs with native speech-accompanied 3D facial animation. Ex-Omni decouples semantic reasoning from temporal generation through a blendshape-aware speech unit generator and a blendshape decoder, where speech units provide temporal scaffolding and hidden speech representations carry facially relevant cues. We further introduce a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection, as well as InstructS2SF-1200K, a dataset consisting of 1200K samples for pre-training. Extensive experiments show that Ex-Omni maintains competitive speech understanding and generation ability while achieving better audio-visual synchronization and lower face-generation latency than cascaded pipelines.

Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.07294v4 Announce Type: replace Abstract: With the increasing deployment of Large Language Models (LLMs) in the finance domain, LLMs are increasingly expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details, failing to reflect the complexity of professional analysis that requires synthesizing information across multiple documents, reporting periods, and corporate entities. Furthermore, these benchmarks do not disentangle whether errors arise from retrieval failures, generation inaccuracies, domain-specific reasoning mistakes, or misinterpretation of the query or context, making it difficult to precisely diagnose performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings and mirroring financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35% as tasks shift from single-document reasoning to longitudinal and cross-entity analysis. This degradation is associated with increased comparison hallucinations, temporal and entity mismatches, and is further reflected in declines in reasoning quality and factual consistency--limitations that existing benchmarks have yet to formally categorize or quantify.

Stability and error analysis of fully discrete original energy-dissipative and length-preserving scheme for the Landau-Lifshitz-Gilbert equation

Binghong Li, Xiaoli Li, Cheng Wang, Jiang Yang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.07571v2 Announce Type: replace Abstract: The Landau-Lifshitz-Gilbert (LLG) equation, regarded as a gradient flow with manifold constraint, is the fundamental model describing magnetization dynamics in ferromagnetic materials. It is well known that the normalized tangent plane method is able to simultaneously achieve the non-convex manifold constraint and original energy dissipation. However, the associated computational cost of this numerical approach is exceedingly high. By contrast, the projection method is more straightforward to implement, while it often compromises the inherent energy dissipative property of the continuous model, and the error analysis turns out to be even more challenging. In this work, we first construct a linear and fully discrete finite difference numerical scheme, based on the projection method for the LLG equation, which is capable of simultaneously preserving the non-convex manifold constraint $|\mathbf{m}| = 1$ and an unconditional original energy dissipation. In the error analysis, the classical theoretical technique becomes ineffective, due to the presence of the nonlinear Laplacian term, which in turn poses a significant challenge. To overcome this subtle difficulty, we carefully rewrite the numerical method in an equivalent weak form, in which a point-wise length preserving feature of the numerical solution plays an essential role. As a result of these estimates in the reformulated weak form, an optimal convergence rate could be theoretically established. To our knowledge, this numerical method is the first linear algorithm that preserves the following combined theoretical properties: (i) point-wise length preservation, (ii) unconditional original energy dissipation, (iii) a theoretical justification of convergence analysis and optimal rate error estimate.

On Sequence-to-Sequence Models for Automated Log Parsing

Adam Sorrenti, Andriy Miranskyy — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.07698v2 Announce Type: replace Abstract: Context: Log parsing is a critical standard operating procedure in software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribution shifts between training and deployment data, and the brittleness of rule-based approaches. Objectives: This study aims to systematically evaluate how sequence modelling architecture, representation choice, sequence length, and training data availability influence automated log parsing performance and computational cost. Methods: We conduct a controlled empirical study comparing four sequence modelling architectures: Transformer, Mamba state-space, monodirectional LSTM, and bidirectional LSTM models. In total, 396 models are trained across multiple dataset configurations and evaluated using relative Levenshtein edit distance with statistical significance testing. Results: Transformer achieves the lowest mean relative edit distance (0.111), followed by Mamba (0.145), mono-LSTM (0.186), and bi-LSTM (0.265), where lower values are better. Mamba provides competitive accuracy with substantially lower computational cost. Character-level tokenization generally improves performance, sequence length has negligible practical impact on Transformer accuracy, and both Mamba and Transformer demonstrate stronger sample efficiency than recurrent models. Conclusion: Overall, Transformers reduce parsing error by 23.4%, while Mamba is a strong alternative under data or compute constraints. These results also clarify the roles of representation choice, sequence length, and sample efficiency, providing practical guidance for researchers and practitioners.
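
The evaluation metric named in this abstract, relative Levenshtein edit distance, can be sketched as follows. Normalizing by the longer string's length is an assumption here, since the abstract does not state the normalization; the function names are illustrative. The last lines check one reading of the 23.4% figure, namely that it is the Transformer's reduction relative to Mamba's mean, which is consistent with the reported means but not stated explicitly in the abstract.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def relative_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(predicted), len(reference))
    return levenshtein(predicted, reference) / longest if longest else 0.0


print(relative_edit_distance("kitten", "sitting"))

# Reported means: Transformer 0.111, Mamba 0.145.
reduction_vs_mamba = (0.145 - 0.111) / 0.145 * 100
print(f"{reduction_vs_mamba:.1f}%")
```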

GEMSS: A Variational Bayesian Method for Discovering Multiple Sparse Solutions in Classification and Regression Problems

Kateřina Henclová, Václav Šmídl — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.08913v2 Announce Type: replace Abstract: High-dimensional, underdetermined and highly correlated systems are common in data science practice, especially when analyzing physical measurements. In such settings, feature selection poses a fundamental challenge because multiple distinct sparse subsets may explain the response equally well. Their identification is crucial not only for predictive modeling but also for generating domain-specific insights into the underlying mechanisms. Yet, conventional methods typically isolate a single solution, obscuring the full spectrum of plausible explanations. This work introduces GEMSS (Gaussian Ensemble for Multiple Sparse Solutions), a variational algorithm designed to simultaneously discover multiple, diverse sparse feature combinations. The method employs a structured spike-and-slab prior for sparsity, a mixture of Gaussians to approximate the intractable multimodal posterior, and a Jaccard-based penalty to further control solution diversity. A single objective function is optimized via stochastic gradient descent. The method is tested on 128 comprehensive experiments by a novel benchmarking framework designed to generate artificial problems with multiple sparse solutions of equal predictive properties. This allows us to measure the retrieval of ground truth features rather than only evaluating predictive performance -- characteristics more fitting to our practical needs. A comparative analysis shows that GEMSS consistently outperforms five prominent feature selection methods adapted through the ALFESE framework. Finally, we demonstrate practical usability through 3 challenging real-world datasets from metabolomics and physical chemistry: GEMSS successfully isolates multiple distinct yet quality solutions. GEMSS is available as a PyPI package 'gemss'. The corresponding repository github.com/kat-er-ina/gemss/ includes the full codebase and a free, no-code application GEMSS Explorer.
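
To make the idea of a Jaccard-based diversity penalty concrete, the sketch below scores a set of candidate feature subsets by their mean pairwise Jaccard similarity, so that overlapping solutions score higher and would be penalized more. This is an illustrative form only: the exact penalty used in GEMSS is not given in the abstract, and the function names are hypothetical rather than part of the `gemss` package.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(solutions) -> float:
    """Average Jaccard similarity over all pairs of feature subsets.

    0.0 means fully disjoint solutions; 1.0 means all solutions are identical.
    """
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature-index subsets selected by three candidate solutions.
solutions = [{1, 2}, {2, 3}, {4, 5}]
print(mean_pairwise_jaccard(solutions))
```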

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.09379v3 Announce Type: replace Abstract: Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression--anxiety classification (up to 92.3%), performance deteriorates substantially for depression--anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

Allure of Craquelure: A Variational-Generative Approach to Crack Detection in Paintings

Laura Paul, Holger Rauhut, Martin Burger, Samira Kabri, Tim Roith — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.09730v2 Announce Type: replace Abstract: Recent advances in imaging technologies, deep learning and numerical performance have enabled non-invasive detailed analysis of artworks, supporting their documentation and conservation. In particular, automated detection of craquelure in digitized paintings is crucial for assessing degradation and guiding restoration, yet remains challenging due to the possibly complex scenery and the visual similarity between cracks and crack-like artistic features such as brush strokes or hair. We propose a hybrid approach that models crack detection as an inverse problem, decomposing an observed image into a crack-free painting and a crack component. A deep generative model is employed as a powerful prior for the underlying artwork, while crack structures are captured using a Mumford--Shah-type variational functional together with a crack prior. Joint optimization yields a pixel-level map of crack localizations in the painting.

Adaptive-Horizon Conflict-Based Search for Closed-Loop Multi-Agent Path Finding

Jiarui Li, Federico Pecora, Runyu Zhang, Gioele Zardini — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.12024v2 Announce Type: replace Abstract: Multi-Agent Path Finding (MAPF) is a core coordination problem for large robot fleets in automated warehouses and logistics. Existing approaches are typically either open-loop planners, which generate fixed trajectories and struggle to handle disturbances, or closed-loop heuristics without reliable performance guarantees, limiting their use in safety-critical deployments. This paper presents ACCBS, a closed-loop algorithm built on a finite-horizon variant of Conflict-Based Search (CBS) with a horizon-changing mechanism inspired by iterative deepening in model predictive control (MPC). ACCBS dynamically adjusts the planning horizon based on the available computational budget, and reuses a single constraint tree to enable seamless transitions between horizons. As a result, it produces high-quality feasible solutions quickly while being asymptotically optimal as the budget increases, exhibiting anytime behavior. Extensive case studies demonstrate that ACCBS combines flexibility to disturbances with strong performance guarantees, effectively bridging the gap between theoretical optimality and practical robustness for large-scale robot deployment.

Hierarchical Successor Representation for Robust Transfer

Changmin Yu, Máté Lengyel — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.12753v2 Announce Type: replace Abstract: The successor representation (SR) provides a powerful framework for decoupling predictive dynamics from rewards, enabling rapid generalisation across reward configurations. However, the classical SR is limited by its inherent policy dependence: policies change due to ongoing learning, environmental non-stationarities, and changes in task demands, making established predictive representations obsolete. Furthermore, in topologically complex environments, SRs suffer from spectral diffusion, leading to dense and overlapping features that scale poorly. Here we propose the Hierarchical Successor Representation (HSR) for overcoming these limitations. By incorporating temporal abstractions into the construction of predictive representations, HSR learns stable state features which are robust to task-induced policy changes. Applying non-negative matrix factorisation (NMF) to the HSR yields a sparse, low-rank state representation that facilitates highly sample-efficient transfer to novel tasks in multi-compartmental environments. Further analysis reveals that HSR-NMF discovers interpretable topological structures, providing a policy-agnostic hierarchical map that effectively bridges model-free optimality and model-based flexibility. Beyond providing a useful basis for task-transfer, we show that HSR's temporally extended predictive structure can also be leveraged to drive efficient exploration, effectively scaling to large, procedurally generated environments.
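
For orientation, the classical (flat) successor representation that HSR builds on has a standard closed form for a fixed policy: M = (I - γP)^{-1}, where P is the policy-induced state transition matrix, and values follow as V = M r for any reward vector r. The sketch below shows only this classical SR on a hypothetical three-state ring; it does not implement HSR or the NMF step, whose details the abstract does not give.

```python
import numpy as np


def successor_representation(P: np.ndarray, gamma: float) -> np.ndarray:
    """Classical SR for a fixed policy: M = (I - gamma * P)^-1.

    M[s, s'] is the expected discounted number of visits to s' starting from s.
    """
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)


# Hypothetical 3-state ring: 0 -> 1 -> 2 -> 0 under the current policy.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
gamma = 0.9
M = successor_representation(P, gamma)

# Rewards are decoupled from dynamics: a new reward vector only needs a
# matrix-vector product, not a new round of learning.
r = np.array([0.0, 0.0, 1.0])
V = M @ r
```

Because M depends on P, any change of policy changes M as well, which is the policy dependence the abstract identifies as the limitation of the classical SR.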

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.13379v2 Announce Type: replace Abstract: LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.14367v2 Announce Type: replace Abstract: The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

Lyapunov-Based PI-Like Control for Robust Trajectory Tracking of a Four-Wheel Independently Driven and Steered Robot: Design and Experimental Validation

Branimir Ćaran, Vladimir Milić, Marko Švaco, Bojan Jerbić — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.15424v2 Announce Type: replace Abstract: In this paper, a Lyapunov-based synthesis of a PI-like controller is proposed for robust trajectory tracking of an independently driven and steered four-wheel mobile robot. For the robot considered in this work, an explicit, structurally verified mathematical model is used to enable systematic controller design with rigorous stability guarantees suitable for real-time implementation. An augmented Lyapunov-based practical stability analysis is developed for the combined velocity-error and integral-error dynamics of the inner loop, yielding explicit bounds and sufficient conditions for practical stability and uniform ultimate boundedness of the combined velocity-error and integral-error state. The resulting control law retains a PI-like structure with model-based feedforward compensation, making it suitable for implementation on standard embedded platforms while improving robustness against configuration-dependent residual dynamics and unmodelled effects. The effectiveness and robustness of the proposed design are demonstrated experimentally on a four-wheel independently steered and independently driven mobile robot platform, under both horizontal and vertical operating conditions, and benchmarked against a PI controller and a sliding-mode controller.

FENCE: A Financial and Multimodal Jailbreak Detection Dataset

Mirae Kim, Seonghun Jeong, Youngjun Kwak — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.18154v2 Announce Type: replace Abstract: Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.

Programmable Property-Based Testing

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.18545v2 Announce Type: replace Abstract: Property-based testing (PBT) is a popular technique for establishing confidence in software, where users write properties -- i.e., executable specifications -- that can be checked many times in a loop by a testing framework. In modern PBT frameworks, properties are usually written in shallowly embedded domain-specific languages, and their definition is tightly coupled to the way they are tested. Such frameworks often provide convenient configuration options to customize aspects of the testing process, but users are limited to precisely what library authors had the prescience to allow for when developing the framework; if they want more flexibility, they may need to write a new framework from scratch. We propose a new, deeper language for properties based on a mixed embedding that we call deferred binding abstract syntax, which reifies properties as a data structure and decouples them from the property runners that execute them. We implement this language in Rocq and Racket, leveraging the power of dependent and dynamic types, respectively. Finally, we showcase the flexibility of this new approach by rapidly prototyping a variety of property runners, highlighting domain-specific testing improvements that can be unlocked by more programmable testing.
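
The core idea of reifying a property as data and keeping it separate from the code that runs it can be illustrated loosely in Python. This is not the paper's deferred binding abstract syntax and not its Rocq or Racket implementation; `Property`, `run_random`, and `run_on` are hypothetical names chosen for this sketch.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Optional


@dataclass(frozen=True)
class Property:
    """A property as plain data: a name, an input generator, and a predicate."""
    name: str
    generate: Callable[[random.Random], Any]
    holds: Callable[[Any], bool]


def run_random(prop: Property, trials: int = 100, seed: int = 0) -> Optional[Any]:
    """Runner 1: random testing. Returns a counterexample, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = prop.generate(rng)
        if not prop.holds(value):
            return value
    return None


def run_on(prop: Property, inputs: Iterable[Any]) -> Optional[Any]:
    """Runner 2: check a caller-supplied list of inputs, ignoring the generator."""
    for value in inputs:
        if not prop.holds(value):
            return value
    return None


def small_list(rng: random.Random) -> list:
    """Generate a list of up to five digits."""
    return [rng.randint(0, 9) for _ in range(rng.randint(0, 5))]


reverse_twice = Property(
    "reversing twice is the identity",
    small_list,
    lambda xs: list(reversed(list(reversed(xs)))) == xs,
)
sorting_is_identity = Property(
    "sorting is the identity (deliberately false)",
    small_list,
    lambda xs: sorted(xs) == xs,
)
```

The same `Property` value is consumed by two different runners without being rewritten, which is the kind of decoupling the abstract argues for; new runners can be added without touching the property definitions.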

CRAG: Can 3D Generative Models Help 3D Assembly?

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.22629v3 Announce Type: replace Abstract: Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Project Page: https://ai4ce.github.io/CRAG/

Black-Box PWPP Is Not Turing-Closed

Pavel Hubáček — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.23809v3 Announce Type: replace Abstract: We establish that adaptive collision-finding queries are strictly more powerful than non-adaptive ones by proving that the complexity class PWPP (Polynomial Weak Pigeonhole Principle) is not closed under adaptive Turing reductions in the black-box setting. Previously, PWPP was known to be closed under non-adaptive Turing reductions (Jeřábek 2016). We demonstrate this black-box separation by introducing the NESTED-COLLISION problem, a natural collision-finding problem defined on a pair of shrinking functions. We show that while this problem is solvable via two adaptive calls to a PWPP oracle, it cannot be solved via an efficient black-box non-adaptive reduction to the canonical PWPP-complete problem COLLISION.

TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.00025v2 Announce Type: replace Abstract: Direct Preference Optimization (DPO) is an effective and widely adopted approach for offline alignment but is poorly matched to ontology-driven structured prediction, where preferred and rejected JSON objects often differ in only a few schema-defining tokens. In this low-edit-distance regime, sequence-level DPO spreads gradient mass across non-critical serialization tokens (gradient dilution) and can reduce likelihood on rare, under-confident preferred schema tokens (token erosion). To address these limitations, we first develop a confusion-aware preference-construction strategy that augments expert-curated ambiguity patterns with empirical structured-error modes estimated from validation-set SFT predictions, synthesizing minimally perturbed, schema-valid negatives that focus preference learning on realistic ontology-level decision errors. We then introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), a post-SFT objective for token-critical structured generation. TAB-PO adds a confidence-gated token-level barrier that applies supervised anchoring to under-confident schema tokens. On the public SciERC scientific information extraction task, evaluated with Llama/Qwen models from 1.5B to 70B, TAB-PO improves ontology-critical semantic-label and relational-linking metrics over SFT by 11.59% on average, wins 100% of comparisons against the strongest token-level and sequence-level DPO variants on these metrics, and surpasses leading frontier models by 14.71%, while delivering strong gains in textual grounding.
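
As background for the sequence-level objective that TAB-PO modifies, the standard DPO loss for one preference pair is -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]), where each log-probability is a sum over all tokens of the sequence. The sketch below implements only this standard loss on scalar sequence log-probabilities; the token-level barrier of TAB-PO is not specified in the abstract and is not reproduced here. The function name and the default β are illustrative.

```python
import math


def dpo_loss(policy_w: float, policy_l: float,
             ref_w: float, ref_l: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    Arguments are sequence-level log-probabilities (sums over tokens) under the
    trained policy and the frozen reference model.
    """
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    # -log(sigmoid(margin)) computed as softplus(-margin) in a numerically stable way.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# With identical policy and reference, the margin is zero and the loss is log 2.
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Raising the preferred sequence's log-probability relative to the reference lowers the loss.
improved = dpo_loss(-8.0, -10.0, -10.0, -10.0)
```

Because the margin is built from whole-sequence sums, two JSON outputs that differ in only a few schema tokens produce a signal that is shared across every token position, which is the setting the abstract describes as gradient dilution.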

EgoMoD: Predicting Global Maps of Dynamics from Local Egocentric Observations

Iacopo Catalano, David Morilla-Cabello, Jorge Pena-Queralta, Eduardo Montijano — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.00167v2 Announce Type: replace Abstract: Efficient navigation in dynamic environments requires anticipating how motion patterns evolve beyond the robot's immediate perceptual range, enabling preemptive rather than purely reactive planning in crowded scenes. Maps of Dynamics (MoDs) offer a structured representation of motion tendencies in space useful for long-term global planning, but constructing them traditionally requires global environment observations over extended periods of time. We introduce EgoMoD, the first approach that learns to predict future MoDs directly from short egocentric video clips collected during robot operation. Our method learns to infer environment-wide motion tendencies from local dynamic cues using a video- and pose-conditioned architecture trained with MoDs computed from external observations as privileged supervision, allowing local observations to serve as predictive signals of global motion structure. Thanks to this, we offer the capacity to forecast future motion dynamics over the whole environment rather than merely extend past patterns in the robot's field of view. As a site-specific dynamic prior, EgoMoD replaces the external global sensing infrastructure required by prior MoD methods at inference time with standard onboard sensors. Experiments in large simulated environments show that EgoMoD predicts future MoDs under limited observability, while evaluation with real images showcases its zero-shot transferability to real systems.

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.00610v3 Announce Type: replace Abstract: While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. Code is available at GitHub (https://github.com/Haiwen-Xia/CMI-RewardBench). Model weights: CMI-RM (https://huggingface.co/HaiwenXia/CMI-RM). Datasets: CMI-Pref-Pseudo (https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo) and CMI-Pref (https://huggingface.co/datasets/HaiwenXia/cmi-pref)

Structured vs. Unstructured Pruning: An Exponential Gap

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.02234v3 Announce Type: replace Abstract: The Strong Lottery Ticket Hypothesis (SLTH) states that large, randomly initialized neural networks contain sparse subnetworks capable of approximating a target function at initialization without training, suggesting that pruning alone is sufficient. Pruning methods are typically classified as unstructured, where individual weights can be removed from the network, and structured, where parameters are removed according to specific patterns, as in neuron pruning. Existing theoretical results supporting the SLTH rely almost exclusively on unstructured pruning, showing that logarithmic overparameterization suffices to approximate simple target networks. In contrast, neuron pruning has received limited theoretical attention, despite its practical appeal for direct hardware speedups. In this work, we consider the problem of approximating a single bias-free ReLU neuron by pruning hidden units of a randomly initialized two-layer ReLU network, effectively isolating the intrinsic limitations of neuron pruning. We show that achieving an $\varepsilon$-approximation requires a starting network size of $\Omega(1/\varepsilon)$ for neuron pruning, whereas weight pruning succeeds with only $O(\log(1/\varepsilon))$ hidden units, revealing an exponential separation between the two approaches.
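
To make the size of the stated gap concrete, the two bounds can be tabulated with all constants set to 1 (an assumption made only for illustration; the abstract gives asymptotic rates, not constants). The function names are hypothetical.

```python
import math


def neuron_pruning_scale(eps: float) -> float:
    """Lower-bound scaling 1/eps for neuron pruning (constant factor assumed to be 1)."""
    return 1.0 / eps


def weight_pruning_scale(eps: float) -> float:
    """Upper-bound scaling log(1/eps) for weight pruning (constant factor assumed to be 1)."""
    return math.log(1.0 / eps)


for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    print(f"eps={eps:g}  neuron ~ {neuron_pruning_scale(eps):.0f}  "
          f"weight ~ {weight_pruning_scale(eps):.2f}")
```

The separation is exponential in the sense that 1/ε = exp(log(1/ε)): the neuron-pruning requirement is the exponential of the weight-pruning requirement, up to constants.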

PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition

Jinseop Lee, Byoungho Lee, Gichul Yoo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.05965v3 Announce Type: replace Abstract: We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $\sigma_\theta = \sigma_t / r$ in $\mathcal{O}(R{\cdot}S)$ time. The primary parameter $\sigma_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity that enhances cross-sensor generalization while reducing the need for extensive per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance relative to both handcrafted and supervised baselines. The source code and supplementary materials are available at https://sites.google.com/view/probe-pr.
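
The relation $\sigma_\theta = \sigma_t / r$ says that a fixed translational uncertainty corresponds to a larger angular uncertainty close to the sensor than far from it. A short numeric sketch, with an assumed $\sigma_t$ of 0.5 m chosen only for illustration (the abstract does not state a value):

```python
import math


def angular_sigma(sigma_t: float, r: float) -> float:
    """Angular uncertainty in radians for translational uncertainty sigma_t (m) at range r (m)."""
    if r <= 0.0:
        raise ValueError("range r must be positive")
    return sigma_t / r


sigma_t = 0.5  # metres; an assumed value, not taken from the paper
for r in (5.0, 20.0, 50.0):
    s = angular_sigma(sigma_t, r)
    print(f"r = {r:4.0f} m  sigma_theta = {s:.3f} rad ({math.degrees(s):.2f} deg)")
```

Since $\sigma_t$ is expressed in metres, the same value can in principle be reused across sensors with different angular resolution, which is the cross-sensor argument made in the abstract.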

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.06652v2 Announce Type: replace Abstract: Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations--cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.

Efficient Neighbourhood Search in 3D Point Clouds Through Space-Filling Curves and Linear Octrees

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.06771v3 Announce Type: replace Abstract: This work presents an efficient approach for neighbourhood searching in 3D point clouds by combining spatial reordering leveraging Space-Filling Curves (SFC), specifically Morton and Hilbert curves, with a linear Octree implementation. We also propose specialised search algorithms for fixed-radius and kNN queries, based on our linear Octree structures. Additionally, we introduce the novel concept of kNN locality histogram, which can be easily computed to characterise locality in data accesses, and which we found to be directly related to cache misses and search performance. Our experiments reveal that SFC reordering significantly improves access to spatial data, reducing the number of cache misses by 25% to 75% and runtime by up to 50%. Moreover, we compare our proposal with several widely used Octree and KDTree implementations. Our method achieves a significant reduction in search time, up to 10$\times$ faster than existing solutions. Additionally, we analysed the performance of our neighbourhood searches (parallelised using OpenMP), demonstrating high scalability with the number of cores and the problem size. Notably, we observed a speedup of up to $36\times$ when executing fixed-radius searches in a system with 40 cores. The results obtained indicate that our methods provide a robust and efficient solution for applications that require fast access to large-scale 3D point neighbour sets.
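
Morton (Z-order) reordering, one of the two space-filling curves mentioned, interleaves the bits of the quantised x, y, and z coordinates into a single key and then sorts points by that key. The sketch below is a plain bit-by-bit version written for readability, assuming non-negative integer grid coordinates of at most 10 bits each; it is not the authors' implementation, and production code would typically use bit-twiddling or lookup tables.

```python
def morton3d(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z into one Morton key.

    Bit i of x goes to position 3*i, of y to 3*i + 1, and of z to 3*i + 2.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def sort_by_morton(points: list) -> list:
    """Return the integer grid points reordered along the Morton curve."""
    return sorted(points, key=lambda p: morton3d(*p))


grid_points = [(1, 1, 1), (0, 0, 0), (2, 0, 0), (0, 1, 0)]
reordered = sort_by_morton(grid_points)
```

After sorting, points that are close in space tend to be close in memory, which is the locality effect the abstract connects to fewer cache misses.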

Discontinuous Galerkin approximation of a nonlinear multiphysics problem arising in ultrasound-enhanced drug delivery

Femke de Wit, Vanja Nikolić — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.08415v2 Announce Type: replace Abstract: Motivated by simulations of ultrasound-enhanced drug delivery, this work presents the numerical analysis of a mathematical model that captures the influence of ultrasound waves on the diffusivity of the drug. The system under study consists of the Westervelt wave equation, accounting for the nonlinear propagation of ultrasound, coupled to a convection-diffusion equation modeling the drug concentration. In particular, drug delivery is affected by ultrasound through a pressure-dependent diffusion coefficient. The Westervelt equation is supplemented by linear absorbing boundary conditions as a means of reducing spurious reflections off the boundaries of computational domains. For spatial discretization of this multiphysics system, we employ a discontinuous Galerkin approach on simplicial meshes. Under suitable assumptions on the exact pressure and the mesh size, we first establish well-posedness, non-degeneracy, and optimal convergence rates in the energy norm for the semi-discrete pressure subproblem. The smallness of the semi-discrete pressure is then used to establish the well-posedness and convergence of the wave--convection-diffusion system under suitable regularity of the exact concentration. Finally, theoretical findings are illustrated through numerical experiments.

Echo2ECG: Enhancing ECG Representations with Cardiac Morphology from Multi-View Echos

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.08505v2 Announce Type: replace Abstract: Electrocardiography (ECG) is a low-cost, widely used modality for diagnosing electrical abnormalities like atrial fibrillation by capturing the heart's electrical activity. However, it cannot directly measure cardiac morphological phenotypes, such as left ventricular ejection fraction (LVEF), which typically require echocardiography (Echo). Predicting these phenotypes from ECG would enable early, accessible health screening. Existing self-supervised methods suffer from a representational mismatch by aligning ECGs to single-view Echos, which only capture local, spatially restricted anatomical snapshots. To address this, we propose Echo2ECG, a multimodal self-supervised learning framework that enriches ECG representations with the heart's morphological structure captured in multi-view Echos. We evaluate Echo2ECG as an ECG feature extractor on two clinically relevant tasks that fundamentally require morphological information: (1) classification of structural cardiac phenotypes across three datasets, and (2) retrieval of Echo studies with similar morphological characteristics using ECG queries. Our extracted ECG representations consistently outperform those of state-of-the-art unimodal and multimodal baselines across both tasks, despite being 18x smaller than the largest baseline. These results demonstrate that Echo2ECG is a robust, powerful ECG feature extractor. Our code is accessible at https://github.com/michelleespranita/Echo2ECG.

On the Reliability of Cue Conflict and Beyond

Pum Jun Kim, Seung-Ah Lee, Seongho Park, Dongyoon Han, Jaejun Yoo — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.10834v3 Announce Type: replace Abstract: Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may not reliably instantiate perceptually valid and separable cues nor control their relative informativeness, ratio-based bias can obscure absolute cue sensitivity, and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model-recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric, enabling fairer cross-model comparisons. Across diverse training regimes and architectures, REFINED-BIAS yields a more faithful diagnosis of shape and texture biases and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.

Differentiable Thermodynamic Phase-Equilibria for Machine Learning

Karim K. Ben Hicham, Moreno Ascani, Jan G. Rittig, Alexander Mitsos — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.11249v3 Announce Type: replace Abstract: Accurate prediction of phase equilibria remains a central challenge in chemical engineering. Physics-consistent machine learning methods that incorporate thermodynamic structure into neural networks have recently shown strong performance for activity-coefficient modeling. However, extending such approaches to equilibrium data arising from an extremum principle, such as liquid-liquid equilibria, remains difficult. Here we present DISCOMAX, a differentiable algorithm for phase-equilibrium calculation that guarantees thermodynamic consistency at both training and inference, subject only to a user-specified discretization. The method combines discrete enumeration of feasible phase states with propagation of the true equilibrium state in the forward pass and masked softmax aggregation in the backward pass, using a straight-through gradient estimator to enable physics-consistent end-to-end learning of neural excess Gibbs energy ($g^E$) models. We show that this approach bears analogy to statistical thermodynamics, and we evaluate it on binary liquid-liquid equilibrium data where it outperforms existing surrogate-based methods, while offering a general framework for learning from different kinds of equilibrium data.

ARROW: Augmented Replay for RObust World models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.11395v3 Announce Type: replace Abstract: Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.

Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents

Sky Chenwei Wan, Yifei Y. Wang, Tianjun Hou, Xiqing Chang, Aymeric Jan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.11479v3 Announce Type: replace Abstract: Time Series Event Detection (TSED) aims to localize semantically meaningful events in time series data, with critical applications in high-stakes domains. Unlike statistical anomalies, events are often defined by natural-language descriptions with internal temporal-logic structures across multiple physical channels. However, in real-world settings, dense event annotations are expensive to obtain, making purely supervised learning difficult. We introduce Language-guided TSED, a setting where a model is given textual event descriptions and must ground them to intervals in multivariate signals with little or no labeled data. To address this problem, we propose Event Logic Tree (ELT), a knowledge representation framework that converts linguistic descriptions into structured temporal logic over signal primitives. Building on ELT, we present SELA, a neuro-symbolic VLM agent framework that iteratively grounds primitives from signal visualizations and composes them under ELT constraints, producing both event intervals and faithful tree-structured explanations. We further release a real-world benchmark across energy and climate domains with expert knowledge and annotations. Experiments show that SELA improves over supervised fine-tuning and existing zero/few-shot time series reasoning baselines.

CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.11863v2 Announce Type: replace Abstract: The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets -- CreativeBench-Combo and CreativeBench-Explore -- the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit "convergence-by-scaling," becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.
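
A product-form metric can be sketched in a few lines. The abstract states only that the metric is the product of quality and novelty; the assumption that both terms lie in [0, 1] and the function name `creativity_score` are illustrative and not taken from the benchmark.

```python
def creativity_score(quality, novelty):
    """Product of quality and novelty, both assumed to lie in [0, 1]."""
    if not (0.0 <= quality <= 1.0 and 0.0 <= novelty <= 1.0):
        raise ValueError("quality and novelty are assumed to be in [0, 1]")
    return quality * novelty


# A novel but non-working program (quality 0) scores zero, as does a
# correct but entirely unoriginal one (novelty 0).
hallucinated = creativity_score(quality=0.0, novelty=0.9)
unoriginal = creativity_score(quality=1.0, novelty=0.0)
balanced = creativity_score(quality=0.5, novelty=0.5)
```

The multiplicative form means neither term can compensate for the other being zero, which is how a metric of this shape separates creativity from hallucination.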

Mixing Makes Markovian Contexts Cheap for Linear Bandits

Kaan Buyukkalayci, Osama Hanna, Christina Fragouli — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.12530v2 Announce Type: replace Abstract: Recent work shows that when contexts are drawn i.i.d., linear contextual bandits can be reduced to single-context linear bandits. This "contexts are cheap" perspective is highly advantageous, as it allows for sharper finite-time analyses and leverages mature techniques from the linear bandit literature, such as those for misspecification and adversarial corruption. However, this reduction crucially relies on the independence of contexts and does not extend to settings with temporally correlated (e.g., Markovian) contexts, which arise frequently in practice. Motivated by applications with temporally correlated availability, we extend this perspective to linear bandits with Markovian context processes, where the action set evolves via an exogenous Markov chain. Our main contribution is a reduction that applies under uniform geometric ergodicity. We construct a stationary surrogate action set to solve the problem using a standard linear bandit oracle, employing a delayed-update scheme to control the bias induced by the nonstationary conditional context distributions. We further provide a phased algorithm for unknown stationary distributions that learns the surrogate mapping online. In both settings, we obtain a high-probability worst-case regret bound matching that of the underlying linear bandit oracle in sufficiently fast mixing regimes. We then validate our results on a real-world instance, where we show practical gains over a LinUCB baseline.

Towards One-for-All Anomaly Detection for Tabular Data

Shiyuan Li, Yixin Liu, Yu Zheng, Xiaofeng Cao, Shirui Pan, Heng Tao Shen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.14407v2 Announce Type: replace Abstract: Tabular anomaly detection (TAD) aims to identify samples that deviate from the majority in tabular data and is critical in many real-world applications. However, existing methods follow a "one model for one dataset (OFO)" paradigm, which relies on dataset-specific training and thus incurs high computational cost and yields limited generalization to unseen domains. To address these limitations, we propose OFA-TAD, a generalist one-for-all (OFA) TAD framework that only requires one-time training on multiple source datasets and can generalize to unseen datasets from diverse domains on-the-fly. To realize one-for-all tabular anomaly detection, OFA-TAD extracts neighbor-distance patterns as transferable cues, and introduces multi-view neighbor-distance representations from multiple transformation-induced metric spaces to mitigate the transformation sensitivity of distance profiles. To adaptively combine multi-view distance evidence, a Mixture-of-Experts (MoE) scoring network is employed for view-specific anomaly scoring and entropy-regularized gated fusion, with a multi-strategy anomaly synthesis mechanism to support training under the one-class constraint. Extensive experiments on 34 datasets from 14 domains demonstrate that OFA-TAD achieves superior anomaly detection performance and strong cross-domain generalizability under the strict OFA setting. The source code is available at https://github.com/Shiy-Li/OFA-TAD.

V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.14482v3 Announce Type: replace Abstract: We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks, including 7.71 mAP on Ego4D for short-term object-interaction anticipation and 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, as well as a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also demonstrates strong performance in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.

Disentangling Dynamical Systems: Causal Representation Learning Meets Local Sparse Attention

Markus W. Baumgartner, Anson Lei, Joe Watson, Ingmar Posner — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.14483v2 Announce Type: replace Abstract: Parametric system identification methods estimate the parameters of explicitly defined physical systems from data. Yet, they remain constrained by the need to provide an explicit function space, typically through a predefined library of candidate functions chosen via available domain knowledge. In contrast, deep learning can demonstrably model systems of broad complexity with high fidelity, but black-box function approximation typically fails to yield explicit descriptive or disentangled representations revealing the structure of a system. We develop a novel identifiability theorem, leveraging causal representation learning, to uncover disentangled representations of system parameters without structural assumptions. We derive a graphical criterion specifying when system parameters can be uniquely disentangled from raw trajectory data, up to permutation and diffeomorphism. Crucially, our analysis demonstrates that global causal structures provide a lower bound on the disentanglement guarantees achievable when considering local state-dependent causal structures. We instantiate system parameter identification as a variational inference problem, leveraging a sparsity-regularised transformer to uncover state-dependent causal structures. We empirically validate our approach across four synthetic domains, demonstrating its ability to recover highly disentangled representations that baselines fail to recover. Corroborating our theoretical analysis, our results confirm that enforcing local causal structure is often necessary for full identifiability.

Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies

Zahra Rahiminasab, Reza Soumi, Arto Klami, Samuel Kaski — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.15158v3 Announce Type: replace Abstract: Addressing the domain adaptation problem becomes more challenging when distribution shifts across domains stem from latent confounders that affect both covariates and outcomes. Existing proxy-based approaches that address latent shift rely on a strong completeness assumption to uniquely determine (point-identify) a robust predictor. Completeness requires that proxies have sufficient information about variations in latent confounders. For imperfect proxies, the mapping from confounders to the space of proxy distributions is non-injective, and multiple latent confounder values can generate the same proxy distribution. This breaks the completeness assumption, and the observed data are then consistent with multiple potential predictors (set-identified). To address this, we introduce latent equivalent classes (LECs). LECs are defined as groups of latent confounders that induce the same conditional proxy distribution. We show that point-identification for the robust predictor remains achievable as long as multiple domains differ sufficiently in how they mix proxy-induced LECs to form the robust predictor. This domain diversity condition is formalized as a cross-domain rank condition on the mixture weights, which is a substantially weaker assumption than completeness. We introduce the Proximal Quasi-Bayesian Active learning (PQAL) framework, which actively queries a small, targeted set of diverse domains that satisfy this rank condition. PQAL can recover the point-identified predictor, demonstrates robustness to varying degrees of shift, and outperforms previous methods on synthetic data and on the semi-synthetic dSprites, IHDP, and ACS Folktables datasets.

Safety Case Patterns for VLA-based driving systems: Insights from SimLingo

Gerhard Yu, Fuyuki Ishikawa, Oluwafemi Odu, Alvine Boaye Belle — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.16013v3 Announce Type: replace Abstract: Vision-Language-Action (VLA)-based driving systems represent a significant paradigm shift in autonomous driving since, by combining traffic scene understanding, linguistic interpretation, and action generation, these systems enable more flexible, adaptive, and instruction-responsive driving behaviors. However, despite their growing adoption and potential to support socially responsible autonomous driving as well as understanding high-level human instructions, VLA-based driving systems may exhibit new types of hazardous behaviors. For instance, the integration of open-ended natural language inputs (e.g., user or navigation instructions) into the multimodal control loop may lead to unpredictable and unsafe behaviors that could endanger vehicle occupants and pedestrians. Hence, assuring the safety of these systems is crucial to help build trust in their operations. To support this, we propose a novel safety case design approach called RAISE. Our approach introduces novel patterns tailored to instruction-based driving systems such as VLA-based driving systems, an extension of Hazard Analysis and Risk Assessment (HARA) detailing safe scenarios and their outcomes, and a design technique to create the safety cases of VLA-based driving systems. A case study on SimLingo illustrates how our approach can be used to construct rigorous, evidence-based safety claims for this emerging class of autonomous driving systems.

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.21563v5 Announce Type: replace Abstract: Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles, but reinforcement learning for such systems is limited by credit assignment: shared terminal rewards obscure individual contributions and can encourage free-riding. We introduce two optimizer-agnostic credit assignment methods for converting joint outcomes into agent-specific learning signals. Counterfactual Credit for Policy Optimization (CCPO) estimates an agent's marginal contribution by comparing the realized joint outcome with a counterfactual outcome where that agent is removed. Self-Evaluated Credit for Policy Optimization (SEPO) uses constrained self- and peer-evaluations as a verifier-anchored credit signal while keeping the external task outcome dominant. Both operate at the reward-construction layer rather than as policy optimizers, producing role-specific rewards or advantages for GRPO, GSPO, or REINFORCE++. We instantiate these credit signals in a sequential Think--Solve setting and evaluate them on mathematical reasoning benchmarks. Results show that explicit credit assignment often improves dual-agent reasoning, especially on MATH500 and several out-of-distribution settings, while gains vary across models and datasets. Our code is available at: https://github.com/bhai114/ccpo.
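
The counterfactual comparison described for CCPO can be sketched as a difference of two outcomes. This is only an illustration of the general idea: the function `counterfactual_credit`, the toy outcome table, and the role names are hypothetical, and the paper's actual estimator may differ.

```python
def counterfactual_credit(agents, outcome_fn):
    """Credit for each agent = realized joint outcome minus outcome without that agent."""
    realized = outcome_fn(agents)
    credits = {}
    for agent in agents:
        remaining = [a for a in agents if a != agent]
        credits[agent] = realized - outcome_fn(remaining)
    return credits


def toy_outcome(team):
    # Hypothetical success rates for a two-role Think--Solve pipeline.
    table = {
        frozenset(): 0.0,
        frozenset({"thinker"}): 0.0,            # a plan alone yields no answer
        frozenset({"solver"}): 0.4,             # solver without a plan
        frozenset({"thinker", "solver"}): 1.0,  # full pipeline
    }
    return table[frozenset(team)]


credits = counterfactual_credit(["thinker", "solver"], toy_outcome)
```

In this toy setting the shared terminal reward of 1.0 is split into role-specific signals: the solver's credit is the full 1.0, while the thinker's credit is the 0.6 improvement it adds over the solver acting alone.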

OccAny: Generalized Unconstrained Urban 3D Occupancy

Anh-Quan Cao, Tuan-Hung Vu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.23502v2 Announce Type: replace Abstract: Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes, and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on the 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at https://github.com/valeoai/OccAny.

Cross-Model Disagreement as a Label-Free Correctness Signal

Matt Gorbett, Suman Jana — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.25450v2 Announce Type: replace Abstract: Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model's own uncertainty -- such as token entropy or confidence scores -- but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator -- a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model's generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model's surprise at the generating model's answer tokens, and Cross-Model Entropy (CME), which measures the verifying model's uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
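
The two signals can be sketched from a verifier's per-token outputs. The abstract does not give exact formulas, so this sketch assumes the standard definitions: perplexity as the exponential of the mean negative log-likelihood, and entropy as the mean Shannon entropy of the verifier's next-token distributions. The function names and input format are hypothetical.

```python
import math


def cross_model_perplexity(token_logprobs):
    """exp(mean negative log-prob) the verifier assigns to the generator's answer tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def cross_model_entropy(distributions):
    """Mean Shannon entropy (in nats) of the verifier's distributions at answer positions."""
    total = 0.0
    for dist in distributions:
        total -= sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(distributions)


# Toy inputs: the verifier gives each of 4 answer tokens probability 0.5,
# and is uniformly uncertain over 4 candidate tokens at each of 2 positions.
cmp_value = cross_model_perplexity([math.log(0.5)] * 4)
cme_value = cross_model_entropy([[0.25, 0.25, 0.25, 0.25]] * 2)
```

Both quantities need only one forward pass of the verifier over the generated answer; higher values suggest the verifier disagrees with, or is unsure about, the answer.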

Etna: An Evaluation Platform for Property-Based Testing

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.27002v2 Announce Type: replace Abstract: Property-based testing is a mainstay of functional programming, boasting a rich literature, an enthusiastic user community, and an abundance of tools -- so many, indeed, that new users may have difficulty choosing. Moreover, any given framework may support a variety of strategies for generating test inputs; even experienced users may wonder which are better in any given situation. Sadly, the PBT literature, though long on creativity, is short on rigorous comparisons to help answer such questions. We present ETNA, a platform for empirical evaluation and comparison of PBT techniques. ETNA incorporates a number of popular PBT frameworks and testing workloads from the literature, and its extensible architecture makes adding new ones easy, while handling the technical drudgery of performance measurement. To illustrate its benefits, we use ETNA to carry out several experiments with popular PBT approaches in Rocq, Haskell, OCaml, Racket, and Rust, allowing users to more clearly understand best practices and tradeoffs.

Coupling Scenario-Based Grid Simulations with State Estimation: Measurement Requirements for Low-Voltage Networks under the German Energy Transition Pathway

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.28945v2 Announce Type: replace Abstract: Increasing penetration of electric vehicles, heat pumps, and rooftop photovoltaics is creating thermal and voltage stress in low-voltage distribution grids. This work links the German Federal Government energy transition pathway (2025-2045) with state estimation performance requirements, evaluated on two SimBench reference networks across three equipment quality levels (good, medium, poor) and three VDE Forum Netztechnik/Netzbetrieb (VDE FNN) measurement constellations that differ in the availability of transformer and feeder-level instrumentation. Within this work's analysis, congestion is caused exclusively by transformer overloading and voltage-band violations. No individual line exceeds its thermal rating (maximum: 89.5%). Equipment quality governs congestion onset for a given deployment trajectory: under good equipment, congestion remains absent through 2045, under medium equipment it emerges from 2035 (3/6 scenarios), under poor equipment from 2025 (6/6). Without transformer instrumentation, median voltage estimation errors reach 6-42% regardless of smart meter penetration. Adding a single transformer measurement reduces errors by an order of magnitude, achieving median errors of 0.5-1.7%. In urban networks, transformer-level instrumentation meets the VDE FNN voltage accuracy target (99th percentile voltage error below 2%) in all configurations. In rural networks under poor equipment, the target is approached but not met. These findings motivate prioritizing transformer instrumentation as an effective first step for grid observability and supplementing the current consumption-driven metering rollout with risk-based deployment criteria linked to local congestion exposure.

Variational Graph Neural Networks for Uncertainty Quantification in Inverse Problems

David Gonzalez, Alba Muixi, Beatriz Moya, Elias Cueto — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.29515v2 Announce Type: replace Abstract: The increasingly wide use of deep machine learning techniques in computational mechanics has significantly accelerated simulations of problems that were considered unapproachable just a few years ago. However, in critical applications such as Digital Twins for engineering or medicine, fast responses are not enough; reliable results must also be provided. In certain cases, traditional deterministic methods may not be optimal as they do not provide a measure of confidence in their predictions or results, especially in inverse problems where the solution may not be unique or the initial data may not be entirely reliable, for instance due to the presence of noise. Classic deep neural networks also lack a clear measure to quantify the uncertainty of their predictions. In this work, we present a variational graph neural network (VGNN) architecture that integrates variational layers to model the probability distribution of weights. Unlike computationally expensive full Bayesian networks, our approach strategically introduces variational layers exclusively in the decoder, allowing us to estimate cognitive uncertainty and statistical uncertainty at a relatively lower cost. We validate the proposed methodology in two cases of solid mechanics: the identification of the value of the elastic modulus with nonlinear distribution in a 2D elastic problem and the location and quantification of the loads applied to a 3D hyperelastic beam, in both cases using only the displacement field of each test as input data. The results show that the model not only recovers the physical parameters with high precision, but also provides confidence intervals consistent with the physics of the problem, as well as being able to locate the position of the applied load and estimate its value, giving a confidence interval for that experiment.

DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.07590v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) is widely used to ground large language models in external knowledge sources. However, when applied to heterogeneous corpora and multi-step queries, naive RAG pipelines often degrade in quality due to flat knowledge representations and the absence of explicit workflows. In this work, we introduce DCD (Domain-Collection-Document), a domain-oriented design to structure knowledge and control query processing in RAG systems without modifying the underlying language model. The proposed approach relies on a hierarchical decomposition of the information space and multi-stage routing based on structured model outputs, enabling progressive restriction of both retrieval and generation scopes. The architecture is complemented by smart chunking, hybrid retrieval, and integrated validation and generation guardrail mechanisms. We describe the DCD architecture and workflow and discuss evaluation results on a synthetic evaluation dataset, highlighting their impact on robustness, factual accuracy, and answer relevance in applied RAG scenarios.

WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

Mintae Kim, Koushil Sreenath — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.08958v3 Announce Type: replace Abstract: Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model-Based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.

AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.08983v2 Announce Type: replace Abstract: Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. Recent methods based on vision-language models (VLMs) largely rely on coarse 2D perception and struggle to perform accurate reasoning over complex 3D geometry. To address this limitation, we propose AssemLM, a spatial multimodal large language model for robotic assembly that integrates assembly manuals, point clouds, and textual instructions to predict task-critical 6D assembly poses with explicit geometric understanding. To bridge raw 3D perception and high-level linguistic reasoning, AssemLM employs a specialized point cloud encoder to extract fine-grained geometric and rotational features for accurate 3D spatial reasoning in assembly tasks. In addition, we introduce AssemBench, a large-scale benchmark for assembly-oriented spatial reasoning with over 900K multimodal samples and precise 6D pose annotations, extending evaluation from 2D grounding to full 3D geometric inference. Extensive experiments and real-robot evaluations demonstrate that AssemLM achieves state-of-the-art 6D pose reasoning performance and effectively supports fine-grained, multi-step assembly tasks in real-world settings. Code, models, and the AssemBench dataset will be made publicly available.

BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.10389v2 Announce Type: replace Abstract: Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.

From Planning to Revision: How AI Writing Support at Different Stages Alters Ownership

Katy Ilonka Gero, Tao Long, Carly Schnitzler, Paramveer Dhillon — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.11009v2 Announce Type: replace Abstract: Although AI assistance can improve writing quality, it can also decrease feelings of ownership. Ownership in writing has important implications for attribution, rights, norms, and cognitive engagement, and designers of AI support systems may want to consider how system features may impact ownership. We investigate how the stage at which AI support for writing is provided (planning, drafting, or revising) changes ownership. In a study of short essay writing (between subjects, n = 253) we find that while any AI assistance decreased ownership, planning support only minimally decreased ownership, while drafting support saw the largest decrease. This variation maps onto the amount of text and ideas contributed by AI, where more text and ideas from AI decreased ownership. Notably, an AI-generated draft based on participants' own outline resulted in significantly more AI-contributed ideas than AI support for planning. At the same time, more AI contributions improved essay quality. We propose that writers, educators, and designers consider writing stage when introducing AI assistance.

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.12002v2 Announce Type: replace Abstract: Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser's token distributions conditioned on the generator's response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator's response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization. Code: https://github.com/princeton-pli/Self-Distillation-Zero.

Allocating Human Oversight in AI-Enabled Analytics

Zikun Ye, Jiameng Lyu, Rui Tao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.12497v2 Announce Type: replace Abstract: Organizations increasingly deploy AI as a low-cost prediction layer in customer-facing decision processes, including demand sensing, service-quality monitoring, product testing, and market research, but AI-generated signals are unevenly reliable across tasks, products, and customer segments. Firms therefore still need scarce human validation (labels, audits, survey responses, or follow-up measurements) to anchor AI outputs to ground truth. Because human ground truth is itself noisy, varying across labelers and even across repeated judgments, the firm must collect and average several human labels per task, which makes human validation costly. We study how to allocate a limited human-validation budget across many AI-assisted tasks when reliability is heterogeneous and unknown before deployment. We cast this within tuned prediction-powered inference. Each human label both sharpens the AI-assisted estimate and reveals the task's rectification difficulty, the variance that remains after the AI prediction is optimally used as a control variate. If difficulties were known, the optimal allocation would follow a Neyman square-root rule; because they are unknown, we propose a policy based on upper confidence bounds that learns them online and steers validation toward tasks where AI is least reliable. We prove that the policy's terminal efficiency loss relative to the oracle allocation vanishes as the budget grows. In synthetic experiments and a real digital-twin survey with 68 tasks and over 2000 respondents, it closes most of the gap to the oracle when reliability is heterogeneous, outperforming uniform and epsilon-greedy allocation; on the survey data it also outperforms explore-then-commit pilot designs and cuts uniform's 10--12% gap to 2--6%. The value of AI depends not only on model accuracy but also on the operational policy that targets human oversight where AI errors matter most.
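
The oracle allocation mentioned above can be sketched for the simplified case where each task's residual variance is known. The sketch assumes the objective is the sum of per-task estimator variances `v_i / n_i` under a fixed total budget, for which the minimizer assigns labels in proportion to `sqrt(v_i)`. The function names and the real-valued (unrounded) allocation are illustrative choices, not the paper's policy.

```python
import math

def neyman_allocation(variances, budget):
    """Split `budget` labels across tasks in proportion to sqrt(variance)."""
    roots = [math.sqrt(v) for v in variances]
    total = sum(roots)
    if total == 0:
        # All tasks are equally easy: fall back to a uniform split.
        return [budget / len(variances)] * len(variances)
    return [budget * r / total for r in roots]

def total_variance(variances, allocation):
    """Sum of per-task estimator variances v_i / n_i."""
    return sum(v / n for v, n in zip(variances, allocation))

# Hypothetical residual variances for three tasks and a budget of 60 labels.
variances = [1.0, 4.0, 9.0]
oracle = neyman_allocation(variances, 60)
uniform = [20.0, 20.0, 20.0]
```

With these made-up variances the square-root rule sends more labels to the noisier tasks and yields a lower total variance than the uniform split. The online policy in the abstract has to learn the variances instead of taking them as input.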

ASTER: Latent Pseudo-Anomaly Generation for Unsupervised Time-Series Anomaly Detection

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.13924v3 Announce Type: replace Abstract: Time-series anomaly detection (TSAD) is critical in domains such as industrial monitoring, healthcare, and cybersecurity, but it remains challenging due to rare and heterogeneous anomalies and the scarcity of labelled data. This scarcity makes unsupervised approaches predominant, yet existing methods often rely on reconstruction or forecasting, which struggle with complex data, or on embedding-based approaches that require domain-specific anomaly synthesis and fixed distance metrics. We propose ASTER, a framework that generates pseudo-anomalies directly in the latent space, avoiding handcrafted anomaly injections and the need for domain expertise. A latent-space decoder produces tailored pseudo-anomalies to train a Transformer-based anomaly classifier, while a pre-trained LLM enriches the temporal and contextual representations of this space. Experiments on three benchmark datasets show that ASTER achieves state-of-the-art performance and sets a new standard for LLM-based TSAD.

Nonlinear backstepping with saturation for low-thrust station-keeping of libration point orbits

António Nunes, Sérgio Brás, Pedro Batista — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.15028v2 Announce Type: replace Abstract: This paper presents a novel nonlinear backstepping control law for continuous, low-thrust station-keeping in the Earth-Moon system. Quasi-periodic libration point orbits are targeted under a high-fidelity model of the dynamics. Almost global uniform exponential stability guarantees are attained, as shown through Lyapunov's stability theory. Saturation of the actuators is formally included in the controller design, such that these guarantees hold even in the event of saturation. The relationship between saturation threshold, control gains, and deviation is studied and an optimal procedure for gain selection is discussed. The control solution is tested numerically through a Monte Carlo analysis over representative application cases, subject to operational errors, constraints, and external perturbations. Station-keeping under actuation saturation is validated considering a conservative threshold for typical electric propulsion systems.

A Survey on Long-Term Memory Security in LLM Agents: Attacks, Defenses, and Governance Across the Memory Lifecycle

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.16548v2 Announce Type: replace Abstract: The emergence of writable, cross-session persistent memory in LLM agents introduces a qualitatively different threat landscape from conventional input-centric security concerns, characterized by three properties: persistence, statefulness, and propagation. To systematically characterize this landscape, we propose a Memory Lifecycle Framework that organizes attacks, defenses, and their cross-phase dependencies along two axes: six lifecycle phases (Write, Store, Retrieve, Execute, Share & Propagate, Forget & Rollback) and four security objectives (Integrity, Confidentiality, Availability, Governance). This analysis in turn exposes the need for formal security guarantees at the system level, motivating Verifiable Memory Governance (VMG), a framework of five architectural primitives that specifies what verifiable mechanisms a long-term-memory system must provide to maintain auditable, recoverable control over its memory state. Our analysis indicates that robust Long-Term Memory (LTM) security cannot be retrofitted at retrieval or execution time alone, but must be anchored in storage-time provenance, versioning, and policy-aware retention from the outset.

The Query Channel: Information-Theoretic Limits of Masking-Based Explanations

Erciyes Karakaya, Ozgur Ercetin — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.16689v2 Announce Type: replace Abstract: Masking-based post-hoc explanation methods, such as KernelSHAP and LIME, estimate local feature importance by querying a black-box model under randomized perturbations. This paper formulates this procedure as communication over a query channel, where the latent explanation acts as a message and each masked evaluation is a channel use. Within this framework, the complexity of the explanation is captured by the entropy of the hypothesis class, while the query interface supplies information at a rate determined by an identification capacity per query. We derive a strong converse showing that, if the explanation rate exceeds this capacity, the probability of error in exact recovery necessarily converges to one for any sequence of explainers and decoders. We also prove an achievability result establishing that a sparse maximum-likelihood decoder attains reliable recovery when the rate lies below capacity. A Monte Carlo estimator of mutual information yields a non-asymptotic query benchmark that we use to compare optimal decoding with Lasso- and OLS-based procedures that mirror LIME and KernelSHAP. Experiments reveal a range of query budgets where information theory permits reliable explanations but standard convex surrogates still fail. Finally, we interpret super-pixel resolution and tokenization for neural language models as a source-coding choice that sets the entropy of the explanation and show how Gaussian noise and nonlinear curvature degrade the query channel, induce waterfall and error-floor behavior, and render high-resolution explanations unattainable.

Implementing CPSLint: A Data Validation and Sanitisation Tool for Industrial Cyber-Physical Systems

Uraz Odyurt, Ömer Sayilir, Mariëlle Stoelinga, Vadim Zaytsev — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.18191v2 Announce Type: replace Abstract: Raw datasets are often too large and unstructured to work with directly, and require a data preparation phase. The domain of industrial Cyber-Physical Systems (CPSs) is no exception, as raw data typically consists of large time-series data collections that log the system's status at regular time intervals. The processing of such raw data is often carried out using ad hoc, case-specific, one-off Python scripts, often neglecting aspects of readability, reusability, and maintainability. In practice, this can cause professionals such as data scientists to write similar data preparation scripts for each case, requiring them to do much repetitive work. We introduce CPSLint, a Domain-Specific Language (DSL) designed to support the data preparation process for industrial CPS. CPSLint raises the level of abstraction to the point where both data scientists and domain experts can perform the data preparation task. We leverage the fact that many raw data collections in the industrial CPS domain require similar actions to render them suitable for data-centric workflows. In our DSL one can express the data preparation process in just a few lines of code. CPSLint is a publicly available tool applicable for any case involving time-series data collections in need of sanitisation.

Reasoning Models Know What's Important, and Encode It in Their Activations

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.18307v2 Announce Type: replace Abstract: Language models often solve complex tasks by generating long reasoning chains, consisting of many steps with varying importance. While some steps are crucial for generating the final answer, others are removable. Determining which steps matter most, and why, remains an open question central to understanding how models process reasoning. We investigate if this question is best approached through model internals or through tokens of the reasoning chain itself. We find that model activations contain more information than tokens for identifying important reasoning steps. Crucially, by training probes on model activations to predict importance, we show that models encode an internal representation of step importance, even prior to the generation of subsequent steps. The internal representations of importance in different models yield high agreement on which steps are important. The representation is distributed across layers, and does not correlate with surface-level features, such as a step's relative position or its length. Our findings suggest that analyzing activations can reveal aspects of reasoning that surface-level approaches fundamentally miss, indicating that reasoning analyses should look into model internals.

Machine Learning-based Two-Stage Graph Sparsification for the Travelling Salesman Problem

Bo-Cheng Lin, Yi Mei, Mengjie Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.20236v2 Announce Type: replace Abstract: High-performance TSP solvers such as Lin-Kernighan-Helsgaun (LKH) search within a \emph{candidate graph} -- a small subset of edges pre-selected for the solver -- rather than over the complete graph. The two leading sparsification heuristics, $\alpha$-Nearest and POPMUSIC, each fall short of the density-coverage balance: $\alpha$-Nearest is dense with stable recall, while POPMUSIC is sparser but its recall degrades with scale. Their union closes the recall gap while remaining far below the complete graph in density, leaving room for further reduction. Existing learning-based sparsifiers score edges on the complete graph, an approach that is expensive and largely limited to Euclidean instances. We propose a two-stage method that inverts this logic. Stage~1 takes the union of $\alpha$-Nearest and POPMUSIC, achieving near-perfect recall at ${\sim}6N$ edges. Crucially, the union annotates each edge with its \emph{source provenance} -- whether it was endorsed by $\alpha$-Nearest, POPMUSIC, or both. Stage~2 trains a lightweight classifier on these annotated edges and prunes the lowest-scoring ones. Because dual-source edges are almost always optimal, the learning problem reduces to filtering the single-source subset -- a substantially easier task than classifying all $O(N^2)$ edges from scratch. Across four distance types, five spatial distributions, and problem sizes from 50 to 500, the pipeline reduces candidate-graph density by $37$-$47\%$ while retaining ${\geq}99.69\%$ of optimal-tour edges, and matches or exceeds the coverage of recent Euclidean-only neural sparsifiers at lower density at TSP500.
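
The two-stage idea (union with source provenance, then pruning of the single-source subset) can be sketched as follows. This is an illustration under assumptions: the edge sets are given as plain node pairs, `score` is a hypothetical stand-in for the trained classifier, and always keeping dual-source edges is a simplification suggested by, but not stated in, the abstract.

```python
def union_with_provenance(alpha_edges, popmusic_edges):
    """Merge two candidate edge sets, tagging each undirected edge with its source(s)."""
    def normalize(edges):
        return {tuple(sorted(e)) for e in edges}

    a, p = normalize(alpha_edges), normalize(popmusic_edges)
    tagged = {}
    for edge in a | p:
        if edge in a and edge in p:
            tagged[edge] = "both"
        elif edge in a:
            tagged[edge] = "alpha"
        else:
            tagged[edge] = "popmusic"
    return tagged

def prune(tagged, score, threshold):
    """Keep dual-source edges; keep a single-source edge only if its score passes."""
    return {
        edge
        for edge, source in tagged.items()
        if source == "both" or score(edge, source) >= threshold
    }

# Hypothetical candidate sets on a 4-node instance.
alpha = [(0, 1), (1, 2), (2, 3)]
popmusic = [(1, 0), (2, 3), (0, 3)]
tagged = union_with_provenance(alpha, popmusic)
# Toy scorer standing in for the learned classifier.
kept = prune(tagged, lambda edge, source: 0.9 if source == "alpha" else 0.1, 0.5)
```

Only the single-source edges ever reach the scorer, which is the reduction in problem size the abstract attributes to the provenance annotation.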

Lexicographic Minimum-Violation Motion Planning using Signal Temporal Logic

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.20428v2 Announce Type: replace Abstract: Motion planning for autonomous vehicles often requires satisfying multiple conditionally conflicting specifications. In situations where not all specifications can be met simultaneously, minimum-violation motion planning maintains system operation by minimizing violations of specifications in accordance with their priorities. Signal temporal logic (STL) provides a formal language for rigorously defining these specifications and enables the quantitative evaluation of their violations. However, a total ordering of specifications yields a lexicographic optimization problem, which is typically computationally expensive to solve using standard methods. We address this problem by transforming the multi-objective lexicographic optimization problem into a single-objective scalar optimization problem using non-uniform quantization and bit-shifting. Specifically, we extend a deterministic model predictive path integral (MPPI) solver to efficiently solve optimization problems without quadratic input cost. Additionally, a novel predicate-robustness measure that combines spatial and temporal violations is introduced. Our results show that the proposed method offers an interpretable and scalable solution for lexicographic STL minimum-violation motion planning within a single-objective solver framework.
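
The reduction from lexicographic to scalar optimization via quantization and bit-shifting can be illustrated with integer packing. The sketch below uses uniform quantization for simplicity, whereas the abstract specifies non-uniform quantization. The function names, the 8-bit width, and the per-specification upper bounds are assumptions.

```python
def quantize(violation, max_violation, bits):
    """Map a violation in [0, max_violation] to an integer level in [0, 2**bits - 1].

    Assumes max_violation > 0; values outside the range are clipped.
    """
    levels = (1 << bits) - 1
    clipped = min(max(violation, 0.0), max_violation)
    return round(clipped / max_violation * levels)

def lexicographic_cost(violations, max_violations, bits=8):
    """Pack per-specification levels into one integer, highest priority first."""
    cost = 0
    for violation, bound in zip(violations, max_violations):
        # Shift earlier (higher-priority) levels into more significant bits.
        cost = (cost << bits) | quantize(violation, bound, bits)
    return cost

# Two hypothetical trajectories, three specifications in priority order.
bounds = [1.0, 1.0, 1.0]
plan_a = [0.0, 1.0, 1.0]  # satisfies the top-priority spec, violates the rest
plan_b = [0.1, 0.0, 0.0]  # slightly violates the top-priority spec only
```

Comparing the packed integers is equivalent to comparing the tuples of quantized levels lexicographically, so a single-objective solver that minimizes this cost prefers `plan_a`. Violations that differ by less than one quantization step are treated as ties.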

BSViT: A Burst Spiking Vision Transformer for Expressive and Efficient Visual Representation Learning

Hongxiang Peng, Dewei Bai, Hong Qu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.23165v2 Announce Type: replace Abstract: Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced to restrict attention to local neighborhoods, resulting in structure-aware sparsity and reduced computational overhead. In addition, burst spike coding is systematically integrated across the network to increase spike-level representational capacity beyond conventional binary spiking. Extensive experiments on both static and event-based vision benchmarks demonstrate that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.

The Pragmatic Persona: Discovering LLM Persona through Bridging Inference

Jisoo Yang, Jongwon Ryu, Minuk Ma, Trung X. Pham, Junyeong Kim — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.24079v2 Announce Type: replace Abstract: Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference -- implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency- or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Code is available at https://github.com/JiSoo-Yang/Persona_Bridging.git

Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.24806v2 Announce Type: replace Abstract: Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat Row" paradigm, which pre-materializes these sequences into every training example, creates a storage and I/O wall where data infrastructure usage exceeds GPU training capacity due to data redundancy that is amplified in multi-tenant environments where models with vastly different sequence length requirements share a union dataset. We present a \emph{versioned late materialization} paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage across both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains, serving as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.
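
The contrast between pre-materialized rows and versioned pointers can be sketched with an append-only log. This is a toy model under assumptions: `UIHPointer`, `UIHStore`, and the use of the event count as the version are hypothetical names and choices for illustration, not the production design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UIHPointer:
    """Lightweight reference stored in a training example instead of the full sequence."""
    user_id: str
    version: int  # number of events that were visible when the example was logged

class UIHStore:
    """Append-only per-user interaction log (stand-in for the immutable tier)."""

    def __init__(self):
        self._events = {}

    def append(self, user_id, event):
        log = self._events.setdefault(user_id, [])
        log.append(event)
        return len(log)  # new version after this event

    def materialize(self, pointer, max_len=None):
        """Rebuild the sequence as of `pointer.version`, optionally truncated."""
        prefix = self._events.get(pointer.user_id, [])[: pointer.version]
        if max_len is None:
            return prefix
        return prefix[max(0, len(prefix) - max_len):]

store = UIHStore()
store.append("u1", "click:a")
version = store.append("u1", "click:b")
pointer = UIHPointer("u1", version)  # example logged here
store.append("u1", "buy:c")          # later event, must not leak into the example
```

Materializing through `pointer` returns only the first two events regardless of later appends, and `max_len` lets tenants with different sequence-length needs read from the same stored history.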

Pseudo-Complex Quantifier Elimination

Nicolas Faroß, Thomas Sturm — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.26400v2 Announce Type: replace Abstract: We describe the design of a quantifier elimination framework for the complex numbers in the language of ordered rings supplemented with symbols for the imaginary unit, real parts, imaginary parts, and conjugates. Technically, we use a reduction to real quantifier elimination followed by a heuristic reinterpretation of the results within our complex framework. We present computational examples using a prototypical implementation of our approach in our Python-based open-source system Logic1.

Select to Think: Unlocking SLM Potential with Local Sufficiency

Wenxuan Ye, Yangyang Zhang, Xueli An, Georg Carle, Yunpu Ma — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.26940v2 Announce Type: replace Abstract: Small language models (SLMs) offer efficient deployment, yet they often lag behind their larger counterparts (LLMs) in reasoning. Existing remedies either invoke an LLM at points of reasoning divergence, incurring substantial latency and cost, or rely on standard distillation, which is limited by the SLM's capacity to accurately mimic the LLM's complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM's preferred token often resides within the SLM's top-K next-token predictions, even when failing to emerge as the SLM top-1 choice. We therefore propose Select to Think (S2T), which reframes the LLM's role from open-ended generation to selection among the SLM's proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-Local, which distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, a 1.5B SLM's top-8 candidates contain the 32B LLM's choice with a 95% hit rate, and S2T-Local improves the 1.5B SLM's Math Avg. over greedy decoding by 24.1% relative gain, matching the efficacy of 8-path self-consistency with single-trajectory efficiency.
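
The local-sufficiency measurement (how often the LLM's preferred token falls within the SLM's top-K) can be sketched as a simple hit-rate computation. The sketch assumes per-step SLM scores are available as plain lists and that every step passed in is a divergence point. The function names are illustrative.

```python
def top_k_tokens(logits, k):
    """Return the k token ids with the largest scores, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def hit_rate(slm_logits_per_step, llm_choices, k):
    """Fraction of steps where the LLM's chosen token is among the SLM's top-k."""
    hits = sum(
        choice in top_k_tokens(logits, k)
        for logits, choice in zip(slm_logits_per_step, llm_choices)
    )
    return hits / len(llm_choices)

# Hypothetical scores over a 4-token vocabulary at three divergence points.
slm_logits = [
    [2.0, 1.0, 0.5, 0.1],
    [0.1, 0.2, 3.0, 1.5],
    [1.0, 0.9, 0.8, 0.7],
]
llm_choices = [1, 3, 3]
rate_at_2 = hit_rate(slm_logits, llm_choices, k=2)
```

In this made-up example the LLM's token is never the SLM's top-1 choice but is in the top-2 at two of three steps, which is the kind of gap that selection among the SLM's own proposals is meant to close.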

BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.27277v3 Announce Type: replace Abstract: Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and require substantial labeled data. Here we show that a single self-supervised representation can generalize across heterogeneous brain MRI endpoints. We trained BrainDINO, a self-distilled foundation model, on approximately 6.6 million unlabeled axial slices from 20 datasets encompassing broad variation in population, disease, and acquisition setting. Using a frozen encoder with lightweight task heads, BrainDINO supported transfer across tumor segmentation, neurodegenerative and neurodevelopmental conditions classification, brain age estimation, post-stroke temporal prediction, molecular status prediction, MRI sequence classification, and survival modeling. Across tasks and supervision regimes, BrainDINO consistently equaled or exceeded natural-image and MRI-specific self-supervised baselines, with particularly strong advantages under label scarcity. Representation analyses further showed anatomically organized and pathology-sensitive feature structure in the absence of task-specific supervision. Our findings indicate that large-scale slice-wise self-supervised learning can yield a unified brain MRI representation that supports diverse neuroimaging tasks without volumetric pretraining or full-network fine-tuning, establishing a scalable foundation for robust and data-efficient brain imaging analysis. Code is available at https://github.com/mclwu22/BrainDINO

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

Adam Ishay, Joohyung Lee — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.27960v2 Announce Type: replace Abstract: Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While neuro-symbolic methods attempt to mitigate these issues by coupling LLMs with symbolic reasoners, existing approaches typically rely on monotonic logics (e.g., SMT) that cannot represent defeasible reasoning -- an essential component of human cognition. We present "LLM+ASP," a framework that translates natural language into Answer Set Programming (ASP), a nonmonotonic formalism based on stable model semantics. Unlike prior "LLM+ASP" approaches that require manually authored knowledge modules, domain-specific prompts, or evaluation restricted to single problem classes, our framework operates without any per-task engineering and applies uniformly across diverse reasoning tasks. Our system utilizes an automated self-correction loop where structured feedback from the ASP solver enables iterative refinement. Evaluating across six diverse benchmarks, we demonstrate that: (1) stable model semantics allow LLMs to naturally express default rules and exceptions, outperforming SMT-based alternatives by significant margins on nonmonotonic tasks; (2) iterative self-correction is the primary driver of performance, effectively replacing the need for handcrafted domain knowledge; (3) compact in-context reference guides substantially outperform verbose documentation, revealing a "context rot" phenomenon where excessive context hinders constraint adherence.

Optimal Spatio-Temporal Decoupling for Bayesian Conformal Prediction

Yu-Hsueh Fang, Chia-Yen Lee — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.00432v2 Announce Type: replace Abstract: Online conformal prediction must balance fast adaptation to distribution shift against stable coverage: feedback-driven methods react quickly but become volatile, while strongly discounted Bayesian methods lag and inflate intervals at tight coverage. We introduce \textbf{State-Adaptive Bayesian Conformal Prediction (SA-BCP)}, which forms the predictive quantile as a gated convex combination of long-term temporal inertia and local spatial evidence from a kernel density estimate, controlled by a single interpretable evidence threshold $K$. We establish three results: (i) asymptotic marginal validity of the resulting intervals; (ii) a closed-form expression for the MSE-optimal threshold, $K^*_{\mathrm{MSE}}=\alpha(1-\alpha)/M^{\mathcal{T}}$, trading the coverage-indicator (Bernoulli) variance against the temporal structural bias $M^{\mathcal{T}}$; and (iii) a rolling-origin procedure for selecting $K$ online -- consistent under stationarity, with $O(\sqrt{T\log N})$ regret against the best fixed $K$ and, for a segmented variant, a sublinear dynamic-regret bound under bounded drift. Across four financial-volatility and weather datasets, three target coverage levels, and eight baselines (including the strongest recent conditional-quantile methods, SPCI and KOWCPI), SA-BCP attains at-or-above-nominal coverage in most settings while producing substantially sharper intervals -- up to roughly $3\times$ lower Winkler score than discounted Bayesian CP at the tightest coverage -- and a coverage-matched audit confirms these efficiency gains are not an artifact of under-coverage. We disclose one principal limitation: a volatility-specialized conformal-GARCH competitor remains more efficient on its home volatility-base series, though it does not transfer across domains.
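
Two quantities from the abstract can be computed directly: the closed-form threshold $K^*_{\mathrm{MSE}}=\alpha(1-\alpha)/M^{\mathcal{T}}$ and the Winkler interval score used for comparison. The sketch assumes $\alpha$ denotes the target miscoverage level and uses the standard Winkler definition (interval width plus a $2/\alpha$ penalty per unit of miss). The numeric inputs are made up for illustration.

```python
def k_star_mse(alpha, temporal_bias):
    """MSE-optimal evidence threshold: Bernoulli variance over temporal bias."""
    return alpha * (1.0 - alpha) / temporal_bias

def winkler_score(lower, upper, y, alpha):
    """Interval width, plus a penalty of 2/alpha per unit that y falls outside."""
    width = upper - lower
    if y < lower:
        return width + 2.0 / alpha * (lower - y)
    if y > upper:
        return width + 2.0 / alpha * (y - upper)
    return width

# Hypothetical values: 90% target coverage and a temporal bias of 0.009.
threshold = k_star_mse(alpha=0.1, temporal_bias=0.009)
covered = winkler_score(0.0, 2.0, y=1.0, alpha=0.1)
missed = winkler_score(0.0, 2.0, y=3.0, alpha=0.1)
```

A larger temporal bias lowers the threshold, and a miss is penalized more heavily at tighter coverage because the penalty scales with $1/\alpha$.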

Possibilistic Predictive Uncertainty for Deep Learning

Yao Ni, Jeremie Houssineau, Yew-Soon Ong, Piotr Koniusz — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.00600v3 Announce Type: replace Abstract: Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliable epistemic uncertainty modeling. Existing methods for uncertainty modeling face a fundamental dilemma: Bayesian approaches provide principled estimates but remain computationally prohibitive, while efficient second-order predictors lack rigorous connections between their specific objectives and epistemic uncertainty quantification. To resolve this dilemma, we introduce Dirichlet-approximated possibilistic posterior predictions (DAPPr), a principled framework grounded in possibility theory. We define a possibilistic posterior over parameters, project it to the prediction space via supremum operators, and approximate the projected posterior using learnable Dirichlet possibility functions. This projection-and-approximation strategy yields a simple training objective with closed-form solutions. Despite its simplicity, extensive experiments across diverse benchmarks show that DAPPr achieves competitive or superior uncertainty quantification performance over state-of-the-art second-order predictors while maintaining both principled derivation and computational efficiency. Code is available at https://github.com/MaxwellYaoNi/DAPPr.

VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.01391v2 Announce Type: replace Abstract: Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets and restricted entity types, failing to capture the freeform, multi-action interactions between diverse entities which characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA, and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases obscured by traditional metrics. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.

GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models

Zeshang Li, Shuoyang Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.01733v3 Announce Type: replace Abstract: Vision-Language Models (VLMs) hallucinate objects that are not present, and a growing line of work tries to curb this by feeding the model its own generated caption as auxiliary evidence -- assuming that a caption, once available, is something to consume. We show this fails: naively appending a caption can lower accuracy rather than raise it, dropping Qwen2.5-VL-3B$^\dagger$ on HallusionBench by nearly ten points. To understand why, we build \textbf{GD-Probe}, a diagnostic set that pairs a global and a detail question on the same image, so that any difference in caption effect is attributable to the question alone. Caption utility proves to be a \emph{per-query} property: the same caption helps global questions and harms detail ones, through a single mechanism -- an embedded caption competes with the image for attention and pulls the model's evidence onto its own text -- whose sign is set by whether the caption \emph{covers} the queried content. Crucially, this regime is readable from quantities the decoder already emits, with no attention access or grounding. We turn this into \textbf{GEASS} (Gated Evidence-Adaptive Selective Caption Trust), a training-free, logit-level module that decides per query how much of the caption to trust, gating it by the clean path's confidence, weighting it by the entropy reduction it induces, and raising the evidence bar when the two pathways disagree. Across four VLMs and two benchmarks (POPE and HallusionBench), GEASS improves over both vanilla inference and contrastive decoding under a single fixed setting, adding only two forward passes and no parameters.

A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)

Michael Thielscher, Tran Cao Son — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.02249v2 Announce Type: replace Abstract: We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi-agent system after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim to provide a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event model based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.03460v3 Announce Type: replace Abstract: Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2 x 2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain -- where the distinction between deterministic assessment and stochastic prediction is particularly critical -- as ten financial reasoning tasks, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at https://github.com/seunghan96/FinSTaR.

Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.03847v2 Announce Type: replace Abstract: Distributed collaborative intelligence (DCI), encompassing edge-to-edge architectures, federated learning, transfer learning, and swarm systems, creates environments in which emergent risk is structurally unavoidable: locally correct decisions by individual agents compose into globally unacceptable behavioral trajectories under uncertainty. Existing approaches such as constrained optimization, safe reinforcement learning, and runtime assurance evaluate acceptability at the level of individual actions rather than across behavioral trajectories, and none addresses the multi-participant, uncertainty-laden nature of DCI deployments. This paper introduces mechanical conscience (MC), a novel concept and simplified mathematical framework that operationalizes trajectory-level normative regulation for both single-agent and distributed intelligent systems. Mechanical conscience is defined as a supervisory filter that minimally corrects a baseline policy's actions to reduce cumulative deviation from a normatively admissible region, while accounting for epistemic uncertainty. We introduce associated constructs, conscience score, mechanical guilt, and resonant dependability, that provide an interpretable vocabulary and computable governance signals for this emerging field. Core theoretical properties are established: admissibility equivalence, existence of optimal regulation, and monotonic deviation reduction. Illustrative results demonstrate that MC-regulated agents maintain trajectory-level normative acceptability where conventional controllers drift outside admissible bounds, and that the framework naturally extends to suppress interaction-induced emergent risk in multi-agent DCI settings.

A Simple Method for School Choice Lotteries

Yasunori Okumura — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.06721v2 Announce Type: replace Abstract: This note proposes a simple polynomial-time method for constructing an ex ante stable school-choice lottery satisfying equal treatment of equals (ETE). We show that the ETE reassignment of any constrained efficient stable matching is ex ante stable, satisfies ETE, and is not ordinally dominated by any other ex ante stable lottery. We further show that there exists a constrained efficient stable matching whose ETE reassignment is not ordinally dominated by any ex post stable lottery.

The Safety-Aware Denoiser for Text Diffusion Models

Amman Yusuf, Zhejun Jiang, Mijung Park — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.08116v2 Announce Type: replace Abstract: Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety approaches are geared toward autoregressive models and typically rely on post-hoc filtering or inference-time interventions. These are inadequate for effectively addressing safety risks in text diffusion models. We propose the Safety-Aware Denoiser (SAD), a safety-guidance framework in text diffusion models. The SAD modifies the iterative denoising process such that the text sample at the final denoising step is steered toward provably safe regions of the text space. This inference-time method can integrate safety constraints into the denoiser, avoiding computationally expensive retraining of the underlying diffusion model and enabling flexible, lightweight safety guidance. We evaluate the safety of the generated text using the SAD, with respect to hazard taxonomy, memorization, and jailbreak. Experimental results show that SAD substantially reduces unsafe generations while preserving generation quality, diversity, and fluency, outperforming existing methods. These results demonstrate that our safety guidance during denoising provides an effective and scalable mechanism for enforcing safety in text diffusion models.

COSMOS: Model-Agnostic Personalized Federated Learning with Clustered Server Models and Pseudo-Label-Only Communication

Ben Rachmut, Luise Ge, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.11165v2 Announce Type: replace Abstract: Federated learning (FL) in heterogeneous environments remains challenging because client models often differ in both architecture and data distribution. While recent approaches attempt to address this challenge through client clustering and knowledge distillation, simultaneously handling architectural and statistical heterogeneity remains difficult. We introduce COSMOS, a model-agnostic framework that enables server-side personalization using only pseudo-label communication. Clients train local models and predict on the public data; the server clusters clients by prediction similarity, trains a cluster-specific model for each group using its own compute, and distills the resulting models back to clients. We provide the first theoretical analysis showing that distillation from the learned cluster models can yield exponential personalization risk contraction, going beyond the convergence-to-stationarity guarantees typically provided in model-agnostic FL. Experiments across benchmarks demonstrate that COSMOS consistently outperforms all model-agnostic FL baselines while remaining competitive with state-of-the-art personalized FL methods. More broadly, our results highlight personalized server-side learning with pseudo-labels as a promising paradigm for scalable and model-agnostic federated learning in highly heterogeneous environments.

Strategic PAC Learnability via Geometric Definability

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.13426v3 Announce Type: replace Abstract: Strategic classification studies learning settings in which individuals can modify their features, at a cost, in order to influence the classifier's decision. A central question is how the sample complexity of the induced (strategic) hypothesis class depends on the complexities of the underlying hypothesis class and the cost structure governing feasible manipulations. Prior work has shown that in several natural settings, such as linear classifiers with norm costs, the induced complexity can be controlled. We begin by showing that such guarantees fail in general - even in simple cases: there exist hypothesis classes of VC dimension $1$ on the real line such that, even under the simplest interval neighborhoods, the induced class has infinite VC dimension. Thus, strategic behavior can turn an easy learning problem into a non-learnable one. To overcome this, we introduce structure via a geometric definability assumption: both the hypothesis class and the cost-induced neighborhood relation can be defined by first-order formulas over $\mathbb{R}_{\mathtt{exp}}$. Intuitively, this means that hypotheses and costs can be described using arithmetic operations, exponentiation, logarithms, and comparisons. This captures a broad range of natural classes and cost functions, including $\ell_p$ distances, Wasserstein distance, and information-theoretic divergences. Under this assumption, we prove that learnability is preserved, with sample complexity controlled by the complexity of the defining formulas.

Elastica++: A high-performance, multiphysics framework for large interacting assemblies of Cosserat rods

Tejaswin Parthasarathy, Seung Hyun Kim, Songyuan Cui, Mattia Gazzola — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.13766v2 Announce Type: replace Abstract: Soft, slender structures are ubiquitous in natural and engineered systems, with broad application potential from biomimetic materials to soft robotics. However, there is a notable lack of computational tools that simultaneously preserve high-fidelity continuum rod mechanics, scale to large interacting ensembles, and remain flexible across diverse biophysical settings. Here we introduce Elastica++, an open-source, high-performance implementation of the Cosserat-rod model for large-scale simulations of slender-body dynamics. Elastica++ combines performance-oriented kernels with shared-memory parallelism to sustain teraflop-scale throughput despite complex discretization domains and physical interactions. The framework further interoperates with external numerical solvers, supporting efficient multiphysics workflows. We demonstrate robustness and breadth through case studies spanning passive nest-like metamaterials, collective active-matter dynamics, cilia carpets, soft magnetic microrobots, and schooling swimmers. Elastica++ thus provides a missing foundation for high-throughput studies of emergent behavior in interacting assemblies of elastic slender structures.

Given, When, Then, Again: Mining Subscenario Refactoring Candidates in Behaviour-Driven Test Suites with ML Classifiers and LLM-Judge Baselines

Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.14568v2 Announce Type: replace Abstract: Context. Behaviour-Driven Development (BDD) test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. SBERT / UMAP / HDBSCAN clustering recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An XGBoost extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p = 1.5e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, and cross-organisational shared-step candidate, respectively; the figures are stable under a sweep of the classifier decision threshold. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring candidates; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
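
The window-mining step described in the abstract can be illustrated with a minimal sketch. It assumes each scenario has already been reduced to a list of paraphrase-robust cluster identifiers (plain strings here); the function names `count_slices` and `recurring` are hypothetical and not taken from the released pipeline, and the three counting scopes are not modelled.

```python
from collections import Counter


def count_slices(scenarios, min_len=2, max_len=18):
    """Count every contiguous window of min_len..max_len steps.

    `scenarios` is a list of step sequences; each step is assumed to be a
    cluster identifier, so paraphrased steps already share one key.
    """
    counts = Counter()
    for steps in scenarios:
        longest = min(max_len, len(steps))
        for length in range(min_len, longest + 1):
            for start in range(len(steps) - length + 1):
                counts[tuple(steps[start:start + length])] += 1
    return counts


def recurring(counts):
    """Keep only the slices that occur at least twice."""
    return {slice_: n for slice_, n in counts.items() if n >= 2}


# Two toy scenarios sharing the opening steps "login" -> "open_cart".
scenarios = [
    ["login", "open_cart", "pay"],
    ["login", "open_cart", "logout"],
]
counts = count_slices(scenarios)
shared = recurring(counts)
```

In this toy input only the two-step window `("login", "open_cart")` recurs, which is the kind of slice a Background or reusable-scenario refactoring would target.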

A Theory of Training Profit-Optimal LLMs

Sophie Hao, William Merrill — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.16430v3 Announce Type: replace Abstract: Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model of firm behavior, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency $E$ (FLOPs/\$) at a near-linear rate; total training cost then scales sub-quadratically in $E$. Data efficiency improvements incentivize larger models and training expenditure. When we are limited to $D$ data, profit-optimal training expenditure scales as $D^2/E$, i.e., it increases with data and decreases with hardware efficiency (as well as data efficiency). Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or assuming hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, providing a foundation for engaging critically with industry statements and supporting long-term economic decision making.

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

Renjie Gu, Kaichen Zhou, Yan Luo, Mengyu Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.16713v2 Announce Type: replace Abstract: Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.

The Range Shrinks, the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort

Aleksandr Churilov (Independent Researcher) — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.17062v2 Announce Type: replace Abstract: Spracklen et al. (USENIX Security '25) showed that code-generating large language models hallucinate package names that do not exist on PyPI or npm at rates ranging from 5.2% on commercial models to 21.7% on open-source models, creating an attack surface for slopsquatting -- the registration of malicious packages under hallucinated names. We replicate their methodology on five frontier code-capable LLMs released between October 2025 and March 2026: Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, and DeepSeek V3.2. Across 199,845 paired Python and JavaScript prompts validated against PyPI and npm master lists, we measure overall hallucination rates between 4.62% (Claude Haiku 4.5) and 6.10% (GPT-5.4-mini) -- an order-of-magnitude compression of the inter-model spread observed by Spracklen, but not a retirement of the threat. Beyond replication, we identify a set of 127 package names (109 on PyPI, 18 on npm) that all five evaluated models invent identically; following coordinated disclosure with PyPI Security and Socket.dev, 53 of these (41 on PyPI, 12 on npm) remain registrable by an attacker after each registry's existing defenses, constituting a model-agnostic supply-chain attack surface that no single-model study can reveal. We further document a Python-over-JavaScript hallucination asymmetry that inverts Spracklen's 2024 finding, identify a Haiku-below-Sonnet inversion within the Anthropic family, and observe a Jaccard-similarity peak between DeepSeek V3.2 and GPT-5.4-mini (J = 0.343) suggestive of shared training-data origins.
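
Two set computations underlie the cross-model findings in the abstract: the Jaccard similarity between two models' hallucinated-name sets, and the intersection of names invented by every model. The sketch below is illustrative only; the function names and the toy package names are hypothetical, and the paper's own tooling may differ.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections of names."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def shared_by_all(name_sets):
    """Names that appear in every model's hallucinated set."""
    sets = [set(s) for s in name_sets]
    return set.intersection(*sets) if sets else set()


# Toy hallucinated-name sets for three hypothetical models.
model_a = {"fastjsonx", "pyauthlib2", "numpy-extras"}
model_b = {"fastjsonx", "pyauthlib2", "reqwest-py"}
model_c = {"fastjsonx", "tinyorm-lite"}

similarity = jaccard(model_a, model_b)
common = shared_by_all([model_a, model_b, model_c])
```

Here `model_a` and `model_b` share two of four distinct names, and only one name is invented by all three toy models.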

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.17770v3 Announce Type: replace Abstract: The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive ``fast thinking'' text generation to systematic, step-by-step ``slow thinking'' reasoning, unlocking state-of-the-art performance in complex mathematical and logical tasks. However, the field faces \textit{the fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization relying on costly external verifiers}. We identify and formally define \textbf{Entropy-Gradient Inversion}, a robust negative correlation between token entropy and logit gradients that acts as a definitive geometric fingerprint for LRM reasoning capability. Building on this, we propose \textbf{Correlation-Regularized Group Policy Optimization (CorR-PO)}, which embeds this inversion signature into RL reward regularization. Extensive experiments on various reasoning benchmarks across multiple model scales show CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion directly correlates with superior reasoning performance.

Tangent Blow-Ups for Processing Non-Manifold Geometry

Alice Petrov, Mohammad Sina Nabizadeh, Ana Dodik, Justin Solomon — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.18215v2 Announce Type: replace Abstract: Many geometry processing pipelines implicitly assume their input data is a manifold, or is sampled from one, with a unique tangent plane at every point. Geometric data, however, routinely contains sharp features like edges, corners, self-intersections, branching junctions, and other singularities, rendering standard methods ill-defined at these points. To bring geometry processing to these and other singular spaces, we introduce the ``tangent blow-up,'' a representation inspired by algebraic geometry that restores structure at singularities by lifting to the product of the ambient space and the Grassmannian of tangent planes. After iterating this construction, points that coincide in position but differ in tangent direction, curvature, or higher-order contact become well-separated. We equip the tangent blow-up with a product metric and define discretized differential operators, such as the gradient, divergence, and Laplacian, directly in the lifted domain. We demonstrate our framework across geodesic computation, segmentation, surface parameterization, and curvature estimation.

Attacking the First-Principle: A Black-Box, Query-Free Targeted Mimicry Attack on Binary Function Classifiers

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.18231v2 Announce Type: replace Abstract: Binary function classifiers play a crucial role in maintaining the security and integrity of software systems by detecting malicious code and unauthorized modifications. However, machine learning-based classifiers are vulnerable to adversarial attacks that can evade detection. In this study, we present Kelpie, a novel framework for executing mimicry attacks, a stronger type of targeted evasion attack, on binary function classifiers in a black-box, zero-query setting. Unlike previous approaches that rely on querying the target classifier to refine untargeted evasion attacks, Kelpie leverages code transformations that preserve the functionality of malicious payloads while causing them to be misclassified as a class of the attacker's choosing. Through extensive experimentation, we demonstrate that Kelpie can successfully execute mimicry attacks against six state-of-the-art binary function classifiers representing different model architectures without requiring direct interaction with them. We further validate our approach with a practical demonstration involving a keylogger and a wiper concealed within benign-looking functions embedded in an application. This work, to the best of our knowledge, is the first to demonstrate such a mimicry attack in a black-box, zero-query context, raising important questions about the reliability and security of existing machine learning-based binary function classifiers.

Multi-Token Residual Prediction

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.18817v2 Announce Type: replace Abstract: Diffusion Language Models (DLMs) generate text by iteratively denoising masked token sequences, offering a tradeoff between parallelism and quality compared to autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Rather than running the backbone a second time to obtain the next-step logits, MRP predicts the residual between steps from the backbone's hidden states, effectively denoising more tokens per backbone forward pass at a fraction of the cost. We apply MRP across the two operating regimes of DLM decoding. In the high-quality-low-throughput static denoising regime, MRP serves as a drafter for speculative decoding: its proposals are verified against the backbone, yielding lossless acceleration of up to 1.4x in SGLang. In the low-quality-high-throughput dynamic denoising regime, MRP instead drives a remasking scheme that revokes over-eager reveals, recovering most of the accuracy lost to aggressive low-threshold decoding and improving accuracy by up to 22.6 points on the code generation task HumanEval and 17.7 points on the reasoning task GSM8K.
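
The confidence-threshold rule the abstract refers to as current practice can be sketched as follows. This is not MRP itself, only the baseline selection step, and two details are assumptions made for illustration: confidence is taken to be the top-1 probability at each masked position, and at least one token is revealed per step so decoding always makes progress. The function name is hypothetical.

```python
def select_tokens_to_reveal(confidences, masked, threshold):
    """Pick which masked positions to reveal in one denoising step.

    confidences: per-position top-1 probability (assumed confidence measure)
    masked:      indices of positions that are still masked
    threshold:   reveal every masked position at or above this confidence
    """
    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen and masked:
        # Assumed fallback: reveal the single most confident position.
        chosen = [max(masked, key=lambda i: confidences[i])]
    return chosen


confidences = [0.90, 0.40, 0.95, 0.20]
masked = [0, 1, 3]  # position 2 has already been revealed

strict = select_tokens_to_reveal(confidences, masked, threshold=0.8)
loose = select_tokens_to_reveal(confidences, masked, threshold=0.3)
```

Lowering the threshold reveals more positions per step, which is the throughput-for-quality trade the abstract describes.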

A Novel Stochastic Particle-Field Algorithm for a Reaction-Diffusion-Advection Cancer Invasion Model

Jingyuan Hu, Zhongjian Wang, Jack Xin, Zhiwen Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.20140v2 Announce Type: replace Abstract: In this paper, we present a novel numerical framework for solving a specific biological reaction-diffusion-advection system of cancer growth in three dimensions (3D) using particles of variable mass. We adopt empirical particle measures to represent cell density and dynamically construct the concentration fields of multiple related chemical species throughout the 3D domain. Efficient interaction between the particles and the spatial grid is achieved through a Particle-in-Cell (PIC) algorithm, while diffusion in space is solved rapidly using a spectral method. We demonstrate that for this particular system, the rate of change of particle mass remains bounded over finite time intervals. Furthermore, in addition to the inherent positivity preservation of cell density guaranteed by the empirical particle measures, the concentrations constructed by the algorithm are also unconditionally positivity-preserving on the spatial grid. Moreover, we present a rigorous error analysis for the proposed method, and numerical experiments confirm the theoretical convergence rates. To the best of our knowledge, this is the first numerical work to solve this system in three dimensions, wherein a rapid spread of cells driven by haptotactic flux is observed, similar to the behavior documented in the two-dimensional case.

ShapeBench: A Scalable Benchmark and Diagnostic Suite for Standardized Evaluation in Aerodynamic Shape Optimization

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.20763v2 Announce Type: replace Abstract: Rapid progress in aerodynamic shape optimization (ASO) has outpaced currently-available standardized evaluation frameworks. Fair comparison requires a unified benchmark spanning diverse shape classes, objective formulations, and matched-budget state-of-the-art baselines. We introduce ShapeBench, an open-source ASO benchmark with a unified API spanning 103 tasks across eight shape categories and multiple optimization regimes. Each ShapeBench task includes a validated surrogate for fast search; when feasible, a high-fidelity Computational Fluid Dynamics (CFD) pipeline for final verification is available, enabling systematic fidelity-gap analysis. ShapeBench provides a reproducible protocol with well-configured baselines to compare fairly using a consistent budget metric, allowing for comparison among both classical and LLM-driven methods, including general-purpose optimizers and a new domain-specialized evolutionary LLM baseline, ShapeEvolve. Results on ShapeBench demonstrate substantial variance in optimizer rankings across shape categories and problem formulations, with mean pairwise Spearman $\rho = 0.013$, so single-task conclusions do not reliably generalize across problem classes. The benchmark is also far from saturation; classical methods are rarely applicable across all shape categories and tasks, further highlighting the need for more general-purpose approaches.
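
The ranking-stability statistic quoted in the abstract, a mean pairwise Spearman correlation, can be computed as in the sketch below. It assumes each task supplies one score per optimizer, in the same optimizer order, with no tied scores (so the closed-form rank formula applies) and at least two optimizers and two tasks; the function names are hypothetical and the benchmark's own implementation may differ.

```python
from itertools import combinations


def ranks(scores):
    """Rank positions (1 = smallest score); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def mean_pairwise_spearman(task_scores):
    """Average rho over all unordered pairs of tasks."""
    pairs = list(combinations(task_scores, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Scores of three hypothetical optimizers on three toy tasks.
task_scores = [
    [0.1, 0.5, 0.9],
    [0.9, 0.5, 0.1],  # reversed ranking
    [0.2, 0.6, 0.8],
]
mean_rho = mean_pairwise_spearman(task_scores)
```

A value near zero, as reported in the abstract, means that knowing the optimizer ranking on one task says almost nothing about the ranking on another.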

Astragalus: Automatic Configuration Repair for Production Networks

Zhenrong Gu, Peng Zhang, Xing Feng, Xu Liu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.22092v2 Announce Type: replace Abstract: Network configurations are prone to errors, which can lead to catastrophic service outages. A tool that can achieve automatic configuration repair (ACR) is highly desired by operators. Existing tools for ACR follow a \textit{semantics-driven approach}: they model network semantics as a set of SMT constraints, and solve them for a location or fix of the error. Due to the complex semantics of networks, constructing and solving these constraints can be prohibitively expensive, making these tools neither general nor scalable. Inspired by automatic program repair (APR), we explore another direction, i.e., a \textit{syntax-driven approach}, which generates and validates syntactically-valid candidate updates without modeling program semantics, often drawing on existing code in the same repository. Following this direction, we propose Astragalus, a syntax-driven method for ACR. It uses multiple iterations of a "localize-fix-validate" pipeline to search for repairs, and proves quite effective on configurations of our production network. Specifically, we show that Astragalus can repair every incident in multiple sizes of a synthesized network, and 97.5% of the incidents on a real network, both with 15 types of errors injected, within an average time of 6.93 seconds. It has also provided valid repairs in under 6 minutes for 7 recent network incidents or undesired changes, in a real production network with O(1,000)~O(10,000) devices.

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

V\'ictor Yeste, Paolo Rosso — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.22641v3 Announce Type: replace Abstract: Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touch\'e ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8-4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.

Appearance-Invariant Detection of Suggestive Motion via Laban Movement Descriptors

Jaehoon Ahn, Jeonghan Kong, Moon-Ryul Jung — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.24488v2 Announce Type: replace Abstract: Content moderation in online multiplayer 3D virtual environments is increasingly automated, yet detection has focused on images, video, and audio, leaving suggestive motion a blind spot. We present a motion-only classification pipeline that detects suggestive and explicit movement from SMPL skeleton trajectories using Laban Movement Analysis (LMA) descriptors. On a dataset spanning everyday, artistic, suggestive, and explicit movement (17+ hours of video), a logistic regression trained on 61-feature LMA descriptors reaches 68% binary SFW/NSFW accuracy (70% random forest) under a leak-free evaluation protocol. At this level, our descriptor performs comparably to a learned video model trained on the same motion re-rendered as appearance-free video, a gray figure with no clothing, skin, or scene. The indirectness (tortuosity) of each joint's trajectory, measured as the ratio of the joint's path length to its net displacement, peaks at the suggestive tier, showing that the Direct-to-Indirect polarity of Laban's Space factor provides an interpretable marker of the shift from functional to suggestive motion. Ultimately, Laban-based kinematic descriptors offer a lightweight, interpretable approach to suggestive-motion detection: every decision decomposes into named, theory-grounded features. Because the classifier operates on pose trajectories alone, moderation can run directly on avatar poses in virtual environments, with no appearance data.
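
The indirectness measure defined in the abstract, path length divided by net displacement, is simple to compute for one joint. The sketch below assumes the trajectory is a sequence of 3D positions; the function name and the handling of zero net displacement (returning infinity) are assumptions made for illustration, not details from the paper.

```python
import math


def tortuosity(points):
    """Ratio of a joint's path length to its net displacement.

    points: sequence of (x, y, z) positions over time.
    Returns math.inf when the joint ends where it started
    (an assumed convention for zero net displacement).
    """
    path = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    net = math.dist(points[0], points[-1])
    if net == 0:
        return math.inf
    return path / net


straight = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
detour = [(0, 0, 0), (1, 1, 0), (2, 0, 0)]

direct_value = tortuosity(straight)
indirect_value = tortuosity(detour)
```

A perfectly direct movement gives a ratio of 1, and the ratio grows as the joint wanders away from the straight line between its start and end positions.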

Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability

David N. Olivieri, Antonio F. Pérez Rodríguez — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.25225v2 Announce Type: replace Abstract: Mechanistic interpretability often studies Transformer behavior by intervening on internal activations through activation patching, causal tracing, path patching, and steering directions. This paper develops Transformer Field Theory: a response-theoretic framework in which the residual stream of a fixed forward pass is treated as a Transformer field over layer depth and token position. In this formulation, patching becomes a localized source insertion into the Transformer field, first-order sensitivity fields predict patch effects, Green functions describe downstream propagation, and patch selection is posed as an adjoint inverse problem. Empirically, we test the theory's forward response objects in GPT-2-style autoregressive Transformers. Localized Transformer-field interventions exhibit a bounded local linear regime; first-order sensitivities predict patch effects across layer-token sites; localized sources generate structured anisotropic Transformer-field propagation; high-sensitivity sites and sliced Green operators provide reduced response descriptions; and prompt-induced Transformer-field displacements partially transfer answer behavior. These results establish sensitivities, Transformer-field responses, and sliced Green operators as practical objects for organizing patching experiments, while providing the forward mathematical basis for patch-site inference and cross-scale response transfer.

LENS: A Staged Design for Interaction Granularity in Sequential CTR Prediction

Yuan Wang, Yue Liu, Jun Zhang, Jie Jiang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.25583v2 Announce Type: replace Abstract: In sequential CTR prediction, a central design question is at what granularity the target should interact with the user behaviour sequence. Existing models mainly follow two routes. Raw-item architectures such as DIN let the target score each item in the sequence directly. This relies on well-trained item embeddings and becomes brittle for sparse items. Latent-query architectures such as HyFormer, MixFormer, and OneTrans build query representations by combining the target with other information. This is more robust across item-density regimes but blunter: target-specific control is diluted. We propose LENS to restore target-specific control within these coarser bottlenecks. LENS has two modules: a Target-Conditioned Query Gate (TCQG) for query activation and a Target-Conditioned Position Bias (TCPB) for history retrieval. We further introduce Query-Specific Position Bias (QueryPos), a simple static position-aware reference for latent-query backbones. Across three representative latent-query backbones and four datasets, the combined QueryPos+LENS design achieves positive total-gain point estimates in all twelve evaluated backbone--dataset cells. We also identify a density-dependent conditioning rule: as item density decreases, the optimal condition source shifts from item-only to item-plus-sequence.

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

JunJia Guo (Joe), Yuhang Yao (Joe), Jiawei (Joe) Zhou, Jingdi Chen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.26144v2 Announce Type: replace Abstract: We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.

Subsystem Structure as an Inferential Resource for Coupled Engineered Systems

Esmaeil Ghorbani, Jürgen Hackl — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.27544v2 Announce Type: replace Abstract: Engineered infrastructure systems pose inverse problems in which hidden states, unknown parameters, and subsystem couplings must be inferred from sparse and noisy measurements. These problems are difficult because physical subsystems are heterogeneous, sensing is partial, uncertainty is distributed across subsystem interfaces, and computational cost grows rapidly with system size. We address this challenge with probabilistic compositional inference, a graph-based architecture that represents a coupled system as interacting subsystems, each retaining its own local model, estimator, and uncertainty representation, while coupling is handled through physically meaningful stochastic messages exchanged across subsystem interfaces. This formulation allows mechanistic, learned, and deterministic components to coexist within a single inference framework and propagates calibrated uncertainty without assembling a global augmented state or covariance. We validate the framework in three increasingly demanding settings: a sparse-sensing canonical inverse problem, where interface couplings can also be learned from data; infrastructure-scale power networks, where the method matches centralized joint state-and-parameter inference while reducing computational scaling from approximately cubic to approximately linear; and a multi-physics turbine embedded in a power-grid network, where heterogeneous subsystems compose hierarchically without degrading local inference or collapsing local posteriors into a global estimate. Together, these results show that subsystem structure can be exploited as the organizing principle for uncertainty-aware inverse inference in coupled engineered systems.

Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

Srini Ramaswamy — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.27628v2 Announce Type: replace Abstract: As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper explores the architectural vulnerability of unbounded autonomy - the presumption that an agent should continue operating regardless of rising uncertainty. It introduces a theory of managed autonomy that defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. We instantiate this theory via the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework featuring Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system, demonstrating how architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how incorporating domain-specific trigger sets across varied operational settings (e.g., healthcare, robotics, etc.) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent's operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward realizing reliable and governed artificial intelligence.

Universal Time Series Generation with Neural Controlled Differential Equations

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.28507v2 Announce Type: replace Abstract: Recent work on the sequence universality of State Space Models (SSMs) has introduced efficient, maximally expressive continuous-time approaches for time-series modelling. While these works focus on discriminative settings, we extend this perspective to generative time-series modelling by proving that maximally expressive Structured Linear Controlled Differential Equations (SLiCEs) are universal time-series generators, in the sense that they can approximate the induced path laws of continuous causal pushforwards on compact latent sets in $W_\infty$. Building on these theoretical results, we propose Generative SLiCEs (G-SLiCEs), a maximally expressive continuous-time model for flow matching on path-space. Empirically, we show that expressivity improves performance in probabilistic forecasting and downstream tasks, while retaining the advantages of continuous-time models such as generalising to arbitrary observation grids. This is particularly beneficial for irregular grids, where fixed-grid models often struggle.

CrossAlpha: An Annual-Report Benchmark for Cross-Market Factor Research (with LLM Agents)

Qian Wang, Zhongyi Tong, Nuo Chen, Zhaomin Wu, Bingsheng He — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.29286v2 Announce Type: replace Abstract: Cross-market factor research studies whether firm-level signals from one or more markets can predict returns in a target market, but existing public benchmarks do not support cross-market disclosure-to-return evaluation. Building such a benchmark is challenging because filings differ across languages and regulatory systems, disclosure-derived similarity can be biased by common reporting components, and cross-market signals must be evaluated under feasible trading-time alignment. We introduce CrossAlpha, a public annual-report benchmark for cross-market factor research. CrossAlpha addresses these challenges through three corresponding components: Disclosure Distillation, which standardises heterogeneous filings into ten-category English business descriptions; Residual Schema Graph Construction, which builds PCA-whitened cross-market firm-pair scores from schema-level disclosures; and Timing-Aligned Evaluation, which pairs the graph with 11 years of daily OHLCV data to construct forward-return labels under feasible cross-market execution protocols. CrossAlpha covers about 3,600 firms and 10,700 firm-year reports from the United States, Japan, Taiwan, South Korea, and Hong Kong, and releases about 19M directed firm-pair scores. In experiments, disclosure-derived cross-market peers outperform domestic text, industry-code, and return-correlation peers in the US-to-Japan setting (ICIR 0.39 versus 0.07--0.18), and cross-market sources beat the domestic text baseline in most target markets. CrossAlpha offers an open-sourced, reusable, return-grounded benchmark for cross-market financial NLP.
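
For readers unfamiliar with the ICIR figure quoted above, the sketch below shows one common way to compute it: the information coefficient (IC) is the per-period correlation between a signal and forward returns, and ICIR is the mean of the IC series divided by its standard deviation. The abstract does not state the exact convention, so the use of Pearson correlation, the sample standard deviation, and the helper names `pearson` and `icir` are assumptions; the numbers are made up for illustration.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def icir(ic_series):
    """Mean IC divided by the sample standard deviation of IC (assumed convention)."""
    return statistics.fmean(ic_series) / statistics.stdev(ic_series)


# One IC value per evaluation period: correlation of signal vs. forward return.
signal = [0.1, 0.4, 0.2, 0.9]
forward_return = [0.01, 0.03, 0.00, 0.05]
print(pearson(signal, forward_return))

# Illustrative IC series over three periods.
print(icir([0.02, 0.04, 0.06]))
```

A higher ICIR means the signal's predictive correlation is more stable across periods, not only larger on average.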

Plan, Don't Pose: Long Composite Motion Generation with Text-Aligned BFM

Nikolay Shvetsov, Maksim Bobrin, Nazar Buzun, Anton Bozhedarov, Dmitry V. Dylov — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.29906v2 Announce Type: replace Abstract: Text-to-motion (T2M) generation has broad applications in character animation, virtual avatars, and human-robot interaction. Existing methods typically generate pose trajectories or motion tokens directly from language, forcing a single model to handle semantic interpretation, long-horizon structure, and low-level physical realization. This coupling makes them costly and often unreliable for long, compositional, or semantically dense prompts. We propose Text2BFM, the first framework that aligns natural language with pretrained Behavioral Foundation Models (BFMs) for T2M generation without relying on heavy end-to-end motion generators. Text2BFM operates in the latent policy space of a frozen BFM, using it as an executable motion prior. A text-aligned variational behavioral bottleneck compresses BFM policy-latent sequences into compact motion representations that are compatible with language and preserve long-horizon behavioral structure. Generation is performed in this compact behavioral manifold with a lightweight conditional generator, and the resulting latent encoded behaviors are decoded into policy latents that drive the pretrained frozen BFM. By decoupling semantic planning from motion execution, Text2BFM achieves efficient, robust T2M generation and strong performance on long, compositional textual descriptions.

Triangle Splatting SLAM

Nicholas Fry, Eric Dexheimer, Kirill Mazur, Paul H. J. Kelly, Andrew J. Davison — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.31419v2 Announce Type: replace Abstract: We present a dense RGB-D SLAM system using differentiable triangles as the 3D map representation. While 3D Gaussian Splatting has emerged as the leading method for novel-view synthesis, triangles remain the standard primitive for traditional rendering hardware, game engines, and downstream tasks requiring explicit geometry such as simulation, collision, and editing. Recent offline methods have demonstrated that an unstructured 'triangle soup' can be optimised into a photorealistic mesh via Delaunay triangulation across a set of posed images. Building upon this insight, we present the first dense SLAM system to employ Triangle Splatting to perform both tracking and mapping through online differentiable rendering of a triangle soup. The map can be converted into a connected mesh on-the-fly via restricted Delaunay triangulation, enabling new online capabilities such as mesh deformation and collision checking. On Replica and TUM-RGBD, our system outperforms baselines on 3D geometry, matches the camera-tracking accuracy, and enables online mesh-based scene editing.

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

Adrian de Wynter — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.31514v3 Announce Type: replace Abstract: Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergence of, ascribe to, or assume, generalised anthropomorphic attributes to them (e.g., morality or understanding of natural language). Our goal is not to argue in favour or against the existence of these attributes, but to point out that these conclusions could be incorrect. For this we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently-powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain invariant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically-grounded discussion on these attributes requires explicit measurement criteria; otherwise the interpretation is left to the representation. We then show that assuming that these attributes exist or not in a system, independent of the substrate and in a generalised way, leads to either circular or uninformative conclusions. This is regardless of the experimenter's viewpoint on the subject, or whether the outcome shows existence or non-existence. Finally we propose a 'null' assumption, where one assumes LLM non-uniqueness instead of assuming anthropomorphic attributes to set up an experiment, along with examples of it. We also discuss potential objections to our work, briefly survey the field, and prove that Age of Empires II is functionally- and Turing-complete.

BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon

Kamel Smaili, Yassine Toughrai, Amina Laggoun, David Langlois — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.00193v2 Announce Type: replace Abstract: The rapid spread of fake news on social media has become a major challenge, particularly in multilingual and under-resourced contexts such as North Africa. In this paper, we introduce BOUTEF, a large-scale multilingual corpus designed to study the propagation, characteristics, and impact of fake news in Algeria and Tunisia. The corpus integrates three complementary components: fake narratives, genuine narratives, and associated user-generated comments, along with verified debunking information. It covers a wide range of languages and linguistic varieties, including MSA, Algerian and Tunisian dialects, Arabizi, French, English, and code-switched language. Building on this resource, we conduct a comprehensive empirical analysis combining quantitative and qualitative approaches. We examine thematic distributions, linguistic and rhetorical strategies, sentiment patterns, and social engagement dynamics. Statistical analyses reveal significant associations between thematic categories and message veracity, as well as strong correlations between user engagement and the visibility of fake content. Our findings show that fake news relies heavily on emotionally charged narratives, sensational framing, and hybrid linguistic practices that enhance virality and audience engagement. In contrast, debunking content adopts a more factual and verification-oriented style. Furthermore, a comparative analysis between Algeria and Tunisia highlights both shared dynamics and country-specific characteristics shaped by sociopolitical contexts. The results emphasize the role of informal language practices in the diffusion and reception of misinformation. By providing a rich, annotated, and publicly available dataset, this work contributes to advancing research on fake news detection, low-resource language processing, and the understanding of information disorders in complex linguistic environments.

Error bounds for approximate posteriors from likelihood-informed reduced-order models

Han Cheng Lie, Jakob Scheffels, Elisabeth Ullmann — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.00274v2 Announce Type: replace Abstract: In the design of computational methods for Bayesian inverse problems, costly forward model evaluations make it difficult to sample from or compute the posterior. This motivates the need for approximate forward models that are cheaper to evaluate. We consider reduced-order forward models which exploit the lower-dimensional structure in the Bayesian inverse problem by projecting to the "likelihood-informed subspace" of the parameter space where the prior-to-posterior update is significant. However, the theoretical properties of these reduced-order forward models and their impact on the solution of the Bayesian inverse problem are not always well-understood. In this work we consider linear Gaussian inverse problems with a possibly singular prior covariance matrix. We analyse a recently proposed reduced-order model which uses a Petrov-Galerkin projection to likelihood-informed subspaces that arise in optimal low-rank approximations of the posterior covariance matrix. We bound the error in the resulting approximation of the root prior-preconditioned Hessian of the data misfit. Based on this we also bound the errors of the approximate posterior covariance and mean. Our analysis shows that this reduced-order model recovers the exact posterior when the rank of the reduced-order model is equal to the "intrinsic dimension" of the inverse problem, i.e. the rank of the prior-preconditioned Hessian. Two numerical experiments from structural engineering illustrate the performance of our bounds.

Interaction-Centered Intelligence: Toward an Interaction-Based Theory of Human-AI Co-Creation

Nicholas Davis — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.00807v2 Announce Type: replace Abstract: Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across classical AI, machine learning, and many generative systems, the dominant unit of analysis remains the individual model or autonomous system evaluated through outputs, benchmarks, prediction accuracy, or optimization performance. While these approaches have produced major advances, they often under-theorize the role of interaction in the emergence of intelligence, creativity, meaning, and adaptive behavior. This paper proposes interaction as the primary unit of analysis for co-creative AI and interaction-centered intelligence more broadly. Drawing from distributed cognition, embodied cognition, enaction, participatory sense-making, human-computer interaction, and computational creativity, the paper traces a historical progression toward increasingly relational accounts of intelligence. Building upon prior work in Creative Sense-Making, quantified co-creation, and co-creative systems such as the Drawing Apprentice and AI Drawing Partner, it argues that intelligence emerges through evolving interaction dynamics among agents, environments, and socio-technical systems rather than solely through internal computation. The paper introduces Interaction-Centered Intelligence as a framework for understanding human-AI co-creation, collaborative emergence, adaptive participation, and interactional dynamics. Rather than evaluating intelligence solely through generated outputs, the framework emphasizes interaction trajectories, coordination patterns, participatory engagement, adaptive regulation, and interactional drift unfolding through time. Implications for explainable co-creative AI, hybrid intelligence, enactive AI, and future human-AI systems are discussed.

Revisiting Neural Processes via Fourier Transform and Volterra Series

Peiman Mohseni, Nick Duffield, Raymond K. W. Wong — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.01172v2 Announce Type: replace Abstract: Modeling unknown latent functions from finite, irregularly sampled measurements is a recurring challenge across science and engineering. Neural processes (NPs), a family of probabilistic functional models, are promising solutions -- especially when endowed with domain-specific symmetries like translation equivariance, which improve sample efficiency and generalization. Yet existing translation-equivariant NPs face two limitations: (i) they stack generic components with non-linearities, obscuring the induced function class and limiting interpretability; and (ii) convolutional designs rely on kernels with local receptive fields and require dense uniform input grids, while attention-based methods avoid these issues but scale quadratically with the number of observations. We address both with two contributions. First, using the Volterra expansion, we characterize continuous translation-equivariant operators as sums of higher-order convolutions, yielding analytical transparency while admitting efficient approximation by first-order convolutions. Second, we introduce set Fourier convolutions (SFConvs), a frequency-domain parameterization that operates directly on irregularly sampled points, achieves approximately global receptive fields, and scales linearly in the number of observations. Building on these ideas, we propose two conditional NPs (CNPs): SFConvCNPs, which stack SFConv blocks with non-linearities, and SFVConvCNPs, which integrate the Volterra formulation. Experiments on synthetic and real-world datasets demonstrate our methods' efficacy against state-of-the-art baselines.

MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics

Žiga Kovačič, Kevin Ellis — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.01538v2 Announce Type: replace Abstract: To study the ability to infer physical dynamics from videos and extrapolate them forward in time, we assemble a dataset of 2D Material Point Method (MPM) physical simulations covering rich physical phenomena such as deformable objects, fluids, kinetic objects, and emitters. We study code generation and video diffusion approaches on this dataset, identifying their strengths and weaknesses by varying the amount of physically relevant side information. The code generation model, beyond giving a working demonstration of automatic synthesis of MPM simulations, reveals that such an approach struggles with inferring physical parameters from visual input, but relative to video diffusion, produces physically and temporally stable extrapolations forward in time, while the video diffusion model more strongly identifies geometric properties from visual input but produces physically implausible extrapolations.

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.01621v2 Announce Type: replace Abstract: Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a visible navigable pixel to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on RxR-CE. Project Page: https://baobao0926.github.io/Goal2Pixel/.
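
The pixel interface described above can be illustrated with a small sketch. It assumes a standard pinhole camera with known intrinsics and a depth value at the predicted pixel; the abstract does not say how depth is obtained or how the directive regions are laid out, so the padded-canvas layout, the `margin` parameter, the precedence of the bottom region, and both function names are hypothetical.

```python
def decode_prediction(u, v, width, height, margin):
    """Map a predicted pixel on a padded canvas to an action.

    Assumed layout: the image (width x height) is padded with `margin`
    columns on the left and right and `margin` rows at the bottom.
    """
    if v >= height:
        return ("stop", None)          # bottom region (assumed to take precedence)
    if u < margin:
        return ("turn_left", None)     # left region
    if u >= margin + width:
        return ("turn_right", None)    # right region
    return ("forward", (u - margin, v))  # pixel in original image coordinates


def back_project(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at `depth` into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


action, pixel = decode_prediction(u=470, v=240, width=640, height=480, margin=50)
if action == "forward":
    waypoint = back_project(pixel[0], pixel[1], depth=2.0,
                            fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    print(action, waypoint)
```

The waypoint is expressed in the camera frame; turning it into a navigation goal would also need the camera-to-robot transform, which is outside this sketch.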

Realistic noise synthesis reduces bias and improves tissue microstructure estimation with supervised machine learning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.02044v2 Announce Type: replace Abstract: Diffusion MRI enables non-invasive probing of tissue microstructure, but accurate parameter estimation is challenged by noise-related effects. In supervised machine learning frameworks trained on simulated data, discrepancies between the noise characteristics of simulated and acquired signals introduce a form of covariate shift, whereby the input signal distribution differs between training and inference. We investigated the impact of this mismatch on microstructure parameter estimation and propose a realistic noise synthesis (RNS) framework to mitigate it. RNS incorporates both the Rician expectation and the effective post-processing noise variance into simulated training signals. The Rician expectation was modelled using a noise standard deviation estimated with MPPCA, while the effective standard deviation was derived from spherical harmonic residuals of preprocessed data. The method was evaluated using the cylinder-zeppelin and the SANDI models on simulated datasets across multiple SNR levels and on in vivo diffusion data with repeated acquisitions. Sensitivity to noise misestimation was also assessed. Ignoring magnitude-induced noise effects during training produced systematic, SNR-dependent parameter bias, particularly at low SNR. Incorporating the Rician expectation substantially reduced bias to the level of noise-aware nonlinear least-squares fitting. Modelling the effective standard deviation further improved precision. Performance was largely independent of regression architecture but sensitive to accurate noise estimation. These findings demonstrate that realistic noise modelling in simulated training data mitigates signal-domain covariate shift and is essential for unbiased supervised microstructure estimation, particularly in low-SNR regimes associated with high b-values or high spatial resolution.
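
The magnitude-induced bias mentioned above can be seen in a short Monte Carlo sketch. It is not the RNS framework itself: it only simulates the magnitude of a real signal corrupted by complex Gaussian noise, which follows a Rician distribution, and compares the mean magnitude with the true signal. The function name, sample count, and seed are assumptions for the example.

```python
import math
import random


def mean_magnitude(signal, sigma, n, seed=0):
    """Average magnitude of `signal` plus complex Gaussian noise of std `sigma`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        re = signal + rng.gauss(0.0, sigma)  # real channel carries the signal
        im = rng.gauss(0.0, sigma)           # imaginary channel is pure noise
        total += math.hypot(re, im)          # magnitude reconstruction
    return total / n


n = 200_000
# No signal: the magnitude is Rayleigh distributed with mean sigma * sqrt(pi / 2).
print(mean_magnitude(0.0, 1.0, n), math.sqrt(math.pi / 2))
# Low SNR: the mean magnitude clearly overestimates the true signal of 1.0.
print(mean_magnitude(1.0, 1.0, n))
# High SNR: the bias becomes small relative to the signal of 20.0.
print(mean_magnitude(20.0, 1.0, n))
```

A model trained on simulated signals with zero-mean additive noise would not see this upward shift, which is the kind of train/inference mismatch the abstract describes as covariate shift.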

Variational Learning for Insertion-based Generation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.02133v3 Announce Type: replace Abstract: Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive modeling by allowing tokens to be generated in non-fixed and prescribed orders. Despite their practical advantages, most existing non-monotonic models are order-agnostic and rely on a fixed-length grid, limiting their ability to support variable-length generation and adaptive insertion order. In this work, we introduce a probabilistic framework for learning insertion order in variable-length insertion models. We formalize a bijective correspondence between insertion trajectories and permutations, which enables an exact reparameterization of the data likelihood as a sum over permutations. Building on this result, we propose the Insertion Process (IP), a stochastic generative model that jointly learns where to insert, what to insert, and when to terminate, trained via permutation-based variational inference. Unlike prior fixed-canvas approaches, IP natively supports variable-length generation and learns data-driven preferences over insertion orders. Experiments on goal-conditioned planning and molecular string generation demonstrate that learning insertion order improves both modeling quality and generalization in domains without a canonical left-to-right structure.
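
The correspondence between insertion trajectories and permutations can be made concrete with a toy encoding. This is a sketch under assumptions rather than the paper's construction: here a permutation lists the final positions in the order they are generated, a trajectory is a list of (slot, token) insertions, and the termination decision is ignored. The function names are hypothetical.

```python
import itertools


def to_trajectory(tokens, order):
    """Turn a generation order over final positions into (slot, token) insertions."""
    placed = []  # final positions that have already been inserted
    trajectory = []
    for pos in order:
        # The slot is the number of already-placed tokens that end up left of `pos`.
        slot = sum(1 for q in placed if q < pos)
        trajectory.append((slot, tokens[pos]))
        placed.append(pos)
    return trajectory


def from_trajectory(trajectory):
    """Replay insertions; return the final sequence and the recovered order."""
    seq, born = [], []  # `born[i]` is the step at which seq[i] was inserted
    for step, (slot, token) in enumerate(trajectory):
        seq.insert(slot, token)
        born.insert(slot, step)
    order = [0] * len(trajectory)
    for pos, step in enumerate(born):
        order[step] = pos
    return seq, order


tokens = list("abcd")
trajectories = set()
for order in itertools.permutations(range(len(tokens))):
    traj = to_trajectory(tokens, order)
    seq, recovered = from_trajectory(traj)
    assert seq == tokens and tuple(recovered) == order  # round trip
    trajectories.add(tuple(traj))

print(len(trajectories))  # one distinct trajectory per permutation
```

Because every permutation yields a distinct trajectory that reproduces the same sequence, the likelihood of a sequence can be written as a sum over permutations, which is the reparameterization the abstract refers to.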

Closed-Form PI and PID Tuning of All-Pole Plants up to Third Order for Monotonic Minimum-Settling Step Responses

Senol Gulgonul — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.02868v2 Announce Type: replace Abstract: A unified, closed-form analytical PI/PID tuning method is presented for all-pole plants up to third order that yields a strictly monotonic (zero-overshoot) step response with minimum settling time. The design target is the binomial closed loop p^n/(s+p)^n, which is monotonic with robustness depending only on the order n. Because a fixed PI/PID cannot assign the closed-loop poles and the controller zeros independently, realizing this target exactly requires the controller zeros to be cancelled, which forces the controller numerator to divide the plant denominator. It follows that an exact, real-gained solution exists for any stable plant precisely up to second order with a PI controller and third order with a PID controller; beyond that the residual binomial factor acquires a complex pair of damping sqrt(3)/2, which a generic plant does not contain. Explicit gains are derived for first-order plants (PI), second-order plants with real and complex poles (PI and PID), and third-order plants with three real poles or one real pole plus a complex pair (PID). The freedom of the coincident designs is shown to be bounded: a quadratic nonnegativity condition gives the exact window of the design pole for strict monotonicity, which collapses at the pole-ratio-2 changeover for real poles and is nonempty for damping ratios above approximately 0.443 for complex poles. Monotonicity guarantees Mt = 1, hence Ms <= 2, phase margin >= 60 degrees, and gain margin >= 6 dB, tightening to universal constants for the binomial family. Load-disturbance attenuation obeys IAEd = 1/Ki, making the cost of cancellation explicit, and comparisons with SIMC, the CHR zero-overshoot rule, and deadbeat-fitted explicit formulas quantify the trade: at matched maximum sensitivity the proposed design settles faster than SIMC on the third-order example, with markedly lower controller gains and peak control effort.
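
The simplest case in the abstract, a first-order plant with a PI controller, can be worked by hand and checked numerically. The sketch below is a textbook pole-zero cancellation, not a transcription of the paper's formulas: for the plant K/(tau*s + 1), choosing Kp = p*tau/K and Ki = p/K makes the controller zero cancel the plant pole, so the loop gain is p/s and the closed loop is p/(s + p), the binomial target with n = 1. The function names, the forward-Euler simulation, and the numeric values are assumptions.

```python
import math


def pi_gains_first_order(K, tau, p):
    """PI gains giving closed loop p/(s+p) for the plant K/(tau*s + 1)."""
    kp = p * tau / K  # places the controller zero at -1/tau (cancels the plant pole)
    ki = p / K        # sets the closed-loop pole at -p
    return kp, ki


def simulate_step(K, tau, kp, ki, dt, steps):
    """Forward-Euler simulation of the unit-step reference response."""
    y, integral, out = 0.0, 0.0, []
    for _ in range(steps):
        e = 1.0 - y                      # tracking error
        u = kp * e + ki * integral       # PI control law
        y += dt * (-y + K * u) / tau     # plant: tau * dy/dt = -y + K * u
        integral += dt * e
        out.append(y)
    return out


K, tau, p = 2.0, 0.5, 4.0
kp, ki = pi_gains_first_order(K, tau, p)
ys = simulate_step(K, tau, kp, ki, dt=0.00025, steps=10_000)

print(kp, ki)
print(all(b >= a for a, b in zip(ys, ys[1:])))  # monotonic response
print(ys[999], 1.0 - math.exp(-1.0))            # value near t = 1/p vs. 1 - e^-1
```

With these gains the relation IAEd = 1/Ki quoted in the abstract becomes K/p, which shows the trade-off directly: a faster design pole p buys better load-disturbance rejection.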

FOLD: Fuzzy Online Deduplication for Very Large Evolving Datasets via Approximate Nearest Neighbor Search

Nelson Bore, Pritish Mishra, Constantin Adam, Eyal de Lara, Oana Balmau — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.03001v2 Announce Type: replace Abstract: Fuzzy deduplication is key to constructing large language model training corpora. However, classic Locality-Sensitive Hashing (LSH) pipelines scale poorly as corpora grow and are ill-suited to continuous ingestion. The main issue is that each new document batch must be checked against the admitted corpus before insertion. As the corpus grows, the LSH buckets grow: each query can hit several large buckets and must scan the returned candidates. To solve this problem, we present RAD (Retrieval-Augmented Deduplication), an online fuzzy deduplication system that delivers both high recall and throughput for evolving datasets. RAD maintains an incrementally updated HNSW index over admitted documents, retrieving a small, high-quality candidate neighborhood for each incoming document instead of repeatedly re-scanning the accumulated corpus. RAD is the first online fuzzy deduplication system to use HNSW, leading to stable throughput as datasets grow. However, it is not easy to maintain high recall when using HNSW-style indexes. The core issue is the distance metric between graph nodes. Jaccard similarity, the metric used for fuzzy deduplication, yields low recall when applied out-of-the-box with an HNSW index. It leads to distance score crowding, making graph traversal unreliable within a bounded number of steps. RAD addresses this with a bitmap representation that provides a more discriminative, Jaccard-aligned signal during HNSW search. Across four LLM-scale datasets (LM1B, C4, RealNews, and Common Crawl), RAD preserves the scaling trajectory needed for online fuzzy deduplication: at 30M documents, it maintains 0.94-0.97 recall relative to state-of-the-art LSH solutions, and delivers up to an 8x throughput increase.
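
To make the metric in the abstract above concrete, the following sketch computes Jaccard similarity over word shingles and a bitmap-based approximation of it. The shingle size, the 1024-bit bitmap, the CRC32 hashing, and all function names are assumptions chosen for illustration; the abstract does not specify how the system builds its bitmaps, and this is not its implementation.

```python
import zlib

def shingles(text, k=3):
    """Set of k-word shingles; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def to_bitmap(items, n_bits=1024):
    """Hash each item to one bit of an n_bits-wide integer bitmap (assumed scheme)."""
    bits = 0
    for item in items:
        bits |= 1 << (zlib.crc32(item.encode("utf-8")) % n_bits)
    return bits

def bitmap_jaccard(x, y):
    """Jaccard estimate from bitmaps: popcount(x & y) / popcount(x | y)."""
    union = bin(x | y).count("1")
    return 1.0 if union == 0 else bin(x & y).count("1") / union

if __name__ == "__main__":
    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox jumps over the lazy cat")
    print(jaccard(a, b), bitmap_jaccard(to_bitmap(a), to_bitmap(b)))
```

The bitmap estimate equals the exact value only when no two shingles hash to the same bit, so it should be read as an approximation whose quality depends on the bitmap width.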

Can Factual Opinions Be Edited (Manipulated) in Large Language Models?

Yuanpu Cao, Ziyi Yin, Fenglong Ma, Jinghui Chen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.03096v2 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly integrated into various domains, making knowledge editing techniques crucial yet potentially hazardous. Current editing methods primarily target atomic facts, overlooking the significant risks associated with manipulating factual opinions, e.g., documented stances of public figures on societal issues. Such manipulation could reshape public images, influence elections, and alter societal views. To systematically assess this threat, we introduce the Factual Opinion Editing with Evidence (FOE) benchmark, which encompasses 261 public figures, 19 issue categories, and 2,178 complete opinion records. Our evaluations demonstrate that current editing techniques struggle significantly with factual opinions, often achieving only superficial changes while failing to preserve consistency between the edited opinion and the supporting evidence generated by the model. To address this limitation, we further propose a simple yet effective Self-Generated Evidence-Aligned method that achieves opinion-evidence alignment without relying on explicit instructions. Together, our benchmark and method provide a foundation for understanding the emerging security implications of factual opinion editing in LLMs.

Ollivier-Ricci curvature in cycle overlap mode

Zexian Zhou, Bo Jiao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.03317v3 Announce Type: replace Abstract: The Ollivier-Ricci curvature of an edge (x,y) is defined through the distance required to transport from the neighbors of x to the neighbors of y. It is a structural measure that has been studied in many fields, such as community detection and deep neural networks. However, high computational complexity or error limits its application to large scale-free graphs. This paper proposes an optimal transport principle that minimizes the distance using the 3-, 4-, and 5-cycles that include the edge (x,y), and designs a curvature calculation approach named Curvature in Cycle Overlap Mode (CCOM). In this approach, a greedy algorithm with pruning approximates the optimal transport principle. We verify theoretically and experimentally that CCOM significantly improves the accuracy of the curvature on real-world networks with low time consumption. In addition, we compare CCOM with baseline approximation approaches on community detection tasks using the same curvature-based framework, and experimentally confirm the effectiveness of CCOM on large scale-free graphs.

Intellectual Humility as a Cognitive Filter for AI-Generated Health Misinformation. An Evolutionary Perspective on Epistemic Vigilance

Marcin Rządeczka, Maciej Wodziński, Kacper Zacharski, Marcin Moskalewicz — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.03377v2 Announce Type: replace Abstract: We present experimental findings from a study (N=99) examining how intellectual humility (IH), i.e., the metacognitive awareness of epistemic limitations, affects the evaluation of AI-generated health dialogues varying in scientific rigor. Participants were randomly assigned to evaluate one of three dialogues about exercise and mental health: scientifically accurate, moderately pseudoscientific, or strongly pseudoscientific. Results reveal that IH functions as a selective cognitive filter. Individuals with higher humility scores rated pseudoscientific content as significantly less credible, while showing no correlation with credibility assessments of accurate content. Crucially, humility did not predict the ability to identify AI as the source of dialogues, suggesting that epistemic vigilance operates on content quality rather than source attribution. We interpret these findings through an evolutionary lens, proposing that IH represents an ancestral adaptation for navigating informationally uncertain environments. It remains effective at detecting exploitation attempts in AI-generated content, despite humans lacking evolved mechanisms for detecting AI sources. The study contributes to understanding how foundation models might improve or undermine human epistemic defenses, especially in health communication contexts.

Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

Dhanesh Ramachandram — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.04364v2 Announce Type: replace Abstract: Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
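
The idea of a Gaussian prior injected additively in log space into attention logits, as described in the abstract above, can be sketched as follows. This is a minimal illustration for a single part query over a small patch grid; the axis-aligned Gaussian, the grid size, and the function names are assumptions, and the model's actual parameterization may differ.

```python
import math

def gaussian_log_prior(grid_h, grid_w, mean, sigma):
    """Unnormalized log-density of an axis-aligned 2D Gaussian at each patch.

    mean and sigma are (row, col) pairs in patch units; the result is a flat
    list in row-major order, matching a flattened grid of patch tokens.
    """
    my, mx = mean
    sy, sx = sigma
    return [-0.5 * (((r - my) / sy) ** 2 + ((c - mx) / sx) ** 2)
            for r in range(grid_h) for c in range(grid_w)]

def softmax(logits):
    """Numerically stable softmax over a flat list."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attend_with_prior(logits, log_prior):
    """Add the log-prior to the content logits, then normalize."""
    return softmax([l + p for l, p in zip(logits, log_prior)])

if __name__ == "__main__":
    # With uninformative (all-zero) content logits, attention follows the prior.
    prior = gaussian_log_prior(4, 4, mean=(1.0, 2.0), sigma=(1.0, 1.0))
    attn = attend_with_prior([0.0] * 16, prior)
    print(max(range(16), key=attn.__getitem__))  # index of the peak patch
```

Because the prior is added before the softmax, it multiplies the content-based attention weights. Queries with identical content logits but different prior means therefore attend to different regions, which is one way to read the symmetry-breaking role described in the abstract.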

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

Ming-Hao Hsu, Xiaohai Tian, Jun Zhang, Zhizheng Wu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.04474v2 Announce Type: replace Abstract: Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this gap is not a uniform cognitive deficit. Evaluating two architecturally diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. Yet on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this as an entity binding failure: continuous speech features blur precise entity-property associations during implicit reasoning. To validate this diagnosis, we introduce Entity-Aware Chain-of-Thought (EA-CoT), a lightweight inference-time intervention forcing SLLMs to enumerate entities and bind them to claims before reasoning. EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4 percentage-point accuracy gain. Ablations confirm the gains stem from explicit semantic binding, reframing the gap as an elicitation failure rather than a missing capability.

GENEB: Why Genomic Models Are Hard to Compare

Daria Ledneva, Mikhail Nuridinov, Denis Kuznetsov — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.04525v3 Announce Type: replace Abstract: Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.

Parthenon Law: A Self-Evolving Legal-Agent Framework

Hejia Geng, Leo Liu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.04602v3 Announce Type: replace Abstract: As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products -- yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB -- 12,510 agent trajectories -- shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce Parthenon, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience -- as a firm refines its checklists and playbooks after each matter -- without touching model weights. Across our large-scale empirical analysis, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.

What Type of Inference is Active Inference?

Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.04935v2 Announce Type: replace Abstract: Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that full EFE-based planning outperforms ablations that omit either the planning correction or the epistemic corrections.

Agents' Last Exam

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.05405v2 Announce Type: replace Abstract: Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long horizon, economically valuable, real world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 sub fields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is below 1%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP relevant impact.

Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

Wenhao Mu, Facundo Yan, Anik Mumssen, Marisa Eisenberg, Alexander Rodríguez — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.05692v2 Announce Type: replace Abstract: Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on real-world observations without ground-truth counterfactuals or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports static and time-varying treatments, as well as both single-policy and multi-policy intervention settings, enabling evaluation of causal inference methods across a broad range of causal inference scenarios. Leveraging a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories across more than 150 U.S. counties. Using this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences and highlighting the challenges of realistic time-series causal reasoning.

GenAutoML: An Agentic Framework for Dynamic Architecture Generation and Optimization in Time-Series Analysis

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.05860v2 Announce Type: replace Abstract: Designing neural architectures for time-series forecasting and anomaly detection remains a resource-intensive task that often requires substantial domain expertise. Traditional Automated Machine Learning (AutoML) systems typically rely on static, predefined search spaces, limiting their ability to adapt to diverse data characteristics. We present GenAutoML, an agentic framework that leverages Large Language Models (LLMs) as neural architects to bridge natural-language requirements and executable PyTorch implementations. The framework incorporates a Sandboxed Reflection Loop for autonomous code refinement and a Signature-Aware Runtime that enforces architectural consistency and execution safety. To improve robustness under non-stationary conditions, we further introduce a Dynamic Reversible Instance Normalization (Dyn-RevIN) wrapper. Experiments on the ETTh1, ETTm1, and Weather benchmarks demonstrate that GenAutoML can dynamically generate task-specific neural architectures tailored to dataset characteristics. Among the generated models, WaveInterferenceNet achieves inference latency below 0.01 ms per sample while maintaining competitive predictive performance. By emphasizing computational efficiency, architectural adaptability, and stable optimization behavior, GenAutoML enables the creation of ultra-lightweight neural networks suitable for resource-constrained and latency-sensitive Edge AI deployments.

Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.06113v2 Announce Type: replace Abstract: Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.

Learning to Contest: Decentralized Robust Fairness in Cooperative MARL via Cross-Attention

Can Savcı — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.06162v2 Announce Type: replace Abstract: Fair cooperative multi-agent reinforcement learning (MARL) teams that maximize an egalitarian welfare are exploitable: a single self-interested agent free-rides on the surplus that fair agents forgo to raise the worst-off, and the known remedy is a centralized need-based allocator. We show that a decentralized defense becomes possible once contention is graded: when a contested resource still delivers a fraction $1-c$, a worst-off cooperator that contests a free-rider strictly improves on yielding, so leverage exists for every $c < 1$. We introduce CAN, a permutation-equivariant cross-attention policy over agents' observed behaviour that infers how many free-riders are present and responds proportionally -- turn-taking when none, contesting just enough when some. Trained against an adversarial league, CAN keeps best-response exploitability near the centralized oracle ($\rho \approx 1.2\text{--}1.5$ vs. $\rho = N$ unprotected) at essentially no efficiency cost, whereas the fair-MARL learners (GGF, FEN, SOTO) each collapse to an exploitable or wasteful extreme. Giving those objectives CAN's identical adversarial training does not rescue them, so the objective -- not adversarial training alone -- is what makes hardening possible. Against a committed (non-adaptive) defector, every learned defense including ours provides deterrence rather than immunity, weakening as the leverage $(1-c)/2$ vanishes. Across further environments and team sizes the same principle sets the scope: robustness holds exactly as far as the game's contest leverage reaches, and we map that boundary rather than claim to remove it.

Agentic Large Language Models for Automated Structural Analysis of 3D Frame Systems

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.06525v2 Announce Type: replace Abstract: Large language models (LLMs) have emerged as powerful foundation models with strong reasoning capabilities across domains. Beyond reactive text generation, agentic LLMs enable autonomous workflow execution through modular task decomposition and coordinated tool use. In structural engineering, recent efforts have developed agentic LLMs for automated analysis of plane frames. However, their extension to 3D frames remains underexplored due to challenges in irregular geometric representation, topological consistency, and long-horizon reasoning. This paper proposes an agentic LLM framework for automated structural analysis of 3D frames from natural language inputs. Irregular 3D frames are represented by projection onto a 2D plan, where orthogonal gridlines define spatial coordinates and a matrix of number of stories encodes vertical extrusion of each grid cell. Building on this representation, the framework establishes a multi-agent pipeline: a problem analysis agent parses input into structured JSON; a floor decomposition agent derives the spatial layout of each floor; the 3D geometry is assembled by node, girder, slab, and column agents; support and load agents assign boundary and loading conditions, and code translation agents generate executable SAP2000 script. Evaluated on ten representative 3D frames, the proposed framework achieves an average accuracy of 90% across repeated trials, demonstrating consistent and reliable performance.
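
The plan-plus-stories representation described in the abstract above can be sketched in a few lines. The example assumes that each grid cell is extruded from the base up to its story count and that nodes sit at cell corners on every level; the function name, argument layout, and uniform story height are hypothetical choices for illustration, not the framework's actual data model.

```python
def frame_nodes(x_lines, y_lines, stories, story_height):
    """Return the sorted set of (x, y, z) nodes implied by a plan grid.

    stories[j][i] is the number of stories over the cell bounded by
    x_lines[i]..x_lines[i+1] and y_lines[j]..y_lines[j+1].
    """
    nodes = set()
    for j, row in enumerate(stories):
        for i, n in enumerate(row):
            if n <= 0:
                continue  # empty cell: nothing is extruded here
            for x in (x_lines[i], x_lines[i + 1]):
                for y in (y_lines[j], y_lines[j + 1]):
                    for level in range(n + 1):  # level 0 is the base
                        nodes.add((x, y, level * story_height))
    return sorted(nodes)

if __name__ == "__main__":
    # Two cells side by side: a two-story cell next to a one-story cell.
    nodes = frame_nodes([0, 5, 10], [0, 4], [[2, 1]], story_height=3.0)
    print(len(nodes))
```

Storing nodes in a set merges the corners shared by adjacent cells, a simple way to keep the assembled geometry topologically consistent in this sketch.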

HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG

Mingyu Zhang, Ying Ma — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.07218v2 Announce Type: replace Abstract: Multi-hop RAG poses a data-engineering problem beyond passage matching: under fixed retrieval budgets, a system must organize retrieved text into evidence units that expose answer chains. Dense retrievers score passages independently, while graph-based memories make associations explicit but often rely on pairwise or entity-centered keys that fragment multi-hop evidence. We present HKVM-RAG, a key-value-separated evidence-organization layer. It assembles answer-path hyperedges from cached passage-level LLM evidence tuples and uses them as retrieval keys, while retaining passage text as answer values. To isolate key-space design, our fixed-substrate protocol holds the tuple cache, candidate passages, reader, and evaluation budget constant across pairwise graph and hypergraph variants. Weighted hypergraph key-value retrieval improves over KG-PPR by +3.426 F1 on 2WikiMultiHopQA and +3.592 F1 on MuSiQue; HotpotQA shows that higher structured support coverage need not yield standalone answer-F1 gains. We therefore study WHG-KV as an evidence-control signal rather than a dense-retrieval replacement. Oracle and train-to-dev analyses identify support selection as repairable, and a dense-aware controller combines frozen ColBERTv2 and HKVM rank/score features using out-of-fold HKVM predictions. It reaches 88.846, 65.073, and 85.810 F1 on the three benchmarks, improving over ColBERTv2 by +11.084, +6.763, and +5.966 F1. Source-level ablations show that matched non-WHG structured signals do not match the WHG-KV gains. These results provide bounded evidence that key-value-separated hypergraph organization can serve as a reusable evidence-control mechanism for multi-hop RAG.

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

Jinju Lee — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.07334v2 Announce Type: replace Abstract: This report treats chord-symbol sequences as an interpretable, controllable time series for genre-local harmonic modeling. The frozen Music Transformer base - released as a pop-jazz fine-tune endpoint but verified in this revision weight-identical to the pop-only Phase-0 baseline, so all gains are measured over a pure-pop prior (see Changes in v2) - is extended to eleven target genres: blues, bossa nova, Bach chorales, country, electronic, folk, funk, gospel, hip-hop, R&B/soul, and rock. The main evaluation compares LoRA, IA3, BitFit, prefix tuning, and full fine-tuning over 11 genres and 3 seeds, a complete 165-cell grid. All five methods improve over the frozen base on held-out chord prediction (macro gains +2.89 to +3.61 percentage points); LoRA and IA3 score highest, but pairwise Wilcoxon tests with Holm and Benjamini-Hochberg correction do not support a decisive winner. A matched-data-size control sharpens this: at a common corpus size IA3 stays on top while LoRA drops to last, so the small method gaps are partly data-driven rather than representational. A control-token baseline is also strong, and wrong-genre adapters often beat the frozen base, suggesting the adaptation effect is largely lightweight conditioning over a reusable harmonic base rather than genre-specific adapter memory. Further diagnostics (rank sweeps, wrong-genre rotation, a base-checkpoint ablation that v2 reinterprets as a same-weights control, chord-only genre classification, output-distribution statistics, real-song evaluation, duplicate analysis) support a bounded conclusion: chord-symbol adaptation reliably improves genre-local harmonic prediction, but chord symbols alone do not carry complete genre identity. Perceived genre authenticity and musical quality are left to controlled listener evaluation.

Combinatorial Landscape Analysis for Dominating Set and Vertex Coloring

Johanna Gasse, Antonia Heinen, Felix Knöfel, Timo Kötzing, Maxim Stanko — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.07361v2 Announce Type: replace Abstract: We analyze two combinatorial problems, Dominating Set and Vertex Coloring, with respect to the kinds of local optima that arise on various instances. For each problem and a variety of graph classes, we determine whether the induced landscapes are unimodal, plateau-unimodal (all optima form a single plateau), equimodal (all local optima are global), or truly multimodal. We do this for two different neighborhood operators, one based on making only a single change and one that also allows swaps (interchanging two parts of the solution).

Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan, Yi Yang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.07436v2 Announce Type: replace Abstract: This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 60% on VSI-Bench.

Tracing Stablecoin Contagion during the USDC Depeg after the Silicon Valley Bank Collapse

Krongtum Sankaewtong, Stefan Kitzler, Bernhard Haslhofer, Yuichi Ikeda — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.07442v2 Announce Type: replace Abstract: The March 2023 collapse of Silicon Valley Bank (SVB) disrupted the core premise of stablecoins, which are digital tokens designed to maintain a fixed value against the U.S. dollar and serve as on-chain substitutes for dollar liquidity. The event triggered a sharp depeg of USDC, creating a rare exogenous shock to the stablecoin ecosystem. While price deviations during this crisis are well documented, the underlying behavioral reorganization of on-chain activity remains less understood. Here, we analyze high-granularity transaction data to measure the shock's effects on network activities, volumes, and prices, reconstructing the contagion pathway from market-wide synchronization down to account-level reallocation. By extracting phase dynamics, we first show that transaction activity across major stablecoins became strongly synchronized during the crisis window, indicating a collective market-level response. We then uncover a bifurcated contagion pathway. While USDT, WBTC, and WETH reacted primarily as liquidity absorption channels with larger trade volumes, only USDC-related assets exhibited immediate price responses alongside surging transaction counts. This reflects the dominant role of USDC-related assets in this incident and their immediate behavioral connection to user panic, driving a mass reallocation from single-coin to multi-coin portfolios. Finally, governed by persistent intraday time-zone rhythms and balance-size heterogeneity, these findings provide a comprehensive empirical framework for understanding systemic risk and flight-to-quality mechanisms in fractional-reserve digital asset networks.

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

Jeremy Yang, Kate Zyskowski, Noah Yonack, Jerry Ma — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.07489v2 Announce Type: replace Abstract: Frontier AI systems are bridging the gap between intelligence and utility by shifting from conversational assistants to autonomous agents that execute tasks end to end. Using production data from Perplexity's Search and Computer products, we study this transition by examining how AI agents accelerate and reshape knowledge work. Three key empirical findings emerge. First, using sessions with near-identical initial query pairs as natural experiments for the same underlying task attempted with both products, Computer performs 26 minutes of autonomous work per user session, versus 33 seconds for Search. Computer automates task decomposition and execution that Search users might otherwise manually orchestrate and implement. As a result, Computer shifts follow-up query distribution toward higher-order work such as verification and extension. Autonomy also increases execution quality, with per-query dissatisfaction rates 55% lower on Computer than on Search. Second, due to its autonomy advantage, Computer reduces completion time from 269 to 36 minutes on matched tasks, lowering estimated time and cost by 87% and 94%, respectively, compared to humans equipped with Search alone. Third, Computer changes the scope of work that users attempt: Computer queries more often cross occupational boundaries, require higher-order cognition, draw on broader expertise, take the form of composite tasks that bundle interdependent subtasks into a single query, and unlock work activities that are essentially absent from Search usage among the same users. Together, the evidence indicates that AI agents accelerate workflows, enhance output quality, reduce costs, and expand the breadth and depth of automated work.

How reliable are LLMs when it comes to playing dice?

Luca Avena, Gianmarco Bet, Bernardo Busoni — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.07515v2 Announce Type: replace Abstract: We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, the reported findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical problems.

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

Yasushi Sakai, Allen Song, Kent Larson — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.08098v2 Announce Type: replace Abstract: Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. In this paper, we show a delegation-based aggregator (Propagational Proxy Voting, PPV; Sakai et al., 2025) yields an unsupervised consensus rule that beats majority on MMLU-Pro by +1.5 pp overall and +2.24 pp on the non-trivial subset (paired McNemar p ~ 1.0e-14, n = 8,099). Majority discards two signals that every sample carries: within-group letter entropy and between-group reasoning geometry. PPV exposes per-voter levers that consume exactly these two signals: When (how much weight a voter keeps on its own pick) and Whom (how it splits the remainder across peers). We drive When with letter entropy and Whom with per-question-centered embedding cosine. Our method needs no gold labels and no auxiliary training: per-question, we partition 128 sampled generations into 16 groups, compute each group's letter-level semantic entropy and reasoning embedding centroid, and feed both into a stochastic delegation matrix whose stationary distribution selects the consensus answer. We walk through an example in which PPV overturns a clear 10-6 majority for the wrong letter: the 10-voter majority cluster is geometrically incoherent (mean within-cluster cosine -0.02) while the 6-voter minority is tight (+0.26), so propagated delegation mass concentrates on the minority's answer even though entropy alone would keep the majority ahead. We further report delegation strategies with negative results that constrain the design space for unsupervised LLM aggregation. No within-question ensemble of confidence modes closes the oracle gap.
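The delegation idea in this abstract can be illustrated with a toy sketch. It assumes only what the abstract states: each voter keeps some weight on its own pick (When), splits the remainder across peers (Whom), and the stationary distribution of the resulting stochastic matrix selects the answer. The function names, the three-voter setup, the hand-set lever values, and the rule of summing stationary mass per answer are all hypothetical and are not taken from the paper, which derives the levers from letter entropy and embedding cosine over 16 groups.

```python
import numpy as np

def delegation_matrix(when, whom):
    """Row-stochastic matrix: voter i keeps when[i], delegates the rest via whom[i]."""
    when = np.asarray(when, dtype=float)
    whom = np.asarray(whom, dtype=float).copy()
    np.fill_diagonal(whom, 0.0)               # Whom only covers peers
    whom /= whom.sum(axis=1, keepdims=True)   # each row splits the remainder
    return np.diag(when) + (1.0 - when)[:, None] * whom

def stationary_distribution(P, iters=500):
    """Power iteration for the left stationary vector of a row-stochastic P."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
    return pi / pi.sum()

# Hypothetical toy case: two unsure voters pick "A", one confident voter picks "B".
answers = ["A", "A", "B"]
when = [0.2, 0.2, 0.9]                        # illustrative self-weights
whom = [[0, 1, 3], [1, 0, 3], [1, 1, 0]]      # illustrative peer preferences

P = delegation_matrix(when, whom)
pi = stationary_distribution(P)

# Assumed selection rule: sum stationary mass over voters sharing an answer.
scores = {}
for answer, mass in zip(answers, pi):
    scores[answer] = scores.get(answer, 0.0) + mass
winner = max(scores, key=scores.get)
print(scores, winner)
```

In this toy setup plain majority would return "A" (2 votes to 1), while the propagated mass settles on "B" because both "A" voters delegate most of their weight to the confident voter.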

CACR: Reinforcing Temporal Answer Grounding in Instructional Video via Candidate-Aware Causal Reasoning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.08436v2 Announce Type: replace Abstract: The task of temporal answer grounding in instructional video (TAGV), which aims to locate precise video segments that respond to natural language queries, is increasingly important for direct video answer retrieval. This task remains challenging due to the need to comprehend semantically complex questions and to address the significant length mismatch between untrimmed videos and short target moments. Existing methods often suffer from sensitivity to irrelevant content or insufficient visual reasoning capabilities. To tackle these limitations, we propose a Candidate-Aware Causal Reasoning (CACR) framework. Our approach first employs a Visual-Language Pre-training based Candidate Selection (VBCS) algorithm to efficiently generate K candidate segments, then applies a temporal logic reasoning module enhanced by a rejection reward mechanism and optimized via Group Relative Policy Optimization (GRPO) for robust inference. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art performance in terms of mean Intersection-over-Union (mIoU), providing a new perspective for reasoning-based retrieval in long videos.

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

Shengcheng Luo, Kefei Wu, Xiaoying Zhou, Wanlin Li, Ziyuan Jiao, Chenxi Xiao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.08765v2 Announce Type: replace Abstract: Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors. As a result, they are often data-inefficient and generalize poorly when visual observations are degraded. To address this limitation, we propose a framework that explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, we project tactile sensor locations directly onto the RGB image plane. We then render force-modulated Gaussian saliency maps to model spatial uncertainty arising from kinematic and calibration errors. By integrating these 2D spatial anchors through a zero-initialized conditioning architecture, our method injects physical contact priors into standard visual backbones while preserving pre-trained visual representations. We evaluate our method on six dexterous manipulation tasks in both simulation and the real world under severe visual occlusions. Real-world experiments show that explicit RGB-S grounding in the image domain improves occluded manipulation success rates by 26.7 percentage points over the strongest implicit visuo-tactile baseline, suggesting improved spatial reasoning and robustness to occlusion. Project page: touch-as-saliency.github.io
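The projection-and-render step described in this abstract can be sketched in a few lines. The sketch assumes the sensor position has already been expressed in the camera frame (that is, after forward kinematics and the extrinsic transform) and uses a standard pinhole model. The intrinsics, the pixel standard deviation, the force normalisation, and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def project_point(K, p_cam):
    """Pinhole projection of a 3D point (camera frame, metres) to pixel (u, v)."""
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

def saliency_map(shape, center_uv, force, sigma_px=8.0, force_max=10.0):
    """2D Gaussian centred on the projected contact, scaled by normalised force.

    sigma_px stands in for kinematic/calibration uncertainty; force_max is a
    hypothetical normalisation constant.
    """
    h, w = shape
    v, u = np.mgrid[0:h, 0:w]
    d2 = (u - center_uv[0]) ** 2 + (v - center_uv[1]) ** 2
    amplitude = np.clip(force / force_max, 0.0, 1.0)
    return amplitude * np.exp(-d2 / (2.0 * sigma_px ** 2))

# Hypothetical camera intrinsics (focal length 500 px, 640x480 image).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
p_cam = np.array([0.05, -0.02, 0.5])   # assumed sensor position, camera frame

uv = project_point(K, p_cam)
smap = saliency_map((480, 640), uv, force=5.0)
print(uv, smap.shape, smap.max())
```

The resulting map has the same height and width as the RGB frame, so it can be stacked as an extra channel or fed to a conditioning branch; how the paper's zero-initialized architecture consumes it is not shown here.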

A Unifying Lens on Reward Uncertainty in RLHF

Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.09073v2 Announce Type: replace Abstract: Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is pessimism: lowering rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a distributional reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
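As a rough illustration of the pessimistic branch, $\tilde r = -\beta\log\mathbb{E}_p[e^{-r/\beta}]$, the sketch below estimates the expectation from a handful of reward samples, such as the scores of an RM ensemble. The sample values and the function name are hypothetical. The two limits it prints are standard properties of this log-mean-exp form: the sample mean for large $\beta$ and the sample minimum for small $\beta$.

```python
import numpy as np

def pessimistic_reward(samples, beta):
    """Estimate -beta * log E_p[exp(-r / beta)] from reward samples."""
    r = np.asarray(samples, dtype=float)
    z = -r / beta
    z_max = z.max()                     # shift for numerical stability
    log_mean_exp = z_max + np.log(np.mean(np.exp(z - z_max)))
    return -beta * log_mean_exp

ensemble = [1.0, 2.0, 4.0]              # hypothetical RM scores for one (x, y)
for beta in (1e6, 1.0, 1e-3):
    print(f"beta={beta:g}: {pessimistic_reward(ensemble, beta):.4f}")
print("mean:", np.mean(ensemble), "min:", np.min(ensemble))
```

For intermediate $\beta$ the value lies between the minimum and the mean, so $\beta$ acts as a dial between worst-case and average aggregation over the ensemble.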

Chimera: Protocol-Aware Recovery for Confidential BFT Consensus

Tong Liu, Xiaoqing Wen, Ziwei Zhou, Si Liu, Jianyu Niu, Cong Wang, Yinqian Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.09101v2 Announce Type: replace Abstract: Trusted Execution Environments (TEEs) have enabled confidential Byzantine Fault-Tolerant (BFT) consensus systems with confidentiality and improved scalability. However, TEEs do not provide state continuity: during recovery, a compromised host can roll back a crashed enclave to a stale persistent state, significantly threatening both safety and availability. Existing defenses face a fundamental tradeoff: they either impose substantial overhead on critical consensus paths, reducing throughput and increasing latency, or incur prolonged recovery delays, hurting availability. We present the first systematic taxonomy of rollback-resilient recovery for confidential BFT consensus, distilling prior approaches into four categories. We further expose their inherent limitations. Guided by this detailed analysis, we design CHIMERA, a protocol-aware recovery framework that breaks this tradeoff. Our key insight is that rollback protection in consensus systems should not be uniform. Different types of persistent states differ fundamentally in their state distribution, update behavior, and representation form. CHIMERA separates persistent state into metadata and logs according to these protocol-level properties and applies distinct recovery mechanisms to each type. We formally model CHIMERA in Maude and verify its safety and liveness properties. We implement it on Braft and ZooKeeper using Intel TDX, and evaluate it in both LAN and WAN settings. Results show that CHIMERA achieves higher throughput, lower recovery latency, and better availability than state-of-the-art rollback-resilient baselines.

sketch-plot: Progressive Editing for Text-to-Image Academic Figures

Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Jiale Lao, Wei Chen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.09171v2 Announce Type: replace Abstract: Text to image (T2I) models such as gpt-image-2 can now generate publication grade academic figures from a short prompt, but the output is a flat raster: a user who wants to change one arrow, one label, or one icon has to regenerate the whole image, which also disturbs the parts they wanted to keep. We present sketch-plot, an interactive system that closes this controllability gap with a three layer progressive editing pipeline: a generated PNG, an addressable puzzle of editable pieces, and a per piece SVG. The user stops at the layer that gives them enough control for the change at hand, so the cost of decomposition and vectorisation is paid only on the pieces that need it. Realising this pipeline is not trivial. General segmentation models lack the semantic discriminability to decompose a research figure cleanly, and end to end image vectorisation produces incomplete shapes and loses semantic structure. We therefore route both stages through a human in the loop interface that lets the user accept, refine, or reject decomposition and vectorisation decisions on a piece by piece basis. We validate the design with an expert user study, in which participants found sketch-plot effective for making targeted edits to AI generated academic figures and preferred it over regenerating the whole image. A demonstration video is available at https://paper-plot.dev/sketch.

Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture

Yoojin Nam, Jinhoon Jeong, Namkug Kim — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.09500v3 Announce Type: replace Abstract: As autonomous research agents and AI co-scientist systems push large language models (LLMs) from drafting toward end-to-end manuscript production, the bottleneck shifts from generation to verification. Fluent LLM output can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items; existing tools generate without verifying, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture pairing generation with verification, resting on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism, a deterministic, re-executable check where one suffices and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills with a 21-detector deterministic tier, evaluated on three public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Across the three pipelines every content-hash manifest verified clean and the gates surfaced real defects; on 27 identical injected defects the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a single-prompt LLM reviewer detected 11, its misses in code, bibliography, and style defects the prose hides. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript: feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).

CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.09639v2 Announce Type: replace Abstract: The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.

MinhwaNet: Faithful but Insufficient Object Grounding in Korean Folk Painting

Joonhyung Bae — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.09855v2 Announce Type: replace Abstract: Korean folk painting (minhwa) is built from a small vocabulary of auspicious symbols, a tiger for protection, a pair of birds for marital harmony, a peony for wealth, that recur across many of its painted genres. This suggests an obvious computational approach, identify which symbols appear in a painting and read the genre from the inventory. Working with a public corpus that pairs whole paintings, eight-field bilingual curatorial captions, and a separate set of expert object crops, we find that this approach does not work. A model given only a list of which symbols a painting contains predicts the genre far worse than a model that fuses the image with the curatorial text, and forcing the genre representation to be object-grounded actively hurts accuracy. The visual evidence on which the genre prediction rests is nonetheless localized and inspectable. A leakage-safe object evidence map projected from a part-level detector is spatially faithful to where curators isolated symbolic objects and to a patch-based surrogate's own gradient saliency. We name this configuration a faithful-but-insufficient dissociation. The part-level explanation is honest about what the part-level model sees, yet the genre target turns on how symbols are arranged rather than on which ones appear. The same lens separates a content label that survives transfer to held-out source institutions, genre, from a style label that does not, era, a prediction we confirm on two further labels in the corpus. We release the multimodal system, a worked-example reading of one painting's evidence map against its catalogue, and a set of evaluation cautions that recur in long-tailed heritage collections.

Using Seismic Statistical Features and VQ-VAE to Improve Spatiotemporal Seismicity Predictability

Wei Quan, Denise Gorse — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.10069v2 Announce Type: replace Abstract: In this paper we build upon a previous study in which we demonstrated, using XGBoost and earthquake catalogue data from Japan and Chile, that a set of 60 seismic statistical features (SSFs) had much greater predictive value than a set of 428 generic time series features from the tsfresh package. We here extend this previous work in two key ways, focusing on data from Japan as a large dataset is necessary in order to allow for the training of a deep learning (autoencoder) model. First, we move from whole-region prediction (considering, for each candidate event, the likelihood of an event M $\geq$ 5.0 anywhere in the region in the next 15 days) to localised predictions in which both the region of feature computation and the region of prediction are restricted to a circle of radius 24 km around the candidate event, and we show that performance remains excellent, similar to our previous whole-region study for the same area. Second, we here couple this proven set of SSFs, based on one-dimensional (catalogue) data, with a novel feature based on two-dimensional seismic maps, obtained by training a VQ-VAE model to reproduce such maps as output and identifying a measure of its error in doing so with a localised build-up of crustal stress. We show that while localised prediction based on SSFs can be effective alone, with test AUC values as high as those obtained in the case of Japan in our previous whole-region study, the inclusion of the new natively-spatial VQ-VAE-derived feature, top-ranked by SHAP analysis, can enhance performance and additionally appears to near-wholly replace the traditionally-computed $b$-value in terms of feature usage.

An Improved Generative Adversarial Network for Micro-Resistivity Imaging Logging Restoration

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.10200v2 Announce Type: replace Abstract: This paper presents an improved GAN-based restoration method for micro-resistivity imaging logging images with partially missing regions. The method uses an FCN as the backbone of the generative network and adds three components: a depthwise-separable convolutional residual block to learn and retain more effective pixel and semantic information; an Inception module to enlarge the multi-scale receptive field of the network and reduce its number of parameters; and a multi-scale feature extraction module with a spatial attention residual block, which combines the channel attention mechanism with the residual block to achieve multi-scale feature extraction. A global discriminative network and a local discriminative network are designed to gradually improve the coherence of content and semantic structure between the restored parts and the whole image by competing against the generative network. In the experiments, the average structural similarity over the five sets of imaging logging images with different sizes of missing regions in the test set is 0.903, an improvement of about 0.3 over other similar methods. The results show that the method can restore micro-resistivity imaging log images with improved semantic structural coherence and texture detail, providing a new deep learning method that supports the subsequent interpretation of micro-resistivity imaging log images.

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

Sanghee Park, Geewook Kim, Kee-Eung Kim — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.10403v2 Announce Type: replace Abstract: Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.10616v2 Announce Type: replace Abstract: Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their finite context windows, making memory retention a fundamental resource-allocation problem. Existing memory systems improve management through heuristic scoring, retrieval optimization, or learned compression, but largely treat retention as a local decision problem and do not explicitly model its long-term consequences under realistic observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization problem with explicit budget feasibility, evidence utility, and delayed costs including miss penalties, reacquisition delays, and stale-information risk. We then propose OSL-MR (Observability-Safe Learning for Memory Retention), a novel framework that enforces a strict separation between online-observable features and offline-available supervision (OAS). OSL-MR combines an evidence learner trained from realized evidence supervision with a Mixed-Score heuristic that serves both as a deployable online-safe baseline and as a structured inductive prior for learning. The resulting policy learns query-conditioned evidence value directly from interaction data while remaining deployable under the same observability constraints. Experiments on LOCOMO and LongMemEval show that OSL-MR consistently outperforms recency-based methods, Generative Agents-style scoring, and other heuristic baselines, particularly under tight memory budgets. The Mixed-Score prior further improves precision while preserving recall, and sensitivity analysis demonstrates robustness across a wide range of cost configurations.

PhysMetrics.Weather: An Evaluation Framework for Physical Consistency in ML Weather Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.10642v2 Announce Type: replace Abstract: Machine learning weather prediction (MLWP) models have achieved impressive forecasting performance at a small fraction of the computational costs required for traditional physics-based methods. However, they are primarily (1) data-driven and (2) evaluated using pixel-wide error metrics (e.g., RMSE), so there are no guarantees that their forecasts are consistent with known physical laws. We introduce PhysMetrics.Weather, an evaluation framework that assesses the physical realism of MLWP models across three types of metrics: conservation, spectral, and dynamical. By quantifying physical realism, this tool guides the development of physics-informed architectures and helps evaluate whether MLWP models are reliable for operational use. Our framework is available on GitHub at https://github.com/Emmakast/PhysMetrics.Weather.

One Step Closer to Ground Truth: A Multi-Scale Residual-Aware Representation Learning Pipeline for Predicting Time Series Data

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.10678v2 Announce Type: replace Abstract: Transformer-based models have emerged as leading paradigms in time-series forecasting in recent years, employing self-attention mechanisms to capture long-range dependencies. Despite their success, these single-stage forecasting architectures exhibit persistent systematic residual biases arising from structural discrepancies, unmodeled stochastic components, or inadequate multi-scale temporal representations. This limitation persists when residuals are treated as irreducible noise, precluding adaptive correction of structured error patterns. To address this limitation, we introduce a two-stage, model-agnostic framework that explicitly decouples forecasting and residual learning into distinct stages of representation learning. A base transformer first generates the initial predictions. Subsequently, a dedicated meta-corrector dynamically models structured error patterns across multivariate channels, preserves cross-variable dependencies, and iteratively refines the residual bias of the base transformer. By formalizing this pipeline as a hypothesis space expansion, our framework addresses approximation limitations inherent in single-stage architectures, removes reliance on restrictive assumptions, and enables end-to-end learning of complex error dynamics. Evaluated on eight popular benchmark datasets using established protocols, our approach achieves state-of-the-art performance, with significant improvements in standard metrics (MSE, MAE). The results demonstrate the framework's ability to mitigate systematic biases and enhance robustness to complex temporal dynamics, advancing the practical applicability of transformer-based forecasting models.

UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.10683v2 Announce Type: replace Abstract: Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
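The error reductions quoted in this abstract follow directly from the before/after values it reports. A short check, using only the numbers stated in the abstract (the helper name is illustrative):

```python
def relative_reduction(before, after):
    """Percentage reduction when an error drops from `before` to `after`."""
    return 100.0 * (before - after) / before

mpjae = relative_reduction(15.63, 0.16)   # degrees, UniHM -> UniDexTok
mpjpe = relative_reduction(18.51, 0.18)   # millimetres, UniHM -> UniDexTok

print(f"MPJAE reduction: {mpjae:.2f}%")   # 98.98%
print(f"MPJPE reduction: {mpjpe:.2f}%")   # 99.03%
```

The position error moves from 18.51 mm (above one centimetre) to 0.18 mm (below one millimetre), which matches the centimeter-scale to sub-millimeter wording.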

Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

Roberto Martínez-Cruz, Alvaro J. López-López, José Portela — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.10716v2 Announce Type: replace Abstract: Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate rich contextualized representations. However, long-document KPE remains challenging because salient keyphrase evidence may be scattered across distant document sections that cannot be jointly captured within the limited context window of most PLMs. Although long-context large language models (LLMs) can process broader textual contexts, their computational cost limits their practicality for efficient and high-throughput KPE. To overcome this limitation, we propose an attention expansion mechanism that augments PLM token representations with information from surrounding out-of-context chunks using pre-trained word embeddings. The proposed mechanism expands the effective contextual scope of PLM-based KPE models without requiring full-document attention or expensive LLM-based inference. We evaluate our approach across five PLM backbones, including general-purpose, scientific, task-specific, and long-context encoders, using two training regimes and five benchmark corpora from scientific and news domains. Experimental results demonstrate that attention expansion consistently enhances KPE performance across all evaluation settings, outperforming state-of-the-art models and yielding notable improvements in F1 score. The improvements extend to domain-specific, task-specialized, and native long-context models, showing that the proposed mechanism provides complementary information rather than merely compensating for limited input length. These results establish attention expansion as an efficient and effective strategy for long-document KPE.

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

Naihao Deng, Yilun Zhu, Naichen Shi, Clayton Scott, Rada Mihalcea — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.10931v2 Announce Type: replace Abstract: Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11042v3 Announce Type: replace Abstract: Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11092v3 Announce Type: replace Abstract: Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold, and optimization progressively shifts towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo achieves a 48.6% lower free-kick shot error and 2.96x the shooting velocity of prior-work baselines. In the real world, on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in the free-kick and moving-ball cases, respectively. The post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: https://opendrivelab.com/RoboNaldo.

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

Ilay Kamai, Hugues Van Assel, Aviv Regev, Hagai B. Perets, Randall Balestriero — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11190v2 Announce Type: replace Abstract: Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each fails, and when cross-modal training helps at all -- a gap that leaves practitioners, especially in scientific domains like biomedicine or astrophysics, with heterogeneous instruments and multiple levels of organization and measurement, unable to diagnose why standard methods underperform the best single modality. We develop a unified linear framework that addresses both questions. Under a spiked signal-plus-noise model with structured cross-modal nuisance correlation, we derive separation ratios for both objectives that expose complementary failure modes: alignment whitens each modality and fails when nuisance is strongly correlated across views; prediction encodes whatever is cross-predictable through a one-sided whitening, with recovery governed by source-modality quality. The resulting phase diagram partitions multimodal problems into four regimes: Both, CA only, CP only, and Neither. We present a data-driven procedure to locate real-world datasets in this diagram using a small labeled subsample, identifying the preferred objective and prediction direction before any cross-modal training. Experiments on synthetic data, stereo-vision benchmarks, image-caption pairs, and real astrophysical data validate the predictions in the nonlinear regime, including the Neither regime where cross-modal training is actively harmful. Our framework lets practitioners diagnose their multimodal problem and choose the right objective before committing to training. Code to reproduce the results is available at https://github.com/IlayMalinyak/mm_align_vs_pred.

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Taha Bouhsine — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11255v2 Announce Type: replace Abstract: Bernstein--Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel: nonstationary kernels falling between the shift-invariant and dot-product templates random features exploit, so neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and samples the radial factor's one-dimensional Bernstein--Widder scale before applying Gaussian random Fourier features, giving feature dimension $Dm$, free of the $O(d^2)$ size of the exact modulation feature. With the modulation kept exact (the $m\to\infty$ limit), we prove unbiasedness, an exact variance, and a matrix-Bernstein operator-norm bound controlled by the top kernel and modulation eigenvalues and an intrinsic dimension rather than the crude $N\max_{ij}$ route. Whitening this argument at the ridge makes the effective dimension $d_{\mathrm{eff}}(\lambda)$ the \emph{exact} intrinsic dimension of the matrix variance, so $O((1+\|P\|_{\mathrm{op}}/\lambda)\log(d_{\mathrm{eff}}/\delta))$ radial draws preserve the kernel-ridge solution; tilting the draw by a closed-form whitened leverage improves this to the effective-dimension count $O((1+d_{\mathrm{eff}})\log(d_{\mathrm{eff}}/\delta))$. Conditioning on the sketch carries every guarantee to the deployed doubly-randomized estimator up to one additive sketch term, and all hold for the whole class with the modulation Gram in place of the polynomial one. The flagship instance is the biased $yat$-kernel $k_{yat,b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$.
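
As a small illustration of the closing claim, the sketch below evaluates the biased yat-kernel exactly as written in the abstract and checks numerically that a central second difference in $b$ leaves $2h^2/(\|w-x\|^2+\varepsilon)$, a purely radial kernel of inverse-quadratic form. The function names, the step `h`, and the test vectors are assumptions chosen for illustration, and this is presumably (not certainly) the finite-difference argument the abstract alludes to.

```python
import numpy as np

def yat_kernel(w, x, b=0.0, eps=1e-3):
    """Biased yat-kernel (w.x + b)^2 / (||w - x||^2 + eps), as in the abstract."""
    w, x = np.asarray(w, dtype=float), np.asarray(x, dtype=float)
    return (w @ x + b) ** 2 / (np.sum((w - x) ** 2) + eps)

def second_difference_in_b(w, x, b=0.0, h=1.0, eps=1e-3):
    """Central second difference of the kernel in the bias b (hypothetical helper)."""
    return (yat_kernel(w, x, b + h, eps)
            - 2.0 * yat_kernel(w, x, b, eps)
            + yat_kernel(w, x, b - h, eps))

# Arbitrary test vectors (illustrative only).
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.5, 0.25, -0.5])
h, eps = 1.0, 1e-3

# (s+b+h)^2 - 2(s+b)^2 + (s+b-h)^2 = 2h^2 for any s = w.x and any b,
# so only the radial denominator survives.
radial = 2.0 * h ** 2 / (np.sum((w - x) ** 2) + eps)
diff = second_difference_in_b(w, x, b=0.3, h=h, eps=eps)
print(diff, radial)
```

The numerator is a quadratic polynomial in $b$, so its second difference is a constant, which is why the dot-product factor drops out.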

The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

Kazuki Nakayashiki, Keisuke Watanabe — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11654v2 Announce Type: replace Abstract: A social highlighter's most useful signal -- which passages a crowd of readers marks -- exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies -- it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.
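
To make the precision@3 comparison concrete, here is a minimal sketch of how precision@k could be computed for a lead (position) baseline and for a scoring model on a single document. The labels and model scores are made-up toy values, not data from the paper, and `precision_at_k` is a hypothetical helper name.

```python
def precision_at_k(scores, labels, k=3):
    """Fraction of the k highest-scored sentences that readers highlighted."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top) / k

# Toy document: 1 marks a sentence highlighted by the crowd (invented values).
labels = [0, 1, 0, 0, 1, 1, 0]

# Lead baseline: earlier sentences score higher.
lead_scores = [-position for position in range(len(labels))]

# Hypothetical model scores for the same sentences.
model_scores = [0.10, 0.80, 0.20, 0.05, 0.90, 0.70, 0.30]

lead_p3 = precision_at_k(lead_scores, labels)
model_p3 = precision_at_k(model_scores, labels)
print(lead_p3, model_p3)
```

Averaging this quantity over documents gives the kind of corpus-level precision@3 figure the abstract reports.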

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

Sangmin Lee, Eekgyun Ahn, Woongjib Choi, Hong-Goo Kang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11681v2 Announce Type: replace Abstract: We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.

Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning

Shengcheng Luo, Xiyan Huang, Zhe Xu, Wanlin Li, Ziyuan Jiao, Chenxi Xiao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11767v2 Announce Type: replace Abstract: Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page: Dex-Blind-Grasp.github.io.

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11792v2 Announce Type: replace Abstract: Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.

Scalable Deep Learning Framework for Global High-Resolution Land Use Reconstruction

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11793v2 Announce Type: replace Abstract: Uncertainty in the terrestrial carbon cycle remains a major constraint in climate projections, partly driven by the uncertainties affecting the land surface representation and variability in Earth system models. To address this limitation, we present AI4Land, a data-driven framework for generating high-resolution historical reconstructions and future projections of key land surface variables. The framework follows a two-phase approach using a U-Net architecture. In the first phase, which is the focus of this work, it reconstructs annual land use and land cover by integrating coarse-resolution scenario data with static geophysical features. In a planned second phase, the resulting high-resolution maps will be used to predict dynamic biophysical variables, particularly leaf area index, at finer temporal scales. Trained on Earth observation data, the models learn to reproduce spatially explicit and physically consistent land surface patterns, extending temporal coverage to periods lacking direct observations. AI4Land was developed and trained on MareNostrum5, demonstrating how GPU-accelerated HPC infrastructure enables global-scale climate AI pipelines. The final product is a suite of open-source emulators designed for real-time coupling with digital twin platforms, such as those developed under the Destination Earth initiative. By delivering realistic and evolving land surface conditions on demand, this work aims to reduce critical uncertainties and improve the predictive power of next-generation climate simulations.

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11836v2 Announce Type: replace Abstract: This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via k-means. Finer-grained, mixed-sparsity pruning that varies the number of parameter clusters per layer is also explored. Experiments conducted on the LibriSpeech dataset suggest that at a pruning sparsity of 50% on HuBERT-large, consistent WER reductions of 27.73%/18.61% absolute (34.37%/21.91% relative) over magnitude-based pruning were obtained on the test-clean and test-other subsets before fine-tuning, and 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning with only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) were observed against magnitude-based pruning on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
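
The abstract does not spell out the clustering procedure, so the following is only a rough sketch of one way channelwise k-means could compress a weight matrix without data or training: each output channel (row) is treated as a point, rows are clustered, and every row is replaced by its cluster centroid. The row-as-channel convention, the plain Lloyd's iteration, the random toy weights, and all function names are assumptions for illustration, not the authors' method.

```python
import numpy as np

def assign(points, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each point."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign(points, centroids)

def cluster_channels(weight, k):
    """Replace each row (channel) of `weight` by the centroid of its cluster."""
    centroids, labels = kmeans(weight, k)
    return centroids[labels], labels

# Toy 16x8 layer with random weights (illustrative only).
rng = np.random.default_rng(0)
weight = rng.normal(size=(16, 8))
compressed, labels = cluster_channels(weight, k=4)
n_distinct = len(np.unique(compressed, axis=0))
print(compressed.shape, n_distinct)
```

After clustering, the layer can be stored as at most `k` distinct rows plus one label per channel; varying `k` per layer would correspond to the mixed setting the abstract mentions.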

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

Yuto Furutani, Takashi Otonari, Kaede Shiohara, Toshihiko Yamasaki — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11894v2 Announce Type: replace Abstract: Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.

GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Li Yang, Wentao Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11898v2 Announce Type: replace Abstract: Research on Text-Attributed Graphs (TAGs) has gained significant attention recently due to its broad applications across various real-world data scenarios, such as citation networks, e-commerce platforms, social media, and web pages. Inspired by the remarkable semantic understanding ability of Large Language Models (LLMs), there have been numerous attempts to integrate LLMs into TAGs. However, existing methods still struggle to generalize across diverse graphs and tasks, and their ability to capture transferable graph structural patterns remains limited. To address this, we introduce GraspLLM, a framework that combines Graph structural comprehension with the semantic understanding prowess of LLMs to enhance cross-dataset and cross-task generalizability. Specifically, we represent node texts from different graphs in a unified semantic space with a frozen general embedding model, on top of which we perform motif-aware contrastive learning across multiple motif-induced adjacency matrices to extract dataset-agnostic structural information. Then, with our proposed optimal contextual subgraph, we extract the most contextually relevant subgraph for each target node and align these subgraphs to the token space of the LLM via an alignment projector. Extensive experiments on TAG benchmark datasets spanning diverse domains reveal that GraspLLM consistently outperforms previous LLM-based methods for TAGs, especially in zero-shot scenarios, highlighting its strong generalizability across different datasets and tasks. Our code is available at https://github.com/Heinz217/GraspLLM.

Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability

Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11930v2 Announce Type: replace Abstract: Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging problem in AI-assisted interview assessment because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.
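
The quoted 19.1% figure can be reproduced from the MSE values in the abstract. The short sketch below computes the relative reduction over the official baseline for each ablation step; `relative_reduction` is a hypothetical helper name, and only the four MSE values come from the abstract.

```python
def relative_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`."""
    return (baseline - value) / baseline

baseline_mse = 0.3334  # official Track 1 baseline, from the abstract
ablation_mse = {
    "global model": 0.3189,
    "per-trait modeling": 0.2871,
    "per-trait late fusion": 0.2696,
}

reductions = {
    name: round(100 * relative_reduction(baseline_mse, mse), 1)
    for name, mse in ablation_mse.items()
}
for name, pct in reductions.items():
    print(f"{name}: {pct}% below the official baseline")
```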

Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

Quankai Wang, Yulin Xie, Tongfei Yang, Minghui Cheng, Ran Cao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12025v2 Announce Type: replace Abstract: Finite element (FE) modeling of safety-critical infrastructure such as bridge barriers requires high-fidelity nonlinear dynamic analysis, yet the current FE modeling process remains labor-intensive and lacks automation. This paper presents the Human-Enhanced Loop Modeling (HELM) framework, a collaborative human-agent protocol that decomposes long-sequence finite element modeling into discrete, visually verifiable checkpoints across geometry generation, boundary condition definition, and material assignment. The framework is demonstrated through a 20-case matrix of reinforced concrete bridge barriers under MASH TL-4 and TL-5 lateral loading conditions, interfacing specialized agents with two widely used commercial FE software packages, ANSYS and LS-PrePost. Experimental results show that HELM improves the baseline autonomous modeling success rate from 20% to 75%, with agent-level pass rates for geometry and boundary condition tasks approximately doubling. Error analysis reveals that spatial reasoning and algebraic logic limitations constitute the primary failure modes, underscoring the value of structured human-in-the-loop intervention for modeling automation. The complete agent design code and prompts are open-sourced and can be accessed at: https://github.com/SimAgentDev/Ansys-LSPP-AgentKit.

A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

Wanting Wang, Xiye Ma, Yuyang He, Minghui Cheng, Ran Cao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12040v2 Announce Type: replace Abstract: The design of reinforced concrete highway barriers is a safety-critical process that requires strict compliance with regulatory provisions such as the AASHTO-LRFD bridge design guidelines. Current engineering practice relies heavily on manual, iterative, and heuristic calculations to satisfy complex nonlinear material and mechanics constraints. Although Large Language Models (LLMs) demonstrate strong generative capabilities, their direct application to structural engineering remains limited by hallucination risks and insufficient physical grounding. To address these challenges, this study proposes a novel "generation-evaluation-optimization" closed-loop framework for automated concrete barrier design using the multi-agent orchestration capabilities of AutoGen. Experimental results demonstrate that the proposed agentic framework achieves over 98% design accuracy, significantly outperforming standalone general-purpose LLMs. More importantly, the study reveals that design performance is not necessarily correlated with model scale, where an 8B-parameter lightweight model could outperform unconstrained 631B-parameter flagship models. This finding highlights the potential to substantially reduce computational costs while improving the accessibility of AI-assisted engineering tools for industry applications. The source code for the proposed multi-agent design framework is available at the project GitHub repository: https://github.com/MXY820/barrier-design. Keywords: Structural Engineering; Multi-Agent Systems; Large Language Models; Concrete Barrier Design; AutoGen; Design Automation.

A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs

Ao Sun — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12160v2 Announce Type: replace Abstract: Decoding-time truthfulness methods -- layer-contrast decoding, inference-time intervention, and learned logit adapters -- have demonstrated 10-30 point gains on TruthfulQA when applied to base language models. However, modern instruction-tuned LLMs already achieve substantially higher baselines (61-76%), raising the question of whether these methods remain effective in practice. We design a six-control evaluation framework -- out-of-distribution training, multi-judge validation, simple decoding baselines, confound controls, bootstrap confidence intervals, and seed variance -- and apply it across 5 models (1B-70B), 3 benchmarks, and 15 methods. We find that previously reported gains shrink substantially under strict controls: on the full TruthfulQA benchmark (N=817), no token-level method achieves statistically significant improvement, and the best learned adapter scores 2.0 points below greedy (p=.23). We identify five evaluation sensitivities -- contamination, judge choice, missing baselines, confounds, and statistical noise -- that individually or jointly account for these discrepancies. Cross-benchmark validation on HaluEval QA and TriviaQA confirms that these patterns extend beyond TruthfulQA. Deliberative prompting methods (chain-of-thought, self-critique) appear more robust in the evaluated regime, with CoT achieving +5.6-19pp across benchmarks as a training-free, single-pass method. We release a seven-point evaluation checklist and discuss implications for future truthfulness research.
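
One of the listed controls is a bootstrap confidence interval. As a rough sketch of what such a check could look like, the code below runs a paired percentile bootstrap on the per-question accuracy difference between a method and greedy decoding. The correctness vectors are synthetic random data (only N=817 is taken from the abstract), and the function name, resample count, and seed are assumptions; the paper's exact procedure may differ.

```python
import numpy as np

def paired_bootstrap_ci(method_correct, baseline_correct,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-question accuracy difference."""
    diffs = (np.asarray(method_correct, dtype=float)
             - np.asarray(baseline_correct, dtype=float))
    rng = np.random.default_rng(seed)
    # Resample question indices with replacement, keeping method/baseline pairs together.
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return diffs.mean(), lo, hi

# Synthetic 0/1 correctness for 817 questions (not the paper's results).
rng = np.random.default_rng(1)
greedy = rng.random(817) < 0.70
method = rng.random(817) < 0.68

point, lo, hi = paired_bootstrap_ci(method, greedy)
print(f"difference = {100 * point:+.1f} pp, 95% CI [{100 * lo:+.1f}, {100 * hi:+.1f}]")
```

If the interval contains zero, the observed gap is not distinguishable from resampling noise at the chosen level, which is the kind of conclusion the abstract draws for token-level methods.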

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12236v2 Announce Type: replace Abstract: Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed--accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.

VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12263v2 Announce Type: replace Abstract: While Latent Diffusion Models (LDMs) have revolutionized visual synthesis, they are increasingly exploited for unauthorized mimicry of individuals. Existing defenses inject deceptive perturbations to steer the generated images toward irrelevant targets. However, this approach hinges on an ungrounded assumption: subtle perturbations can maintain their deceptive efficacy throughout an LDM's extensive generation process. In reality, the model's innate restoration mechanism will remove such perturbations and cause individual identities to re-emerge in the images generated. We propose VOID, a defense framework that overcomes this conundrum by manipulating an LDM's intrinsic stochasticity. VOID perturbs the diffusion pipeline in two novel ways: 1) amplifying the latent encoding errors to shatter an image's semantic structure, and 2) counteracting the target guidance signals to suppress the model's restoration capabilities. This results in a semantic corruption that thwarts any unauthorized mimicry. Notably, the security gain does not come at the price of visual utility, as VOID simultaneously manages to confine perturbations to human-imperceptible regions of protected images. Our comprehensive evaluation of 24 state-of-the-art defenses against 10 mimicry attacks on 5 datasets demonstrates VOID's unprecedented protection power: it increases the average Frechet Inception Distance (FID) from 113 to 365, a 223% improvement over the strongest defense to date.

DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images

Pengfei Wang, Shihao Wang, Liyi Chen, Zhiyuan Ma, Guowen Zhang, Lei Zhang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.12368v2 Announce Type: replace Abstract: While monocular depth estimation has achieved significant progress, achieving generalized metric depth estimation for both narrow field-of-view (FoV) perspectives and $360^\circ$ panoramas remains an unsolved challenge. Existing methods are often tailored to specific camera types and struggle to produce accurate metric depth that generalizes across diverse settings. This limitation stems from two key challenges: the inherent geometric discrepancy between perspective and panoramic cameras, and the scarcity of panoramic training data with metric annotations. In this work, we introduce DepthMaster, a unified metric depth estimation framework. Rather than employing specialized networks to learn spherical distortions, we reformulate the problem by decomposing panoramic images into overlapping perspective patches. Crucially, distinct from prior projection-based methods that rely on ad-hoc architectural modifications to handle boundaries, we introduce a novel Correspondence Consistency Loss (CCL) and inject virtual projection cameras as geometric priors, allowing us to seamlessly stitch the patches while avoiding specialized operators and keeping the backbone largely compatible with standard Transformer designs. This strategy also resolves the geometric differences by unifying all inputs into a canonical perspective representation, and effectively circumvents data scarcity by directly unlocking powerful metric priors from vast perspective datasets. Trained on a mixed dataset that contains only one panorama dataset, DepthMaster achieves state-of-the-art zero-shot performance on 13 diverse datasets, outperforming not only universal methods but also leading specialist models in both perspective and panoramic domains.

QoS Improvement in Multi User Cellular-Symbiotic Radio Network Assisted by Active-STAR-RIS

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2401.08301v2 Announce Type: replace-cross Abstract: In this article, we employ active simultaneously transmitting and reflecting reconfigurable intelligent surfaces (ASRIS) to enhance the quality of 6G cellular network services. The network integrates commensal symbiotic radio (CSR) subsystems to facilitate communication between passive Internet of Things (IoT) users and active users, referred to as symbiotic backscatter devices (SBDs) and symbiotic user equipments (SUEs), respectively. Since the SBDs are passive, transmitting information to the SUEs poses significant challenges. To overcome this challenge, we harness the capabilities of massive multiple input multiple output (MIMO) antennas within the base station (BS) to relay the information transmitted by SBDs with greater power. This scheme uses the non-orthogonal multiple access (NOMA) technique for multiple access among all users, and potential interferences are eliminated using successive interference cancellation (SIC). The primary objective is to maximize the throughput between SBDs and SUEs. To achieve this, we formulate an optimization problem involving variables such as active beamforming coefficients at the BS and ASRIS, phase adjustments of ASRIS, and scheduling parameters between CSR and cellular networks. To solve this optimization problem, we used three deep reinforcement learning (DRL) methods: proximal policy optimization (PPO), twin delayed deep deterministic policy gradient (TD3), and asynchronous advantage actor critic (A3C). These methods were simulated, and the results demonstrate that A3C, TD3, and PPO have the best convergence speeds and achieve the highest increases in network throughput, respectively. Finally, the proposed scheme was evaluated using passive simultaneously transmitting and reflecting RIS (STAR-RIS), which demonstrated poorer performance compared to ASRIS.

Plug-and-Play image restoration with Stochastic deNOising REgularization

Marien Renaud, Jean Prost, Arthur Leclaire, Nicolas Papadakis — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2402.01779v4 Announce Type: replace-cross Abstract: Plug-and-Play (PnP) algorithms are a class of iterative algorithms that address image inverse problems by combining a physical model and a deep neural network for regularization. Even if they produce impressive image restoration results, these algorithms rely on a non-standard use of a denoiser on images that are less and less noisy along the iterations, which contrasts with recent algorithms based on Diffusion Models (DM), where the denoiser is applied only on re-noised images. We propose a new PnP framework, called Stochastic deNOising REgularization (SNORE), which applies the denoiser only on images with noise of the adequate level. It is based on an explicit stochastic regularization, which leads to a stochastic gradient descent algorithm to solve ill-posed inverse problems. A convergence analysis of this algorithm and its annealing extension is provided. Experimentally, we show that SNORE is competitive with respect to state-of-the-art methods on deblurring and inpainting tasks, both quantitatively and qualitatively.

Density of group languages in shift spaces

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2403.17892v3 Announce Type: replace-cross Abstract: The density of a rational language can be understood as the frequency of some pattern in the shift space, for example a pattern like "words with an even number of a given letter." We study the density of group languages, i.e. rational languages recognized by morphisms onto finite groups, inside shift spaces. We show that the density with respect to any given ergodic measure on a shift space exists for every group language, because it can be computed by using any ergodic lift of the given measure to a skew product between the shift space and the recognizing group. We then further study densities in shifts of finite type (with a suitable notion of irreducibility), and then in minimal shifts. In the latter case, we obtain a closed formula for the density under the condition that the aforementioned skew product has minimal closed invariant subsets which are ergodic under the product of the original measure and the uniform probability measure on the group. The formula is derived in part from a characterization of minimal closed invariant subsets for skew products between shifts and finite groups relying on notions of cocycles and coboundaries. In the case where the whole skew product is ergodic under the product measure, then the density is just the cardinality of the subset of the group which defines the language divided by the cardinality of the group. Moreover, we provide sufficient conditions for the skew product to have minimal closed invariant subsets that are ergodic under the product measure. Finally, we investigate the link between minimal closed invariant subsets, return words and bifix codes.

The DNA of Calabi-Yau Hypersurfaces

Nate MacFadden, Andreas Schachner, Elijah Sheridan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2405.08871v3 Announce Type: replace-cross Abstract: We implement Genetic Algorithms for triangulations of four-dimensional reflexive polytopes which induce Calabi-Yau threefold hypersurfaces via Batyrev's construction. We demonstrate that such algorithms efficiently optimize physical observables such as axion decay constants or axion-photon couplings in string theory compactifications. For our implementation, we choose a parameterization of triangulations that yields homotopy inequivalent Calabi-Yau threefolds by extending fine, regular triangulations of two-faces, thereby eliminating exponentially large redundancy factors in the map from polytope triangulations to Calabi-Yau hypersurfaces. In particular, we discuss how this encoding renders the entire Kreuzer-Skarke list amenable to a variety of optimization strategies, including but not limited to Genetic Algorithms. To achieve optimal performance, we tune the hyperparameters of our Genetic Algorithm using Bayesian optimization. We find that our implementation vastly outperforms other sampling and optimization strategies like Markov Chain Monte Carlo or Simulated Annealing. Finally, we showcase that our Genetic Algorithm efficiently performs optimization even for the maximal polytope with Hodge numbers $h^{1,1} = 491$, where we use it to maximize axion-photon couplings. Our methods for sampling and optimization are implemented in a Python package cyopt.

Causal Inference with Generative Artificial Intelligence: Application to Texts as Treatments

Kosuke Imai, Kentaro Nakamura — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2410.00903v5 Announce Type: replace-cross Abstract: In this paper, we demonstrate how to enhance the validity of causal inference with unstructured high-dimensional treatments like texts, by leveraging the power of generative Artificial Intelligence (GenAI). Specifically, we propose to use a deep generative model such as large language models (LLMs) to efficiently generate treatments and use their internal representation for subsequent causal effect estimation. We show that the knowledge of this true internal representation helps disentangle the treatment features of interest, such as specific sentiments and certain topics, from other possibly unknown confounding features. Unlike existing methods, the proposed GenAI-Powered Inference (GPI) methodology eliminates the need to learn causal representation from the data, and hence produces more accurate and efficient estimates. We formally establish the conditions required for the nonparametric identification of the average treatment effect, propose an estimation strategy that avoids the violation of the overlap assumption, and derive the asymptotic properties of the proposed estimator through the application of double machine learning. Finally, using an instrumental variables approach, we extend the proposed GPI methodology to the settings in which the treatment feature is based on human perception. The GPI is also applicable to text reuse where an LLM is used to regenerate existing texts. We conduct simulation and empirical studies, using the generated text data from an open-source LLM, Llama 3, to illustrate the advantages of our estimator over state-of-the-art causal representation learning algorithms.

Covering half-grids with lines and planes

Anurag Bishnoi, Shantanu Nene — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2501.11156v3 Announce Type: replace-cross Abstract: We study hyperplane covering problems for finite grid-like structures in $\mathbb{R}^d$. We call a set $\mathcal{C}$ of points in $\mathbb{R}^2$ a conical grid if the line $y = a_i$ intersects $\mathcal{C}$ in exactly $i$ points, for some $a_1 > \cdots > a_n \in \mathbb{R}$. We prove that the number of lines required to cover every point of such a grid at least $k$ times is at least $nk\left(1-\frac{1}{e}-O(\frac{1}{n}) \right)$. If the grid $\mathcal{C}$ is obtained by cutting an $m \times n$ grid of points in half along one of the diagonals, then we prove the lower bound of $mk\left(1-e^{-\frac{n}{m}}-O(\frac{n}{m^2})\right)$. In general, we call a grid obtained by cutting a grid in $\mathbb{R}^d$ along one of the diagonals a half-grid. Motivated by the Alon-F\"uredi theorem on hyperplane coverings of grids that miss a point and its multiplicity variations, we study the problem of finding the minimum number of hyperplanes required to cover every point of an $n \times \cdots \times n$ half-grid in $\mathbb{R}^d$ at least $k$ times while missing a point $P$. For almost all such half-grids, with $P$ being the corner point, we prove asymptotically sharp upper and lower bounds for the covering number in dimensions $2$ and $3$. For $k = 1$, $d = 2$, and an arbitrary $P$, we determine this number exactly by using the polynomial method bound for grids.

Central Limit Theorems for Stochastic Gradient Descent Quantile Estimators

Ziyang Wei, Jiaqi Li, Likai Chen, Wei Biao Wu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2503.02178v3 Announce Type: replace-cross Abstract: This paper develops asymptotic theory for quantile estimation via stochastic gradient descent (SGD) with a constant learning rate. The quantile loss function is neither smooth nor strongly convex. Beyond conventional perspectives and techniques, we view quantile SGD iteration as an irreducible, periodic, and positive recurrent Markov chain, which cyclically converges to its unique stationary distribution regardless of the arbitrarily fixed initialization. To derive the exact form of the stationary distribution, we analyze the structure of its characteristic function by exploiting the stationary equation. We also derive tight bounds for its moment generating function (MGF) and tail probabilities. Synthesizing the aforementioned approaches, we prove that the centered and standardized stationary distribution converges to a Gaussian distribution as the learning rate $\eta\rightarrow0$. This finding provides the first central limit theorem (CLT)-type theoretical guarantees for the quantile SGD estimator with constant learning rates. We further propose a recursive algorithm to construct confidence intervals of the estimators with statistical guarantees. Numerical studies demonstrate the effective finite-sample performance of the online estimator and inference procedure. The theoretical tools developed in this study are of independent interest for investigating general SGD algorithms formulated as Markov chains, particularly in non-strongly convex and non-smooth settings.
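
The recursion studied in the abstract above can be sketched as follows. This is a minimal illustration assuming the standard subgradient step for the quantile (check) loss, q <- q + eta * (tau - 1{x <= q}), with a constant learning rate eta. The function name `quantile_sgd`, the Gaussian test data, and the tail-averaging used to summarize the path are illustrative assumptions, not the authors' estimator or inference procedure.

```python
import random


def quantile_sgd(samples, tau, eta, q0=0.0):
    """Run constant-step SGD on the check loss and return the iterate path.

    Hypothetical helper for illustration only.
    """
    q = q0
    path = []
    for x in samples:
        # Subgradient step: move up by eta*tau if x > q, down by eta*(1-tau) otherwise.
        q += eta * (tau - (1.0 if x <= q else 0.0))
        path.append(q)
    return path


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Estimate the median (tau = 0.5) of a standard Gaussian; the true value is 0.
path = quantile_sgd(samples, tau=0.5, eta=0.01)

# With a constant step the iterates keep fluctuating around the quantile,
# so we summarize the second half of the path by its average (an assumption).
tail = path[len(path) // 2:]
estimate = sum(tail) / len(tail)
print(estimate)
```

Because every step has size eta * tau or eta * (1 - tau), the iterates move on a lattice determined by the starting point, which is consistent with the Markov-chain view described in the abstract.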

Sharp Detection Threshold for Correlation among Multiple Unlabeled Gaussian Networks

Taha Ameen, Bruce Hajek — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2504.16279v2 Announce Type: replace-cross Abstract: This paper studies the hypothesis testing problem of deciding whether $m \geq 2$ complete weighted graphs with Gaussian edge weights are mutually correlated after unknown relabelings of their vertices. Under the null model all edge weights are independent standard Gaussians, whereas under the planted model the graphs share a latent vertex alignment and each pair of corresponding edge weights has correlation $\rho$. For fixed $m$, we identify the sharp information-theoretic threshold for detection. Above the threshold, a generalized likelihood-ratio test achieves strong detection, whereas even weak detection is impossible below the threshold. The result extends the two-graph detection threshold of Wu, Xu, and Yu to any fixed number of graphs, exhibits a side-information regime in which two graphs alone are insufficient but multiple graphs enable detection, and, together with the recovery threshold of Vassaux and Massouli\'e, shows that this Gaussian multi-graph model has no detection--recovery gap.

Adaptive generative moment matching networks for improved learning of dependence structures

Marius Hofert, Gan Yao — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2508.21531v2 Announce Type: replace-cross Abstract: An adaptive bandwidth selection procedure for the mixture kernel in the maximum mean discrepancy (MMD) for fitting generative moment matching networks (GMMNs) is introduced, and improved learning of copula random number generators is demonstrated. Based on the relative error of the training loss, the number of kernels is increased during training; additionally, the relative error of the validation loss is used as an early stopping criterion. While training time remains similar, adaptively training GMMNs (AGMMNs) significantly increases training performance, which is shown based on validation MMD trajectories, samples and validation MMD values. Superiority of AGMMNs over GMMNs and parametric copula models is also demonstrated in terms of three applications. First, convergence rates of estimators based on quasi-random versus pseudo-random samples from copulas are investigated in dimensions as large as 100 for the first time. Second, replicated validation MMDs, as well as Monte Carlo and quasi-Monte Carlo applications demonstrate the improved training of AGMMNs for a copula model implied by the 50 constituents of the S&P 500 index after deGARCHing. Last, both the latter dataset and 50 constituents of the FTSE 100 are used to demonstrate that the improved training of AGMMNs indeed translates to an improved model prediction.
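
For readers unfamiliar with the loss mentioned above, the following is a minimal sketch of a squared MMD estimate with a mixture of Gaussian kernels. It uses the biased (V-statistic) form, and the bandwidth list is a hypothetical choice. The adaptive rule for adding kernels during training is not reproduced here, since the abstract does not specify it.

```python
import math


def mixture_kernel(x, y, bandwidths):
    """Sum of Gaussian kernels exp(-||x - y||^2 / (2 s^2)) over the bandwidths s."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(math.exp(-sq_dist / (2.0 * s * s)) for s in bandwidths)


def mmd2(xs, ys, bandwidths):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""

    def mean_k(a, b):
        total = sum(mixture_kernel(u, v, bandwidths) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


# Hypothetical bandwidths and toy two-dimensional samples.
bandwidths = [0.5, 1.0, 2.0]
xs = [(0.0, 0.0), (1.0, 0.0)]
ys = [(3.0, 3.0), (4.0, 3.0)]

print(mmd2(xs, xs, bandwidths))  # identical samples
print(mmd2(xs, ys, bandwidths))  # well-separated samples
```

Adding a kernel to the mixture corresponds to appending a bandwidth to the list, which is the quantity an adaptive scheme would grow during training.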

Efficient Solvers for SLOPE in R, Python, Julia, and C++

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2511.02430v3 Announce Type: replace-cross Abstract: We present a suite of packages in R, Python, Julia, and C++ that efficiently solve the Sorted L-One Penalized Estimation (SLOPE) problem. The packages feature a highly efficient hybrid coordinate descent algorithm that fits generalized linear models (GLMs) and supports a variety of loss functions, including Gaussian, binomial, Poisson, and multinomial logistic regression. Our implementation is designed to be fast, memory-efficient, and flexible. The packages support a variety of data structures (dense, sparse, and out-of-memory matrices) and are designed to efficiently fit the full SLOPE path as well as handle cross-validation of SLOPE models, including the relaxed SLOPE. We present examples of how to use the packages and benchmarks that demonstrate the performance of the packages on both real and simulated data and show that our packages outperform existing implementations of SLOPE in terms of speed.
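
As background for the abstract above, the sorted L-one penalty pairs the largest coefficient magnitude with the largest penalty weight, the second largest with the second weight, and so on. The sketch below computes only that penalty value. The function name `sorted_l1_penalty` is hypothetical and is not part of the packages' API, and no solver is implemented.

```python
def sorted_l1_penalty(beta, lam):
    """Return sum_i lam[i] * |beta|_(i), where |beta|_(1) >= |beta|_(2) >= ...

    lam must be non-negative and non-increasing.
    """
    if len(beta) != len(lam):
        raise ValueError("beta and lam must have the same length")
    if any(a < b for a, b in zip(lam, lam[1:])) or (lam and lam[-1] < 0):
        raise ValueError("lam must be non-increasing and non-negative")
    magnitudes = sorted((abs(b) for b in beta), reverse=True)
    return sum(l * m for l, m in zip(lam, magnitudes))


beta = [0.5, -2.0, 1.0]

# Decreasing weights: 3*2.0 + 2*1.0 + 1*0.5
print(sorted_l1_penalty(beta, [3.0, 2.0, 1.0]))

# Constant weights reduce the penalty to an ordinary (scaled) L1 norm.
print(sorted_l1_penalty(beta, [1.0, 1.0, 1.0]))
```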

Semitopological Barycentric Algebras

Jean Goubault-Larrecq — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.12865v5 Announce Type: replace-cross Abstract: Barycentric algebras are an abstraction of the notion of convex sets, defined by a set of equations. We study semitopological and topological barycentric algebras, in the spirit of a previous study by Klaus Keimel on semitopological and topological cones (2008), which are special cases of semitopological and topological barycentric algebras. For example, the space of all continuous valuations (a very close cousin of measures) over a topological space is a topological cone, while probability valuations form a topological barycentric algebra, and subprobability valuations form a pointed topological barycentric algebra. Among other results, we show the existence of free semitopological cones over semitopological barycentric algebras and over pointed semitopological algebras, we investigate which semitopological barycentric algebras embed into semitopological cones and which pointed semitopological barycentric algebras embed strictly into semitopological cones. We study notions of local convexity, which split into weak local convexity, local convexity, local affineness and local linearity. We show that the weakly locally convex topological barycentric algebras are exactly the affine retracts of locally affine topological barycentric algebras. On locally convex barycentric algebras, we show sandwich theorems, extending theorems by Roth and Keimel on cones. A running theme of this paper is the notion of barycenters, which we progressively generalize until we reach a general notion of barycenters of continuous (resp., subprobability, probability) valuations, inspired by a definition of Choquet. We conclude with a general barycenter existence theorem, whose proof relies on the study of the Smyth poweralgebra, namely the topological barycentric algebra of all non-empty convex compact saturated subsets of a topological barycentric algebra.

PhononBench: A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation

Xiao-Qi Han, Ze-Feng Gao, Wen-Kao Li, Peng-Jie Guo, Zhong-Yi Lu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.21227v3 Announce Type: replace-cross Abstract: In recent years, generative artificial intelligence has made significant advances in the design of crystalline materials, giving rise to approaches based on graph neural networks, diffusion models, and large language models. Existing evaluations commonly follow the stability-uniqueness-novelty (S.U.N.) framework, where stability is primarily assessed using thermodynamic criteria, which do not fully capture the dynamical stability essential for a material's practical existence. Dynamical stability is a key determinant of whether a material can be synthesized and persist, with phonon spectrum calculations serving as the standard for its evaluation. However, the high computational cost of such calculations has prevented large-scale assessment of dynamical stability in generated crystals. In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves density-functional-theory (DFT)-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient phonon calculations and dynamical-stability analysis for 133,838 crystal structures generated by 7 leading crystal generation models. PhononBench reveals a widespread limitation of current generative models: at a phonon-frequency threshold of -0.1 THz (used for all reported dynamical-stability metrics unless otherwise specified), the average dynamical-stability rate across all generated structures is only 32.15%, and the top-performing model, MatterGen, reaches just 45.05%. We also identify 32,995 crystal structures that are phonon-stable across the entire Brillouin zone under a strict threshold of -0.001 THz. In addition, a web-based service is accessible at http://phononbench.cn/, enabling minute-level ultra-fast phonon predictions.

From geometry to dynamics: Learning overdamped Langevin dynamics from sparse observations with geometric constraints

Dimitra Maoutsa — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2512.23566v2 Announce Type: replace-cross Abstract: How can we learn the laws underlying the dynamics of stochastic systems when their trajectories are sampled sparsely in time? Existing methods either require temporally resolved high-frequency observations, or rely on geometric arguments that apply only to conservative systems, limiting the range of dynamics they can recover. Here, we present a new framework that reconciles these two perspectives by reformulating inference as a stochastic control problem. Our method uses geometry-driven path augmentation, guided by the geometry in the system's invariant density to reconstruct likely trajectories and infer the underlying dynamics without assuming specific parametric models. Applied to overdamped Langevin systems, our approach accurately recovers stochastic dynamics even from extremely undersampled data, outperforming existing methods in synthetic benchmarks. This work demonstrates the effectiveness of incorporating geometric inductive biases into stochastic system identification methods.

The Replicator-Optimization Mechanism: A Scale-Relative Formalism for Persistence-Conditioned Dynamics with Application to Consent-Based Metaethics

Murad Farzulla — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.06363v2 Announce Type: replace-cross Abstract: This paper formalizes a widely used dynamical class--replicator-mutator dynamics and Price-style selection-and-transmission--and makes explicit the modeling choices (scale, atomic unit, interaction topology, transmission kernel) that determine how this class instantiates across domains. The backbone is known; we do not claim to have discovered selection. The novel contributions are threefold: (i) a scale-relative kernel parameterization where atomic units are themselves parameters, enabling systematic instantiation across physics, biology, economics, cognition, and social organization; (ii) a consent-friction instantiation for political philosophy, where friction is the primitive, legitimacy functions as survival probability, and belief-transfer functions as mutation kernel; and (iii) a derivation path from social contract theory rather than from biology or physics, arriving at the same formal structure via an independent route. We provide a bridge principle connecting descriptive dynamics to instrumental normativity: if agents prefer lower expected friction, then "ought" claims are shorthand for policies that reduce expected friction under the specified dynamics. This conditional structure avoids the is-ought fallacy while grounding normative discourse in empirically tractable dynamics. We address pathological cases (authoritarian stability, suppressed friction) through explicit modeling of latent versus observed friction. The framework generates testable predictions through operationalization of friction, legitimacy, and belief-transfer dynamics, and is falsifiable at the level of measurement apparatus rather than formal structure.

The table maker's quantum search

Benjamin C. A. Morrison, Stefanos Kourtis — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.13306v2 Announce Type: replace-cross Abstract: We show that quantum search can be used to compute the hardness to round an elementary function, that is, to determine the minimum working precision required to compute the values of an elementary function correctly rounded to a target precision of $n$ digits for all possible precision-$n$ floating-point inputs in a given interval. For elementary functions $f$ related to the exponential function, quantum search takes time $\tilde O(2^{n/2} \log (1/\delta))$ to return, with probability $1-\delta$, the hardness to round $f$ over all $n$-bit floating-point inputs in a given binade. For periodic elementary functions in large binades, standalone quantum search yields an asymptotic speedup over the best known classical algorithms and heuristics. We then estimate the resources required for a fault-tolerant implementation of the proposed algorithm for the $\sin$ and $\cos$ functions in double precision. We find that, although the algorithm can in principle compete with the fastest known practical method for computing the hardness to round over all binades in the format, it requires qubit coherence times that are unrealistically long for present technology.

Bulk-Calibrated Credal Ambiguity Sets: Fast, Tractable Decision Making under Out-of-Sample Contamination

Mengqi Chen, Thomas B. Berrett, Theodoros Damoulas, Michele Caprio — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.21324v2 Announce Type: replace-cross Abstract: Distributionally robust optimisation (DRO) minimises the worst-case expected loss over an ambiguity set that can capture distributional shifts in out-of-sample environments. While Huber (linear-vacuous) contamination is a classical minimal-assumption model for an $\varepsilon$-fraction of arbitrary perturbations, including it in an ambiguity set can make the worst-case risk infinite and the DRO objective vacuous unless one imposes strong boundedness or support assumptions. We address these challenges by introducing bulk-calibrated credal ambiguity sets: we learn a high-mass bulk set from data while considering contamination inside the bulk and bounding the remaining tail contribution separately. This leads to a closed-form, finite $\mathrm{mean}+\sup$ robust objective and tractable linear or second-order cone programs for common losses and bulk geometries. Through this framework, we highlight and exploit the equivalence between the imprecise probability (IP) notion of upper expectation and the worst-case risk, demonstrating how IP credal sets translate into DRO objectives with interpretable tolerance levels. Experiments on heavy-tailed inventory control, geographically shifted house-price regression, and demographically shifted text classification show competitive robustness-accuracy trade-offs and efficient optimisation times, using Bayesian, frequentist, or empirical reference distributions.
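
To make the "mean + sup" structure concrete, here is a minimal sketch of the classical worst-case expected loss under epsilon-contamination when the loss is bounded on the set where contamination may live: (1 - eps) * mean + eps * sup. This is the textbook linear-vacuous upper expectation, not the paper's bulk-calibrated objective, which additionally bounds a tail contribution. The function name and the numbers are hypothetical.

```python
def contaminated_worst_case_risk(losses, eps, sup_loss):
    """Upper expectation of the loss over {(1 - eps) * P + eps * Q}.

    P is the empirical distribution of `losses`; Q is arbitrary on a set
    where the loss is at most `sup_loss`.
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    if not losses:
        raise ValueError("losses must be non-empty")
    mean_loss = sum(losses) / len(losses)
    # The adversary puts all contaminating mass where the loss is largest.
    return (1.0 - eps) * mean_loss + eps * sup_loss


# Toy numbers: empirical losses with mean 2, a 10% contamination level,
# and a loss bounded by 10 on the contamination set.
risk = contaminated_worst_case_risk([1.0, 2.0, 3.0], eps=0.1, sup_loss=10.0)
print(risk)
```

If `sup_loss` is infinite, the value is infinite for any eps > 0, which is the vacuity issue the abstract refers to.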

Efficient Stochastic Optimisation via Sequential Monte Carlo

James Cuin, Davide Carbone, Yanbo Tang, O. Deniz Akyildiz — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2601.22003v2 Announce Type: replace-cross Abstract: The problem of optimising functions with intractable gradients frequently arises in machine learning and statistics, ranging from maximum marginal likelihood estimation procedures to fine-tuning of generative models. Stochastic approximation methods for this class of problems typically require inner sampling loops to obtain (biased) stochastic gradient estimates, which rapidly becomes computationally expensive. In this work, we develop sequential Monte Carlo (SMC) samplers for optimisation of functions with intractable gradients. Our approach replaces expensive inner sampling methods with efficient SMC approximations, which can result in significant computational gains. We establish convergence results for the basic recursions defined by our methodology which SMC samplers approximate. We demonstrate the effectiveness of our approach on the reward-tuning of energy-based models within various settings.

Thermodynamic assessment of machine learning models for solid-state synthesis prediction

Jane Schlesinger, Simon Hjaltason, Nathan J. Szymanski, Christopher J. Bartel — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.04075v2 Announce Type: replace-cross Abstract: Machine learning models have recently emerged to predict whether hypothetical solid-state materials can be synthesized. These models aim to circumvent direct first-principles modeling of solid-state phase transformations, instead learning from large databases of successfully synthesized materials. Here, we assess the alignment of several recently introduced synthesis prediction models with material and reaction thermodynamics, quantified by the energy with respect to the convex hull and a metric accounting for thermodynamic selectivity of enumerated synthesis reactions. A dataset of successful synthesis recipes was used to determine the likely bounds on both quantities beyond which materials can be deemed unlikely to be synthesized. With these bounds as context, thermodynamic quantities were computed using the CHGNet foundation potential for thousands of new hypothetical materials generated using the Chemeleon generative model. Four recently published machine learning models for synthesizability prediction were applied to this same dataset, and the resultant predictions were considered against computed thermodynamics. We find these models generally overpredict the likelihood of synthesis, but some model scores do trend with thermodynamic heuristics, assigning lower scores to materials that are less stable or do not have an available synthesis recipe that is calculated to be thermodynamically selective. In total, this work identifies existing gaps in machine learning models for materials synthesis and introduces a new approach to assess their quality in the absence of extensive negative examples (failed syntheses).

TokaMark: A Comprehensive Benchmark for MAST Tokamak Plasma Models

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2602.10132v3 Announce Type: replace-cross Abstract: Development and operation of commercially viable fusion energy reactors such as tokamaks require accurate predictions of plasma dynamics from sparse, noisy, and incomplete sensor readings. The complexity of the underlying physics and the heterogeneity of experimental data pose formidable challenges for conventional numerical methods, and highlight the promise of modern data-native approaches. A major obstacle in realizing this potential is, however, the lack of curated, openly available datasets and standardized benchmarks. Existing fusion datasets are scarce, fragmented across institutions, facility-specific, and inconsistently annotated, which limits reproducibility and prevents a fair and scalable comparison of AI approaches. In this paper, we introduce TokaMark, a structured benchmark to evaluate AI models on real experimental data collected from the Mega Ampere Spherical Tokamak (MAST). TokaMark provides a comprehensive suite of tools designed to unify access to multi-modal fusion data and standardize evaluation protocols. The benchmark includes a curated list of 14 tasks spanning a range of physical mechanisms, exploiting a variety of diagnostics and covering multiple operational use cases. A baseline model is provided to facilitate transparent comparison and validation within a unified framework. By establishing a unified benchmark, TokaMark aims to accelerate progress in data-driven AI-based plasma modeling, contributing to the broader goal of achieving sustainable and stable fusion energy. The dataset, benchmark, documentation, and tooling are open-sourced under https://github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamark_baseline.

Contextual Invertible World Models: A Neuro-Symbolic Agentic Framework for Colorectal Cancer Drug Response

Christopher Baker, Tianyu Ren, Karen Rafferty, Hui Wang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.02274v3 Announce Type: replace-cross Abstract: Precision oncology is currently limited by the small-N, large-P paradox, where high-dimensional genomic data is abundant but pharmacological response samples are sparse. While deep learning achieves predictive accuracy, it frequently fails to provide the mechanistic clarity required for clinical adoption. We present the Contextual Invertible World Model (CIWM), a Neuro-Symbolic Agentic Framework that bridges this gap by integrating a quantitative machine learning emulator with a Large Language Model reasoning layer. Utilising a stringently curated, high-fidelity data engineering pipeline on the Sanger GDSC dataset ($ N=83 $), we isolate true biological signals from in vitro artifacts to establish a rigorous baseline predictive correlation for complex transcriptomics ($ r=0.268 $). Through Inverse Reasoning, we perform in silico CRISPR perturbations across the colorectal landscape. The framework autonomously overturns classical mechanistic assumptions, identifying a hierarchical dominance of mutant KRAS over the APC/Wnt-axis in driving 5-fluorouracil resistance ($ \Delta=-0.0469 $) via a "KRAS Shield" mapped to MAPK/PI3K networks. Furthermore, the agentic layer identified a "PIK3CA Paradox", revealing that repairing PIK3CA inadvertently increases chemoresistance ($ \Delta=+0.0085 $) by triggering a compensatory feedback loop that hyperactivates the dominant MAPK survival pathway.

Stability properties of Minimal Gated Unit neural networks

Stefano De Carli, Davide Previtali, Mirko Mazzoleni, Fabio Previdi — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.03017v2 Announce Type: replace-cross Abstract: In this work, we address the need for efficient and formally stable Recurrent Neural Networks (RNNs) in environments with limited computational resources by analyzing the stability of the Minimal Gated Unit (MGU) network, a lightweight alternative to common gated RNNs used in system identification. We derive sufficient parametric conditions for the MGU network's input-to-state stability and incremental input-to-state stability properties. These conditions enable a-posteriori validation of model stability and form the basis for novel stability-promoting training methodologies, including a warm-start of the network's parameters and a projected gradient-based optimization scheme, both of which are presented in this work. Comparative evaluation, including robustness analysis and validation on synthetic and real-world data (i.e., the Silverbox benchmark), demonstrates that the minimal gated unit network successfully combines formal stability guarantees with superior parameter efficiency and faster inference times compared to other state-of-the-art recurrent neural networks, while maintaining comparable and satisfactory accuracy. Notably, the results attained on the Silverbox benchmark illustrate that the stable MGU network effectively captures the system dynamics, whereas other stable RNNs fail to converge to a reliable model.

A Unified Latent Space Disentanglement VAE Framework with Robust Disentanglement Effectiveness Evaluation

Xiaoan Lang, Md Mostafizer Rahman, Fang Liu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.11242v2 Announce Type: replace-cross Abstract: Evaluating and interpreting latent representations, such as those learned by variational autoencoders (VAEs), remains a significant challenge for diverse data types, especially when ground-truth generative factors are unknown. To address this, we unify several state-of-the-art disentangled VAE approaches for latent space disentanglement into one framework -- bfVAE. To assess the effectiveness of a disentangled VAE model and enhance latent space interpretability, we propose Feature Variance Heterogeneity via Latent Traversal (FVH-LT) and Dirty Block Sparse Regression in Latent Space (DBSR-LS). To ensure robust interpretability of the learned latent space, we develop a greedy alignment strategy (GAS) that mitigates label switching and aligns latent dimensions across runs, laying the foundation for result aggregation. We also introduce a convenient scalar latent space separation index (LSSI) based on the GAS-aligned outputs of FVH-LT and DBSR-LS to summarize the overall latent structural separation without knowledge of the ground-truth generative factors. We compare bfVAE to five VAE models and validate the effectiveness of FVH-LT, DBSR-LS, and LSSI on seven tabular and image datasets. Under our examined experimental settings, bfVAE provides a more flexible disentanglement framework and achieves a more favorable overall trade-off between disentanglement and reconstruction than the benchmark VAE models; FVH-LT and DBSR-LS reliably uncover semantically meaningful and domain-relevant latent structures and generally yield consistent results; and LSSI provides an effective quantitative summary of latent structural separation.

Mirror Descent on Riemannian Manifolds

Jiaxin Jiang, Lei Shi, Jiyuan Tan — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.17527v2 Announce Type: replace-cross Abstract: Mirror Descent (MD) is a scalable first-order method widely used in large-scale optimization, with applications in image processing, policy optimization, and neural network training. This paper generalizes MD to optimization on Riemannian manifolds. In particular, we develop a Riemannian Mirror Descent (RMD) framework via reparameterization and further propose a stochastic variant of RMD. We also establish non-asymptotic convergence guarantees for both RMD and stochastic RMD. As an application to the Stiefel manifold, our RMD framework reduces to the Curvilinear Gradient Descent (CGD) method proposed in [26]. Moreover, when specializing the stochastic RMD framework to the Stiefel setting, we obtain a stochastic extension of CGD, which effectively addresses large-scale manifold optimization problems.
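
As background for the method being generalized, here is a minimal sketch of classical (Euclidean) mirror descent on the probability simplex with the negative-entropy mirror map, which gives the well-known exponentiated-gradient update. This is not the Riemannian variant proposed in the paper; the objective, step size, and function name are assumptions chosen for illustration.

```python
import numpy as np

def mirror_descent_simplex(grad, x0, step, iters):
    """Classical mirror descent with the negative-entropy mirror map.

    The update is multiplicative: x <- x * exp(-step * grad(x)), followed by
    renormalisation, which keeps the iterate on the probability simplex.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x * np.exp(-step * grad(x))
        x = x / x.sum()
    return x

# Illustrative problem (an assumption): minimise the linear cost <c, x> on the simplex.
c = np.array([0.3, 0.1, 0.6])
x = mirror_descent_simplex(lambda x: c, np.ones(3) / 3, step=0.5, iters=200)
```

For a linear cost the iterates concentrate on the coordinate with the smallest cost, here index 1, while remaining a valid probability vector at every step.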

Fusion Learning from Dynamic Functional Connectivity: Combining the Amplitude and Phase of fMRI Signals to Identify Brain Disorders

Jinlong Hu, Jiatong Huang, Zijian Cai — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2603.24603v2 Announce Type: replace-cross Abstract: Dynamic functional connectivity (dFC) derived from resting-state functional magnetic resonance imaging (fMRI) has been extensively utilized in brain science research. The sliding window correlation (SWC) method is a widely used approach for constructing dFC by computing correlation coefficients between amplitude time series of signals from pairs of brain regions. In this study, we propose an integrated approach that incorporates both amplitude and phase information of fMRI signals to improve the detection of brain disorders. Specifically, we introduce a multi-scale fusion learning framework, namely MSFL, which leverages two complementary dFC features derived from SWC and phase synchronization (PS). Here, SWC captures amplitude correlations, while PS measures phase coherence within dFC. We evaluated the efficacy of MSFL in classifying autism spectrum disorder and major depressive disorder using two publicly available datasets: ABIDE I and REST-meta-MDD, respectively. The results indicate that MSFL significantly outperforms existing comparative models. Moreover, we performed model explanation analysis using the SHAP framework, which showed that both types of dFC features from SWC and PS contribute to detecting brain disorders.
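
The two dFC features mentioned above can be sketched for a single pair of regional signals as follows. This is a minimal illustration, not the MSFL pipeline: the window length, the synthetic sinusoidal signals, and the choice of the cosine of the instantaneous phase difference as the phase-synchronization measure are all assumptions, and the analytic signal is computed with a plain FFT-based Hilbert transform.

```python
import numpy as np

def sliding_window_correlation(x, y, window, stride=1):
    """Pearson correlation of x and y inside each sliding window (amplitude coupling)."""
    starts = range(0, len(x) - window + 1, stride)
    return np.array([np.corrcoef(x[s:s + window], y[s:s + window])[0, 1] for s in starts])

def analytic_signal(x):
    """FFT-based analytic signal: keep DC, double positive frequencies, zero negative ones."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    if n % 2 == 0:
        weights[0] = weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[0] = 1.0
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def phase_synchronization(x, y):
    """Cosine of the instantaneous phase difference at each time point (assumed PS measure)."""
    dphi = np.angle(analytic_signal(x)) - np.angle(analytic_signal(y))
    return np.cos(dphi)

# Synthetic example: two sinusoids with a constant phase lag of pi/3.
t = np.arange(200)
omega = 2 * np.pi / 20                      # period of 20 samples
x = np.sin(omega * t)
y = np.sin(omega * t + np.pi / 3)

swc = sliding_window_correlation(x, y, window=40)   # window spans two full periods
ps = phase_synchronization(x, y)
```

In this synthetic case both measures equal cos(pi/3) = 0.5 throughout, because the signals are pure tones with a fixed lag. On real fMRI signals the two features generally differ, which is the complementarity the abstract describes.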

An Algebraic Introduction to Persistence

Ulrich Bauer, Thomas Br\"ustle, Luis Scoccola — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2604.07022v2 Announce Type: replace-cross Abstract: We introduce persistence with an emphasis on its algebraic foundations, using the representation theory of posets. Linear representations of posets arise in several areas of mathematics, including the representation theory of quivers and finite dimensional algebras, Morse theory and other areas of geometry, as well as topological inference and topological data analysis -- often via persistent homology. In some of these contexts, the category of poset representations of interest admits a metric structure given by the so-called interleaving distance. Persistence studies the algebraic properties of these poset representations and their behavior under perturbations in the interleaving distance. We survey fundamental results in the area, applications to pure and applied mathematics, advanced topics such as multiparameter persistence, as well as theoretical challenges and open questions.

Earth Science Foundation Models: From Perception to Reasoning and Discovery

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.12542v2 Announce Type: replace-cross Abstract: Large foundation models (FMs) are transforming Earth science by integrating heterogeneous multimodal data, such as multi-platform imagery, gridded reanalysis data, diverse geophysical and geochemical observations, and domain-specific text, to support tasks ranging from basic perception to advanced scientific discovery. This paper provides a unified review of Earth science foundation models (Earth FMs) through two complementary dimensions: depth, which traces the evolution of model capabilities from perception to multimodal reasoning and agentic scientific workflows, and breadth, which summarizes their expanding applications across the atmosphere, hydrosphere, lithosphere, biosphere, anthroposphere, and cryosphere, as well as coupled Earth system processes. Using this framework, we review representative multimodal Earth foundation models and compile more than 200 datasets and benchmarks spanning diverse Earth science tasks and modalities. We further discuss key challenges in multimodal data heterogeneity, scientific reliability and continual updating, scalability and sustainability, and the transition from foundation models to agentic and embodied Earth intelligence, and outline future directions toward more integrated, trustworthy, and actionable AI Earth scientists. Overall, this paper offers a structured roadmap for understanding the development of Earth foundation models from both capability depth and application breadth.

Sticky CIR process with potential: invariant measure and exact sampling

Tony Shardlow — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.13648v4 Announce Type: replace-cross Abstract: We study the sticky Cox--Ingersoll--Ross (CIR) process in one dimension, a diffusion on $[0,\infty)$ with a sticky boundary condition at the origin, arising as the marginal process in a sparse Bayesian inference framework based on Hadamard--Langevin dynamics. For the parameter range $\delta\in(1,2)$, in which the origin is accessible but not absorbing, we prove well-posedness of the process and uniqueness of its invariant measure, which is a mixture of a point mass at zero and a weighted gamma-type density on the interior. We derive an explicit Green's function for the resolvent in terms of confluent hypergeometric functions, and use this to construct an exact sampler for the invariant measure in the zero-potential case. For a non-trivial potential $G$, we establish existence and uniqueness of the tilted invariant measure via a Girsanov change of measure, and develop two sampling algorithms: a Metropolis--Hastings corrected sampler that targets the invariant measure exactly, and a cheaper, biased unadjusted Langevin algorithm (ULA) for a boundary-clamped variant of which we prove a first-order expansion of the stationary bias with an explicit constant: the leading error is a rank-one transfer of mass $K_\star h|\log h| $ onto the atom, so the total-variation bias is of exact order $h|\log h | $ -- independent of $\delta$ -- whenever the potential has nonzero boundary drift. Numerical experiments confirm the predicted behaviour: the Metropolis--Hastings sampler achieves the target invariant measure at all step sizes, while the ULA bias follows the proven first-order law, including its constant.

Measuring Control-Plane Openness in Near-Term Quantum Computing: A Rubric, Its Validation, and an Application to Thirteen Vendor Stacks

Rylan Malarchick — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.15233v2 Announce Type: replace-cross Abstract: Public access to pulse-level and control-electronics interfaces in commercial quantum computing has bifurcated. This paper proposes a six-axis rubric for measuring control-plane openness, the layer between gate-level circuit specification and physical control electronics, defined operationally so that the same evidence produces the same grade across vendors. The rubric is validated three ways: a blinded re-grading pass, thirty-nine days after the evidence cutoff, that tests whether the cited evidence and the level definitions alone reproduce the recorded grades; a boundary-case methodology that fixes where each level begins and ends; and a published grading protocol that lets others reproduce and contest any cell. We establish that the rubric measures change rather than describing a snapshot by comparing the catalog against the documented control plane before the February 2025 removal of pulse-level access from IBM hardware, and reporting the cells that moved. The rubric is applied to thirteen commercial vendors across superconducting, trapped-ion, neutral-atom, and photonic modalities as of May 1, 2026, as its first application, and one of the three harms the rubric is designed to detect is demonstrated through a reproduction-access audit of five pre-2025 IBM Qiskit Pulse experiments against the access available on current hardware, carried through to a client-side structural port of the audit's selected target to Rigetti Quil-T. The catalog ships as a separate machine-readable artifact under CC-BY-4.0 with per-cell source URLs (https://doi.org/10.5281/zenodo.20163276). The catalog readings will change as vendor policies shift; the rubric is the contribution that survives them.

Deep Learning-based Algebraic Reynolds Stress Closures for RANS Simulations of Turbulent Flows

Daniel Dehtyriov, Jonathan F. MacArt, Justin Sirignano — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.26358v2 Announce Type: replace-cross Abstract: Turbulence is ubiquitous in engineering and science, yet direct simulation is prohibitively expensive. The Reynolds-averaged Navier-Stokes (RANS) equations provide savings exceeding ten orders of magnitude but introduce unclosed terms (the closure problem). Offline-trained machine-learning (ML) closures suffer distribution shift in predictive simulations, while ML methods that bypass the governing equations struggle to generalise from scarce high-fidelity data. We develop a physics-derived deep learning closure model for RANS, the Deep Algebraic Reynolds Stress Model (DARSM), which can be trained on small datasets and accurately generalise across Reynolds numbers, to unseen geometries, and to different flow regimes. A neural network maps flow invariants to empirical parameters in an implicit algebraic Reynolds stress equation, derived from the Reynolds stress transport equations under the weak-equilibrium assumption, imposing physics-based structure on the ML closure. End-to-end optimisation through the governing PDEs and the coupled implicit closure eliminates distribution shift, but both unrolled and implicit automatic differentiation fail on the stiff coupled solver. We derive adjoint equations that exploit the solver's implicit-explicit structure for efficient optimisation. On canonical square-duct and periodic-hill benchmarks, DARSM reduces average test velocity error over baseline RANS by $2$-$4\times$ across Reynolds number, geometries, and flow regimes, with peak case-level reductions of $12\times$. The model trained on attached, anisotropy-dominated flows (square duct) accurately generalises without retraining to separated flows (periodic hills), a regime change in the underlying physics. DARSM also outperforms five established ML methods: offline training, tensor-basis neural networks, field-inversion machine learning, DeepONets, and physics-informed neural networks.

Diagnosing the conditional-mean barrier in scientific machine-learning surrogates

Junfeng Chen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.28076v2 Announce Type: replace-cross Abstract: Many problems in computational science and engineering become one-to-many after coarse graining, partial observation, or inverse reconstruction: a resolved state may not determine a unique subgrid forcing, a structural descriptor may not determine a unique effective response, and a low-resolution observation may correspond to many plausible high-resolution fields. In such settings, deterministic surrogates may learn a well-defined mathematical object while still missing application-relevant uncertainty. This tutorial develops a self-contained module centered on the conditional-mean barrier: the point at which a squared-loss predictor has reached the conditional mean and the remaining error is irreducible aleatoric variance. We give two diagnostics for locating this barrier, residual-feature orthogonality and the coefficient of determination against its explained-variance ceiling, and prove that adding latent randomness to a squared-loss predictor collapses it back to the conditional mean. Crossing the barrier therefore requires a loss that scores distributions rather than point predictions. We briefly organize common distributional objectives, including negative log-likelihood, moment and observable matching, variational objectives, adversarial divergences, and score matching, by the feature of the conditional law each targets. The emphasis is the boundary itself and a finite-data procedure for recognizing it, rather than a survey of methods beyond it. CPU-based demonstrations on a two-branch law and a two-scale Lorenz-96 closure problem show how the diagnostics distinguish deterministic underfitting from residual distributional variability.
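
The two diagnostics can be illustrated on a toy two-branch law. The sketch below is an assumption-laden stand-in for the tutorial's demonstrations: the specific law y = x ± gap, the linear least-squares surrogate, the polynomial test features, and the use of the known law to compute the explained-variance ceiling are all choices made here for illustration, not the paper's finite-data procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gap = 20000, 0.5

# Assumed two-branch law: for each x, y sits on one of two branches x - gap or x + gap.
# Hence E[y | x] = x and Var(y | x) = gap**2 (irreducible under squared loss).
x = rng.uniform(-1.0, 1.0, n)
branch = rng.choice([-1.0, 1.0], n)
y = x + gap * branch

# Squared-loss surrogate: ordinary least squares on (1, x).
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

# Diagnostic 1: R^2 against its explained-variance ceiling Var(E[y|x]) / Var(y).
r2 = 1.0 - resid.var() / y.var()
ceiling = x.var() / (x.var() + gap**2)      # uses the known law (an assumption)

# Diagnostic 2: residuals should be uncorrelated with functions of the input.
features = np.column_stack([x, x**2, x**3])
max_abs_corr = max(abs(np.corrcoef(resid, f)[0, 1]) for f in features.T)
```

Here R^2 sits at its ceiling (about 0.57) and the residuals are essentially uncorrelated with the input features, so the remaining error reflects the two-branch variability rather than underfitting. A larger deterministic model would not reduce it.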

Real-rootedness of the Poincar\'e polynomials of $\overline{\mathcal M}_{0,n}$: an AI-assisted proof

Gergely B\'erczi, Young-Hoon Kiem — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2605.29151v2 Announce Type: replace-cross Abstract: We prove real-rootedness for the Poincar\'e polynomial \[ P_n(t)=\sum_{i=0}^{n-3} \dim H^{2i}(\overline{\mathcal M}_{0,n};\mathbb{Q})t^i \] of the Deligne--Mumford moduli space $\overline{\mathcal M}_{0,n}$ of stable $n$-pointed rational curves, proving a conjecture of Aluffi--Chen--Marcolli. The proof starts from the Keel--Manin--Getzler recurrence, but its main new idea is a bivariate deformation $F_m(y,t)$ of the Poincar\'e polynomial. This deformation reveals a hidden interlacing structure not visible in the one-variable recurrence. For fixed $t<0$, the zero set of $F_m$ in the $y$-direction is controlled by a Sturm--Rolle argument on the interval $0

One Transit Is All You Need: Detecting Exoplanets Through Learned Stellar Behaviour with EXOVEIL

Pratik Priyanshu — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.02778v3 Announce Type: replace-cross Abstract: I present EXOVEIL, a transit detection system that learns what a star's brightness should look like and flags when reality disagrees. Unlike existing systems that require phase-folded input, EXOVEIL operates on raw flux time series and can detect planets that transit only once. A Transformer world model, trained on 16,499 Kepler light curves with transit-masked self-supervised learning, predicts expected stellar flux. A matched-filter detector with variance weighting extracts transit signals from the prediction residuals. A learned classifier (XGBoost) separates planets from false positives, achieving AUC 0.938 on Kepler DR25. Applied to single-transit injection-recovery, EXOVEIL recovers 32% of transits at 1000 ppm depth, a task where all classification-based systems score 0% by construction. A blind search of 3,737 Kepler stars yields 179 new transit-like signals not present in the DR25 TCE catalogue, including 46 monotransit candidates. Applied without retraining to 47 confirmed TESS planets in the PLATO LOPS2 field, EXOVEIL achieves 100% recovery, demonstrating zero-shot cross-mission transfer. At PLATO's 25-second cadence, detection reaches 100 ppm -- approaching the Earth-analog regime. I provide the first application of conformal prediction to transit detection (95.9% empirical coverage) and release the system as pip install exoveil with pretrained weights and a candidate catalogue.

Counterfactual Explanations for Deep Two-Sample Testing

Wei-Cheng Lai, Marco Simnacher, Christoph Lippert — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.04009v2 Announce Type: replace-cross Abstract: Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including kernel-based tests) can be ineffective on high-dimensional structured data such as images. Recent deep two-sample tests improve sensitivity in these settings by learning informative representations, yet they provide limited insight into which data features drive rejection of the null hypothesis $H_0$. To address this issue, we propose a counterfactual explanation framework for deep two-sample testing that generates sample-level edits moving observations from a source group toward a target group while explicitly reducing the discrepancy measured by the test. Our method combines a diffusion autoencoder with a pretrained deep two-sample test model and optimizes a maximum mean discrepancy (MMD) objective in the test model's representation space to produce plausible counterfactuals. We quantify distribution-level effects through changes in the test statistic and the resulting two-sample p-values. We evaluate the method on synthetic 2D shape datasets and two MRI cohorts. Across both settings, the counterfactual transformations consistently increase p-values relative to the original samples, indicating that the edited source set becomes statistically closer to the target distribution under the test. We measure minimality using LPIPS to ensure the counterfactuals remain close to the original samples. The resulting edits provide interpretable evidence of the features associated with the detected group differences. On MRI, the localized changes are consistent with known anatomical differences between cohorts.
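
To make the MMD objective concrete, the sketch below computes a biased squared-MMD estimate with a Gaussian (RBF) kernel and shows that moving source samples toward the target lowers it. Everything here is a simplified stand-in: raw 2D Gaussian points replace the test model's learned representations, a fixed translation replaces the diffusion-autoencoder edits, and the bandwidth and sample sizes are arbitrary assumptions.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared maximum mean discrepancy."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, size=(200, 2))   # hypothetical source group
target = rng.normal(loc=2.0, size=(200, 2))   # hypothetical target group
edited = source + 1.5                         # crude "counterfactual": shift toward target

mmd_before = mmd2_biased(source, target)
mmd_after = mmd2_biased(edited, target)
```

The edited set yields a smaller discrepancy than the original source set, which mirrors, in a much simpler setting, the reported effect that counterfactual edits bring the source group statistically closer to the target under the test.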

Palindrome complexity versus factor complexity

Jeffrey Shallit — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.08127v2 Announce Type: replace-cross Abstract: Let ${\bf x} = (a_i)_{i \geq 0}$ be an infinite word over a finite alphabet $\Sigma$. Let $\rho (n)$ be the factor complexity function for $\bf x$ and ${\rm Pal}(n)$ be the palindrome complexity function for $\bf x$. We give a new relationship between these two quantities; namely, if $\bf x$ is not ultimately periodic, then $$ \lim_{n \rightarrow \infty} {{ {\rm Pal} (n) \log ({\rm Pal} (n) + 1)} \over {\rho (n)}} = 0. $$ Furthermore, we prove that the numerator in this result is essentially optimal.

LLM can Read Spectrogram: Encoder-free Speech-Language Modeling

Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.10231v2 Announce Type: replace-cross Abstract: Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantic-rich representations consumable by the LLM. In this work, we instead ask: can an LLM learn to read Mel spectrograms directly, without a dedicated speech encoder? We propose Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches directly into the LLM through a linear projection, allowing the LLM to learn speech-text alignment purely through its own parameters. We conduct extensive experiments on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. For ASR, we evaluate on the OpenASR leaderboard public sets and production-level scaling experiments, demonstrating that the encoder-free solution achieves competitive performance with only limited degradation compared to encoder-initialized counterparts. We find that when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance. We also present ablation studies revealing which LLM layers are less relevant to speech encoding. For TTS, we show preliminary results with a next-token VAE approach. While TTS performance is not yet optimal, these results establish the feasibility of a fully unified encoder-free architecture for autoregressive speech-text modeling.

Fundamentals of NOMA in Low-Earth Orbit Coordinated Multi-Satellite Networks

Xiangyu Li, Bodong Shang, Junchao Ma, Qingqing Wu, Jie Feng, Deshuang Huang — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.10301v2 Announce Type: replace-cross Abstract: Coordinated multi-satellite (CoMS) transmission and non-orthogonal multiple access (NOMA) are envisioned to jointly enhance coverage, capacity, and spectrum efficiency for satellite networks. Their integration into a unified CoMS-NOMA framework will allow more efficient, reliable, and energy-efficient multi-user access. This paper investigates the downlink performance of CoMS-NOMA networks from a system-level perspective, in which multiple satellites cooperatively serve multiple users via NOMA. Leveraging tools from stochastic geometry, related angles and distances in CoMS-NOMA are first derived as intermediate results. Then, we obtain the combined signal power distributions and analyze coverage and spectrum performance under both inter- and intra-satellite interference, accounting for potential imperfect successive interference cancellation (SIC). The analytical model is validated across a range of system parameters, including the number of satellites, service region angle, error-propagation factor, and power allocation coefficients. Numerical results indicate that increasing the number of cooperative satellites does not always improve coverage and spectrum efficiency. Additionally, while a higher main-lobe gain improves coverage, a near-perfect SIC provides only slightly greater benefits than a reasonably good SIC. With properly selected power allocation coefficients, CoMS-NOMA achieves up to a 270% improvement in coverage and a 56% gain in sum spectral efficiency, compared with conventional orthogonal and single-satellite schemes, indicating potential for green, energy-efficient satellite networking.

Fixed-Threshold One-Bit Toeplitz Covariance Estimation under Sparse-Ruler Sampling

Zhiyong Cheng, Shengyao Chen — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11110v2 Announce Type: replace-cross Abstract: We study Toeplitz covariance estimation when fixed-threshold one-bit quantization is combined with deterministic sparse-ruler sampling, so that each observed bit is reused across many lag products. At a nonzero threshold the signs have nonzero mean, and this reuse gives raw sign products a coherent one-vertex variance component governed by weighted row sums; centering removes it and leaves a degenerate sparse-pair statistic. We prove a Gaussian variance contraction theorem for hollow quadratic forms of bounded coordinate transforms, including hard threshold signs: the variance is bounded by the squared correlation operator norm times the squared Frobenius norm of the edge weights, with constants independent of dimension, support size and maximum degree. For the oracle centered sparse-ruler estimator, the leading operator-norm term is $\gamma_0L_1\kappa_{\rm obs}\sqrt{\varphi(\Omega)\log d/n}$, where $\varphi(\Omega)=\sum_{s=1}^{d-1}q_s^{-1}$ is the coverage coefficient of the ruler; pooled marginal calibration from the $n|\Omega|$ observed bits adds a plug-in term. A spectral-packing lower bound in a known-scale identity-neighborhood submodel shows that this dependence is intrinsic under balanced coverage geometry; in the non-saturated regime where the coverage term dominates, the oracle estimator is minimax rate optimal over this submodel.
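
The coverage coefficient can be computed directly for a small ruler. The abstract does not define $q_s$; the sketch below assumes $q_s$ is the number of sampled index pairs in $\Omega$ whose separation equals lag $s$, which is a common convention but should be checked against the paper. The function names and the example ruler are illustrative.

```python
from collections import Counter
from itertools import combinations

def lag_counts(ruler):
    """Number of index pairs in the ruler at each positive lag (assumed meaning of q_s)."""
    return Counter(b - a for a, b in combinations(sorted(ruler), 2))

def coverage_coefficient(ruler, d):
    """phi(Omega) = sum over lags s = 1..d-1 of 1 / q_s; requires every lag to be covered."""
    q = lag_counts(ruler)
    missing = [s for s in range(1, d) if q[s] == 0]
    if missing:
        raise ValueError(f"lags not covered by the ruler: {missing}")
    return sum(1.0 / q[s] for s in range(1, d))

d = 7
ruler = [0, 1, 4, 6]                              # covers each lag 1..6 exactly once
phi_sparse = coverage_coefficient(ruler, d)       # every q_s = 1
phi_full = coverage_coefficient(range(d), d)      # full sampling: q_s = d - s
```

Under this assumed definition, the four-mark ruler gives a coverage coefficient of 6, while sampling all seven positions gives 2.45, so sparser rulers pay for fewer sensors with a larger coverage term in the error rate.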

Artificial Intelligence in Ship Finance: Applications, Opportunities, and a Case Study in AI-Augmented Loan Origination

Lasse Dierich, Orestis Schinas — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11238v2 Announce Type: replace-cross Abstract: Ship finance is a data-intensive and document-heavy segment of asset-based lending, requiring the integration of financial, technical, contractual, and regulatory information from heterogeneous and largely unstructured sources. Increasing environmental regulation and ESG reporting requirements are adding further complexity to underwriting and loan-origination processes. Recent advances in artificial intelligence (AI), particularly large language models (LLMs), create new opportunities for processing and analysing such information. This paper reviews potential applications of AI in ship finance, with a particular focus on LLM-based systems for document comprehension, information extraction, and workflow automation. We present ShipFinance.ai, a modular agentic architecture to support loan application workflows in ship finance. The proposed system combines an LLM-based extraction module, financial analysis components, external maritime data services, and a controlled document-generation module with a chatbot interface to support the preparation of standardized financing applications. The paper discusses the key challenges for using such models in production. We argue that AI-assisted systems can support maritime finance professionals in managing increasingly complex information and reporting requirements.

Physically Constrained Ensemble Gaussian Process Modelling for Expensive Quantum Systems with Heteroskedastic Noise

Arpan Biswas, Sutirtha Paul, Joseph Agada, Matthias Thamm, Adrian Del Maestro — Fri, 12 Jun 2026 00:00:00 -0400

arXiv:2606.11240v2 Announce Type: replace-cross Abstract: Accurate modeling of quantum many-body systems often requires computationally expensive simulations such as Density Matrix Renormalization Group (DMRG) or Quantum Monte Carlo (QMC) calculations. These methods, while precise, impose significant time and resource constraints, limiting their use in exhaustive parameter exploration. Moreover, these expensive simulations can contain variable errors over the large unknown parameter space, which need to be quantified and propagated. Predictive modelling is therefore required to estimate the functional space accurately from sparsely sampled data with heteroskedastic noise, while preserving the physical relevance of the estimation. We present a Physically Constrained Ensemble Gaussian Process (pc-EGP) framework designed to efficiently model complex and noisy quantum systems under physical consistency constraints. The proposed method first enforces physical constraints as a user-controlled weighted penalty added to the data-driven loss function of the Gaussian Process (GP) surrogates. Then an ensemble of such GP models is trained on simulations with variable noise via a numerical quadrature method, in which the GPs at different nodes are combined as a quadrature-weighted average. We first demonstrate the framework on synthetically generated data before applying it to quantum systems. In the first case study, we leverage DMRG simulations of the Bose-Hubbard Model to predict the critical interaction parameter Uc governing the superfluid-to-Mott-insulator transition. In the second case study, we demonstrate our method on QMC simulations of a quantum liquid confined inside a nanoporous silicate with the goal of optimizing a chemical environment to realize a one-dimensional superfluid. Compared to a conventional GP, pc-EGP achieves a better balance of accuracy and physically meaningful predictions.