Research Paper

Business Learning Models

Jasminder Singh Gulati

jsg@deepnative.ai

Native AI Labs Inc.

Prashanth Jaganathan

prashanth@deepnative.ai

Native AI Labs Inc.

Manas Taneja

manas@deepnative.ai

Native AI Labs Inc.

Native

research@deepnative.ai

Native AI Labs Inc.

Names are listed in alphabetical order; the ordering carries no information about relative contribution. All authors contributed equally.


Abstract

We propose Business Learning Models (BLMs), a framework for compact, domain-specialized language models designed to function as AI co-founders for enterprise operations. Unlike generic large language models that prioritize breadth, BLMs deliberately specialize on single-organization communication patterns through a two-stage architecture: a 270M parameter tool selection model deployed on edge devices for low-latency action routing, and a response generation model that learns to reason about business context. Both components are trained using verifiable reward signals rather than learned reward models, with the tool selector employing RL-enhanced supervised fine-tuning on execution outcomes and the response generator using Reasoning with Next-Token Prediction (RNTP) to discover effective reasoning strategies. This work addresses a critical gap by prioritizing deep expertise in specific business verticals and small-to-medium enterprises over broad general knowledge, while maintaining data sovereignty.

1 Introduction

Large language models have demonstrated potential for diverse applications, yet a fundamental mismatch exists between their design and business operations. Generic models trained on web-scale data excel at breadth but struggle with depth, possessing surface-level knowledge about millions of topics but lacking nuanced understanding of any single organization's communication patterns, decision-making processes, and operational context [1]. This limitation becomes critical when businesses seek an AI co-founder capable of understanding how the business works, making informed decisions under uncertainty, and evolving alongside the organization.

We propose Business Learning Models (BLMs), a framework for compact, domain-specialized models that learn directly from organizational communication streams. The architecture comprises two components: a tool selection system (270M parameters) that routes user intents to appropriate actions, and a response generation system that learns to reason about business context. The tool selector is trained using an agentic RAG system that demonstrates successful workflows, with reinforcement learning refining behavior based on execution outcomes. The response generator employs Reasoning with Next-Token Prediction (RNTP) [4], treating actual message continuations as verifiable rewards and discovering effective reasoning strategies by comparing multiple internal thought processes.

This approach addresses challenges that distinguish business operations from well-defined domains. Unlike mathematics or programming, where correctness is binary, business decisions involve judgment calls, evolving priorities, and context-dependent trade-offs [2]. The model must navigate ambiguity and adapt as the business changes. Traditional machine learning pipelines, which assume static datasets and fixed evaluation metrics, are fundamentally misaligned with this reality; continuous learning from communication streams, as proposed here, instead enables perpetual adaptation as the organization evolves.

This paper makes three contributions: (1) an integrated two-stage architecture combining compact tool selection with reasoning-capable response generation, designed for edge deployment and continuous adaptation; (2) adaptation of reinforcement learning advances [4, 5, 6] to business communications, demonstrating how verifiable training signals can replace expensive human feedback; (3) a deployment model where businesses own their AI systems rather than relying on centralized cloud providers, addressing data privacy, operational cost, and strategic autonomy. While this represents a research proposal requiring empirical validation, the architecture is grounded in established techniques and addresses a clear gap in enterprise AI deployment strategies.

2 Proposed Architecture

We propose a two-stage architecture for Business Learning Models (BLM) designed to address enterprise AI deployment challenges: creating compact, domain-specialized models capable of real-time learning from business communications. Unlike generic large language models that prioritize breadth, our BLM framework targets deliberate overfitting on company-specific data streams to achieve superior performance on organizational tasks at a fraction of the computational cost.

2.1 System Overview

The BLM system comprises two specialized components operating in tandem:

  • Stage 1: Tool Selection SLM (270M parameters). A compact model based on Gemma3 that maps user intents to appropriate tool invocations. Deployed on edge devices for low-latency, private tool routing. Trained via RL-enhanced supervised fine-tuning on traces collected from an agentic RAG system.
  • Stage 2: Response Generation Model. A reasoning-capable model trained using Reasoning with Next-Token Prediction (RNTP). Learns to generate internal reasoning tokens that improve prediction of actual business continuations, with reasoning strategies discovered through reinforcement learning rather than supervised imitation.
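The two stages compose as follows: Stage 1 routes the intent to a tool, the tool executes, and Stage 2 reasons over the result before replying. The sketch below illustrates this flow with hypothetical stand-in functions and a toy tool registry; none of the names (`select_tool`, `generate_response`, `TOOLS`) come from an actual implementation.

```python
# Minimal sketch of the two-stage BLM pipeline. Both model components are
# stand-ins: a real deployment would call the 270M tool selector and the
# RNTP-trained response generator here.

def select_tool(query: str, context: list[str]) -> dict:
    """Stage 1 stand-in: map a user intent to a structured tool invocation."""
    if "invoice" in query.lower():
        return {"name": "fetch_invoices", "arguments": {"status": "unpaid"}}
    return {"name": "search_messages", "arguments": {"text": query}}

def generate_response(query: str, tool_result: dict) -> str:
    """Stage 2 stand-in: emit reasoning tokens, then the visible reply."""
    n = len(tool_result["items"])
    reasoning = f"<think>query: {query}; tool returned {n} items</think>"
    return reasoning + f"<response>Found {n} matching records.</response>"

# Toy MCP-style tool registry with canned results.
TOOLS = {
    "fetch_invoices": lambda args: {"items": ["INV-017", "INV-021"]},
    "search_messages": lambda args: {"items": []},
}

def run_pipeline(query: str) -> str:
    call = select_tool(query, context=[])
    result = TOOLS[call["name"]](call["arguments"])
    return generate_response(query, result)

print(run_pipeline("Show me unpaid invoices"))
```

In production, the tool result would also be appended to the trace tuple described in Section 2.2 for later training.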

2.2 Stage 1: Edge-Optimized Tool Selection via RFT

The first training stage produces a specialized Tool Selection SLM capable of mapping user intents to appropriate MCP (Model Context Protocol) tool invocations. We propose an edge-based deployment strategy that enables low-latency, private tool routing while maintaining sophisticated reasoning capabilities for complex enterprise workflows.

Model Architecture and Edge Deployment Strategy

We base our Tool Selection SLM on the Gemma3 architecture with approximately 270 million parameters. At this scale, the model can be deployed on edge devices, enabling offline operation without centralized cloud dependency. The model executes in 4-bit quantized format on commodity hardware with approximately 1GB memory footprint.
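A back-of-envelope check of the quoted footprint: 270M weights at 4 bits each occupy roughly 135 MB, with the remainder of the ~1GB budget available for the KV cache, activations, and runtime overhead. The figures below are plain arithmetic under those stated assumptions, not measurements of a deployed system.

```python
# Back-of-envelope memory estimate for a 4-bit quantized 270M-parameter model.
params = 270_000_000
bits_per_weight = 4

weights_mb = params * bits_per_weight / 8 / 1e6   # bytes -> MB
headroom_mb = 1000 - weights_mb                   # left within a ~1GB budget

print(f"quantized weights: {weights_mb:.0f} MB")  # → 135 MB
print(f"headroom for KV cache / runtime: {headroom_mb:.0f} MB")
```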

The model receives structured inputs encoding the user query x, conversational context c, available tool schema descriptions S = {s₁, s₂, ..., sₖ}, and historical tool usage patterns. Leveraging FunctionGemma's native support for structured formatting conventions such as <start_function_call> and <end_function_call> tokens, the model generates well-formed JSON tool invocations with extracted parameters that interface directly with the MCP protocol layer.
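Parsing the model's structured output into an MCP invocation can be sketched as below. The delimiter tokens follow the convention named above; the completion text and tool name are fabricated for illustration.

```python
# Sketch: extract a JSON tool invocation from between the structured
# function-call markers described above. The completion is a made-up example.
import json
import re

completion = (
    "<start_function_call>"
    '{"name": "create_ticket", "arguments": {"title": "Refund request", "priority": "high"}}'
    "<end_function_call>"
)

match = re.search(r"<start_function_call>(.*?)<end_function_call>", completion, re.S)
call = json.loads(match.group(1))

print(call["name"], call["arguments"]["priority"])  # → create_ticket high
```

A production parser would additionally validate the extracted arguments against the tool schema sᵢ before dispatching to the MCP layer.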

Dynamic Data Acquisition via Agentic RAG

We propose a dynamic Agentic RAG-to-Training pipeline that captures high-fidelity business context during actual workflow execution. During an initial collection phase, a larger centralized LLM acts as the primary agent, performing tool invocations through a RAG-augmented process. For every user interaction, the system records a complete trace tuple:

(x, c, [(t₁, r₁), ..., (tₙ, rₙ)], y)   (1)

where x represents the user query, c captures the conversational context, [(tᵢ,rᵢ)] represents the ordered chain of tool invocations and their outputs, and y encodes the user signal indicating satisfaction.

The user signal y captures both explicit and implicit acknowledgments derived from natural language processing of user responses and behavioral patterns. The trace tuple structure captures multi-step reasoning chains, allowing the Tool Selection SLM to learn both individual tool mappings and reasoning patterns that govern tool composition.
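One way to represent the trace tuple of Eq. (1) in code is sketched below; the field names and the example values are illustrative, not a fixed schema.

```python
# Illustrative container for the trace tuple (x, c, [(t_i, r_i)], y).
from dataclasses import dataclass

@dataclass
class ToolStep:
    invocation: dict   # t_i: tool name plus extracted arguments
    output: dict       # r_i: result returned by the tool

@dataclass
class Trace:
    query: str            # x: the user query
    context: list[str]    # c: conversational context
    steps: list[ToolStep] # ordered chain [(t_1, r_1), ..., (t_n, r_n)]
    user_signal: float    # y: satisfaction score derived from user feedback

trace = Trace(
    query="When is the Acme contract up for renewal?",
    context=["prior thread about the Acme account"],
    steps=[ToolStep(
        invocation={"name": "search_contracts", "arguments": {"client": "Acme"}},
        output={"renewal_date": "2025-03-01"},
    )],
    user_signal=0.9,  # e.g. inferred from an explicit "thanks, that's it"
)
print(len(trace.steps), trace.user_signal)
```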

RL-Enhanced Supervised Fine-Tuning

We follow an integrated training approach combining supervised learning with reinforcement learning to optimize tool selection for both syntactic correctness and execution quality. The training process operates on collected trace tuples using a Reward Labeling Model Rφ that scores user responses on a continuous scale. The unified training objective combines cross-entropy loss with reward-weighted optimization:

ℒ(θ) = −Σᵢ w(yᵢ) log π_θ(tᵢ | xᵢ, cᵢ, S) + β D_KL(π_θ ‖ π_ref)   (2)

where w(yᵢ) = Rφ(yᵢ) represents the quality weight derived from user feedback, and the KL divergence term maintains proximity to a reference policy. The model generates candidate tool sequences evaluated through a composite reward combining execution success, user satisfaction, and efficiency metrics.
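A toy numeric illustration of the objective in Eq. (2): reward-weighted cross-entropy over target tokens plus a KL penalty to the reference policy. All probabilities, weights, and the β value are made up for the example, and the 3-token vocabulary stands in for a real tokenizer.

```python
# Toy instance of the RL-enhanced SFT objective: weighted CE + beta * KL.
import numpy as np

def objective(policy_probs, ref_probs, target_ids, weights, beta=0.1):
    # Reward-weighted cross-entropy on the supervised tool-call tokens.
    ce = -np.array([np.log(p[t]) for p, t in zip(policy_probs, target_ids)])
    weighted_ce = np.sum(weights * ce)
    # KL(pi_theta || pi_ref), averaged over positions.
    kl = np.mean(np.sum(policy_probs * np.log(policy_probs / ref_probs), axis=1))
    return weighted_ce + beta * kl

policy = np.array([[0.7, 0.2, 0.1],   # per-position distributions (made up)
                   [0.1, 0.8, 0.1]])
ref    = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.6, 0.2]])

# weights w(y_i) from the reward labeling model; here 1.0 = satisfied user.
loss = objective(policy, ref, target_ids=[0, 1], weights=np.array([1.0, 0.5]))
print(round(float(loss), 4))
```

Note that a trace with a low user-satisfaction weight contributes proportionally less gradient, which is how execution quality shapes the supervised signal.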

3 Stage 2: Response Generation via RNTP

The second training stage produces a Response Generation Model capable of contextually appropriate replies by learning to reason before responding. We adapt the Reasoning with Next-Token Prediction (RNTP) framework proposed by Morris (2024) to the business communication domain, treating the entire corpus of enterprise conversations as verifiable training data where the model learns reasoning patterns through next-token prediction.

3.1 Core RNTP Formulation

RNTP treats reasoning as a latent variable optimized via reinforcement learning rather than supervised imitation. Given that business communication is non-stationary (terminology, processes, and priorities change over time), we avoid relying on fixed "gold" reasoning traces and instead train reasoning only to the extent that it improves prediction of real organizational continuations.

We initialize from a small base model. For a conversational context c, the model generates a sequence with reasoning tokens inside <think> tags followed by response tokens inside <response> tags. For each context, we sample G candidate reasoning chains. The reward for a candidate is defined as the log-likelihood of the ground-truth continuation r observed in business data, conditioned on that reasoning chain:

R(z_g) = log π_θ(r | c, z_g) = Σₜ log π_θ(rₜ | c, z_g, r_<t)   (3)

Reasoning tokens receive no direct supervision; the model is rewarded only when a reasoning chain increases predictive accuracy of r. We optimize the policy using Group Relative Policy Optimization (GRPO) [2] with a KL regularizer to a reference policy π_ref. The advantage A_g is computed by normalizing rewards within the sampled group:

A_g = (R(z_g) − mean({R(z₁), ..., R(z_G)})) / std({R(z₁), ..., R(z_G)})   (4)

and the policy is updated with the clipped GRPO objective, where ρ_g = π_θ(z_g | c) / π_θ_old(z_g | c):

J(θ) = E[(1/G) Σ_g min(ρ_g A_g, clip(ρ_g, 1 − ε, 1 + ε) A_g)] − β D_KL(π_θ ‖ π_ref)   (5)
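The reward and group normalization above can be sketched numerically. The per-token probabilities of the ground-truth continuation under each sampled reasoning chain are fabricated; the computation itself follows the log-likelihood reward of Eq. (3) and standard within-group advantage normalization.

```python
# Sketch of the RNTP reward and GRPO group-normalized advantage.
import numpy as np

def rntp_reward(token_probs):
    """R(z_g) = sum_t log p(r_t | c, z_g, r_<t) for the true continuation r."""
    return float(np.sum(np.log(token_probs)))

# Likelihood the model assigns to the ground-truth continuation under
# G = 3 sampled reasoning chains (probabilities are made up).
rewards = np.array([
    rntp_reward([0.5, 0.6, 0.4]),   # chain 1
    rntp_reward([0.7, 0.8, 0.6]),   # chain 2: more helpful reasoning
    rntp_reward([0.3, 0.4, 0.2]),   # chain 3: unhelpful reasoning
])

# Group-relative advantage: normalize within the sampled group.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(np.round(advantages, 3))
```

The chain whose reasoning best predicts the real continuation gets a positive advantage and is reinforced; no learned reward model is involved.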

4 Advantages of the Proposed Approach

The BLM architecture offers several advantages over generic LLM approaches for enterprise AI deployment.

  • Parameter efficiency. By specializing on a single organization's communication patterns, the model achieves task-relevant performance with 30–100× fewer parameters than general-purpose LLMs.
  • Edge deployment for tool selection. The 270M parameter tool selector operates entirely on edge devices without external connectivity, with sub-100ms latency.
  • No reward model dependency. Both stages rely on verifiable signals rather than learned reward models.
  • Scalability to entire communication corpus. RNTP treats every conversation thread as valid training data.
  • Automatic reasoning discovery. The response generator discovers effective reasoning strategies through reinforcement learning.
  • Continuous adaptation. Both models can update periodically via low-rank adaptation [6].
  • Data sovereignty. On-premise deployment allows businesses to maintain complete control over their data.
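The continuous-adaptation point can be made concrete with a minimal sketch of the low-rank update from [6]: the frozen weight W is modified only through a rank-r product BA, so periodic retraining on fresh communication data touches a small fraction of the parameters. Shapes and initialization scales below are illustrative.

```python
# Minimal numpy sketch of LoRA: W stays frozen; only A and B are trained.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero init

def forward(x):
    # Adapted layer: W x + B (A x); identical to the base layer until B learns.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
assert np.allclose(forward(x), W @ x)        # no change before adaptation

trainable, frozen = A.size + B.size, W.size
print(trainable, frozen, round(frozen / trainable, 1))
```

Here a 64×64 layer exposes only 512 trainable parameters (an 8× reduction), and the ratio grows with layer width, which is what makes frequent on-premise re-adaptation affordable.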

5 Conclusion

This paper proposes Business Learning Models, a two-stage architecture that combines edge-deployed tool selection (270M parameters) with reasoning-capable response generation for enterprise AI systems. By adapting recent advances in reinforcement learning to business communication domains, the framework enables compact models to learn directly from organizational conversation streams without requiring curated datasets or learned reward models. Although the architecture is theoretically grounded in established techniques, the approach requires comprehensive empirical validation across multiple business domains to verify the claimed advantages of parameter efficiency, continuous adaptation, and edge deployment feasibility.

References

  [1] Belcak, P., et al. (2025). Small Language Models are the Future of Agentic AI. arXiv:2506.02153.
  [2] Shao, Z., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300.
  [3] Liu, Z., et al. (2024). MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. ICML, PMLR 235.
  [4] Morris, J. (2024). How to Scale RL to 10^26 FLOPs. Blog post.
  [5] Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290.
  [6] Hu, E. J., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
