Feat/async ticks and networking#3

Merged
elprofesoriqo merged 10 commits into main from feat/async-ticks-and-networking
Mar 31, 2026

@elprofesoriqo

1. Core Compute & Scalability (RL Fundamentals)

To train agents at an enterprise or research scale, the environment must be capable of generating millions of steps per second.

  • Native Ray RLlib & PettingZoo Integration: Complete decoupling of the environment state from the learning logic. Each environment instance must keep no shared or global mutable state (thread-safe), so that thousands of Rollout Workers can run in parallel on CPU/GPU clusters.
  • Hierarchical Action Spaces (Auto-Regression): Transition from a flat discrete space to Dict or MultiDiscrete spaces. The agent makes sequential decisions within a single tick: Category (e.g., Exploit) -> Target (IP Mask) -> Payload (CVE). Factoring the choice this way keeps the number of policy outputs additive in the categories, targets, and payloads instead of multiplicative, which curbs the combinatorial explosion of the search space.
  • Procedural Topology Generation (Domain Randomization): Every env.reset() call dynamically generates a new graph-based topology (via NetworkX) that follows a logical business architecture (Internet -> DMZ -> Intranet -> Secure Zone). This is intended to support zero-shot transfer to unseen networks and to prevent overfitting to a static map.
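
The layered generation described in the last bullet can be sketched as follows. This is a minimal illustration, not the code in this PR: it uses only the Python standard library instead of NetworkX, and `ZONES`, `generate_topology`, and `hosts_per_zone` are hypothetical names chosen for the example.

```python
import random

# Hypothetical zone ordering, taken from the architecture named above.
ZONES = ["Internet", "DMZ", "Intranet", "SecureZone"]


def generate_topology(seed, hosts_per_zone=(1, 3, 5, 2)):
    """Return (zone_of, edges) for a layered network derived from `seed`.

    Assumption: hosts only link to the zone directly outside their own,
    so traffic must cross every layer on the way to the secure zone.
    """
    rng = random.Random(seed)  # per-call RNG, no shared global state
    zone_of = {}
    members = []
    for zone, max_hosts in zip(ZONES, hosts_per_zone):
        count = rng.randint(1, max_hosts)
        names = [f"{zone.lower()}-{i}" for i in range(count)]
        for name in names:
            zone_of[name] = zone
        members.append(names)

    edges = set()
    for outer, inner in zip(members, members[1:]):
        # Every inner host gets one link to a random host in the outer zone.
        for host in inner:
            edges.add((rng.choice(outer), host))
    return zone_of, sorted(edges)


if __name__ == "__main__":
    zone_of, edges = generate_topology(seed=7)
    for src, dst in edges:
        print(f"{src} ({zone_of[src]}) -> {dst} ({zone_of[dst]})")
```

Passing a fresh seed on every reset gives a different map per episode, while reusing a seed reproduces the same map for debugging.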

2. Semantic & Operational Realism (Hyper-Realism)

Minimizing the Sim-to-Real gap to ensure agent strategies are viable in actual corporate networks.

  • Strict POMDP with SIEM Noise: Implementation of a "Fog of War" with a 2-tick latency. The Blue Agent receives only an alert vector and never sees the true underlying network state. False Positives (FPs) are stochastically injected into this vector to mimic anomalies in background business traffic.
  • Continuous Action Spaces for Defenders: Instead of binary "enable/disable firewall" toggles, the Blue Agent can tune anomaly detection thresholds continuously (float values in $[0.0, 1.0]$), forcing the agent to balance tight security against operational business costs.
  • MITRE ATT&CK & D3FEND Standardization: The ActionRegistry pattern dictates that every action class contains strict metadata (e.g., tactics: ["TA0008"]). This enables the generation of human-readable "Playbooks" once the agent is trained.
  • Active Sessions & Active Directory: Introduction of attack persistence (C2 Sessions). Simulation of Windows Domains where extracting a Domain Admin hash radically alters lateral movement vectors (Pass-the-Hash mechanics).
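
The delayed, noisy alert vector from the first bullet can be sketched as a small buffer. This is an illustrative sketch under stated assumptions: `DelayedAlertChannel`, `fp_rate`, and the one-bit-per-host encoding are hypothetical and not taken from the PR.

```python
import random
from collections import deque


class DelayedAlertChannel:
    """Delays true events by `latency` ticks and adds false positives."""

    def __init__(self, n_hosts, latency=2, fp_rate=0.05, seed=0):
        self.fp_rate = fp_rate
        self.rng = random.Random(seed)
        # Pre-fill with empty vectors so the first `latency` ticks
        # carry no true events at all.
        self.buffer = deque([0] * n_hosts for _ in range(latency))

    def step(self, true_events):
        """Queue this tick's true events; return what Blue observes now."""
        self.buffer.append(list(true_events))
        delayed = self.buffer.popleft()
        # A bit is set if the event really happened `latency` ticks ago,
        # or if a false positive fires for that host on this tick.
        return [
            1 if bit or self.rng.random() < self.fp_rate else 0
            for bit in delayed
        ]


if __name__ == "__main__":
    channel = DelayedAlertChannel(n_hosts=4, latency=2, fp_rate=0.1, seed=42)
    for tick, events in enumerate([[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]):
        print(tick, channel.step(events))
```

Because the observation never contains the current tick's events, the defender has to act on stale and partially wrong information, which is the point of the POMDP framing.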

3. Game Theory & Research Infrastructure

Training cannot rely on static heuristic bots. It requires a co-evolutionary ecosystem.

  • MARL League Training (Self-Play): Red and Blue agents evolve simultaneously. The system maintains a population of diverse agent "personalities" (e.g., a Red agent optimized for stealth vs. a Red agent optimized for aggressive ransomware deployment).
  • Empirical Game-Theoretic Analysis (EGTA): An analytical module to compute Nash Equilibrium and Exploitability metrics. The latest Red agent continuously evaluates against historical checkpoints of the Blue agent on the league table to prevent Catastrophic Forgetting of older evasion tactics.
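
As an illustration of the exploitability metric, the sketch below treats the league table as a two-player zero-sum matrix game. The assumptions are loud ones: rows are Red checkpoints, columns are Blue checkpoints, entries are Red's average payoff, and the function names are hypothetical. It computes the sum of both sides' best-response gains, which is zero exactly at a Nash equilibrium of the empirical game.

```python
def row_values(payoff, col_mix):
    """Expected payoff of each Red checkpoint against Blue's mixture."""
    return [sum(a * q for a, q in zip(row, col_mix)) for row in payoff]


def col_values(payoff, row_mix):
    """Expected Red payoff conceded by each Blue checkpoint to Red's mixture."""
    n_cols = len(payoff[0])
    return [
        sum(p * payoff[i][j] for i, p in enumerate(row_mix))
        for j in range(n_cols)
    ]


def exploitability(payoff, row_mix, col_mix):
    """Sum of best-response gains for both players in a zero-sum game.

    Red gains max_i (A y)_i - x^T A y by deviating, Blue gains
    x^T A y - min_j (x^T A)_j, so the total is max - min.
    """
    return max(row_values(payoff, col_mix)) - min(col_values(payoff, row_mix))


if __name__ == "__main__":
    # Matching pennies as a stand-in for an empirical payoff table.
    table = [[1.0, -1.0], [-1.0, 1.0]]
    print(exploitability(table, [0.5, 0.5], [0.5, 0.5]))
    print(exploitability(table, [1.0, 0.0], [0.5, 0.5]))
```

A mixture over checkpoints with low exploitability is hard to beat with any single historical opponent, which is one way to detect regressions against older tactics.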

4. "Beyond SOTA" Innovations (Exclusive to NetForge_RL)

Features designed to outclass current academic and commercial simulators:

  • A. Cloud-Native & Ephemeral Resources (K8s):
    Current SOTA environments simulate static bare-metal servers. NetForge_RL will introduce "Ephemerality." Nodes (representing Kubernetes Pods) can be automatically destroyed and spun up by an Auto-Scaler every few dozen ticks. The Red agent must learn to infect base images (Supply Chain) or execute Container Breakouts before their foothold evaporates.

  • B. Active Deception (Honeypots & Honeytokens):
    The Blue agent gains actions to inject fake credentials into RAM or spin up decoy services. When the Red agent ingests this data, its internal observation vector is "poisoned," forcing it to waste ticks attacking nonexistent targets while generating critical-priority alerts.

  • C. Zero-Trust Architecture (ZTA) Simulation:
    Moving beyond perimeter firewalls, the environment simulates continuous identity verification. Lateral movement success depends on a dynamic "Trust Score" attached to the session token, which degrades over time or upon anomalous behavior.

  • D. LLM-Driven SIEM Log Generation:
    Instead of returning hardcoded strings, NetForge_RL can optionally pipe action vectors through a lightweight local Large Language Model to generate highly realistic, unstructured system event logs (e.g., Sysmon data), forcing the Blue agent's pipeline to process noisy, real-world text data.

  • E. Stochastic Human Factor & Social Engineering (NPC Dynamics):
    Networks are operated by humans. The environment introduces simulated "User Nodes" (NPCs) that generate background traffic and possess a stochastic "Vulnerability Score". The Red agent can execute SpearPhishing or WateringHole attacks. The Blue agent can counter with SecurityAwarenessTraining, which temporarily reduces the users' susceptibility but costs operational budget. This forces agents to account for human error, not just software bugs.

  • F. Cyber-Physical Convergence (ICS/OT & SCADA Segments):
    Moving beyond data exfiltration, the environment includes Operational Technology (OT) subnets representing physical infrastructure (e.g., PLCs, cooling systems, power grids). Compromising these nodes manipulates continuous physical state variables (e.g., temperature, pressure). This allows research into catastrophic "Kinetic Impact" scenarios, where the reward function shifts from digital access to physical process disruption.

  • G. Attack Economics & Asymmetric Resource Budgets:
    Actions are no longer "free" outside of tick consumption. Both agents operate under strict economic constraints:

    • Red Budget: Purchasing/Deploying a "0-day exploit" guarantees stealth and success but heavily depletes the financial/compute budget. Noisy attacks (Brute-force) are cheap but trigger the SIEM.
    • Blue Budget (Business Uptime): Executing an IsolateHost or DropSubnetRoute action actively penalizes the Blue agent's reward function by simulating lost business revenue (Business Downtime). This forces Blue to prioritize surgical remediation over blanket network shutdowns.
  • H. Dynamic SOAR & YARA Synthesis:
    The Blue agent's action space is elevated from simple "Block IP" commands to dynamic rule generation. The defender can synthesize and deploy programmatic signatures (e.g., simplified YARA or Snort rules). The environment's physics engine dynamically evaluates the Red agent's subsequent payloads against these newly deployed regex/signature structures, enabling true automated incident response (SOAR) simulation.
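
The rule-evaluation step in item H can be sketched with plain regular expressions. This is a simplified stand-in, not real YARA or Snort syntax and not the engine in this PR; `SignatureEngine`, `deploy`, and `evaluate` are hypothetical names.

```python
import re


class SignatureEngine:
    """Holds Blue's deployed signatures and checks Red payloads against them."""

    def __init__(self):
        self.rules = {}

    def deploy(self, name, pattern):
        """Compile and store a rule. Returns False if the pattern is invalid,
        so a malformed rule is a wasted action instead of a crash."""
        try:
            self.rules[name] = re.compile(pattern)
        except re.error:
            return False
        return True

    def evaluate(self, payload):
        """Return the names of all rules that match the payload text."""
        return [name for name, rx in self.rules.items() if rx.search(payload)]


if __name__ == "__main__":
    engine = SignatureEngine()
    engine.deploy("cred_dump", r"(?i)mimikatz|sekurlsa::")
    print(engine.evaluate("Invoke-Mimikatz -DumpCreds"))
    print(engine.evaluate("ping 10.0.0.1"))
```

An environment built along these lines could block or flag any Red action whose payload returns a non-empty match list on the tick after the rule is deployed.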

…asking, OT/SCADA PLC nodes, Business Downtime economics, and dynamic procedural network padding.
Added OverloadPLC termination rewards, DMZ SpearPhishing bypass, SecurityAwareness mitigation logic, and RAM-seeded Honeytoken Active Deception.
Completely purged legacy MARL configurations. Registered procedurally-generated topologies, Dictionary POMDP observations, and ConflictResolution physics engines mapped securely for Ray executions.
@elprofesoriqo elprofesoriqo merged commit 1e85ece into main Mar 31, 2026
1 check passed