Feat/nlp siem embeddings #5

Merged: elprofesoriqo merged 11 commits into main from feat/NLP-SIEM-Embeddings on Apr 1, 2026
@elprofesoriqo
Collaborator

The Implementation

  • NLP for RL: we modify the environment's siem_logger to stream human-readable log strings instead of integer codes.
    • Example: instead of returning 5, it returns "[AuthLog] User SYSTEM executed kernel32.dll on 10.0.0.4. Confidence: Medium."
  • Dual-architecture training: the strings pass through a frozen, lightweight SentenceTransformer that feeds our PyTorch LSTMs. The RL agent learns to read raw log text by encoding it into dense vectors before its policy selects an action.
  • Result: a Cyber AI whose observation space is log text rather than internal Python variables, which should make it much easier to connect to a real enterprise Splunk instance.
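A minimal sketch of the first bullet, for illustration only. The `SiemEvent` fields and the `format_siem_event` helper are hypothetical names; the actual siem_logger interface is not shown in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiemEvent:
    """Hypothetical structured event; field names are illustrative only."""
    source: str
    user: str
    image: str
    host: str
    confidence: str


def format_siem_event(event: SiemEvent) -> str:
    """Render an event as a human-readable log line instead of an integer code."""
    return (
        f"[{event.source}] User {event.user} executed {event.image} "
        f"on {event.host}. Confidence: {event.confidence}."
    )


line = format_siem_event(
    SiemEvent("AuthLog", "SYSTEM", "kernel32.dll", "10.0.0.4", "Medium")
)
print(line)
```

The point of the string form is that the same text can later be fed to any encoder without changing the logger.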

Modified can_route_to to reject Secure subnet traversals entirely unless Red explicitly populates its agent_inventory hash lists via the new AD memory objects.
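The routing rule could look roughly like the sketch below. The `Subnet` and `RedAgent` classes, the `accepted_hashes` field, and the hash string format are assumptions made for illustration; only `can_route_to` and `agent_inventory` are names from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Subnet:
    name: str
    secure: bool = False
    # Hypothetical: credential hashes that unlock routing into this subnet.
    accepted_hashes: frozenset = frozenset()


@dataclass
class RedAgent:
    # Hypothetical stand-in for the agent_inventory hash list.
    agent_inventory: set = field(default_factory=set)


def can_route_to(agent: RedAgent, target: Subnet) -> bool:
    """Allow traversal into a Secure subnet only with a matching harvested hash."""
    if not target.secure:
        return True
    return bool(agent.agent_inventory & target.accepted_hashes)


red = RedAgent()
secure = Subnet("Secure", secure=True, accepted_hashes=frozenset({"ntlm:abc123"}))
blocked = can_route_to(red, secure)       # no hash harvested yet
red.agent_inventory.add("ntlm:abc123")    # e.g. after a credential-dumping action
allowed = can_route_to(red, secure)
```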
Added a global RotateKerberos loop that neutralizes PassTheTicket hashes. Verified end-to-end via a local Python script.
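One way to picture the rotation, as a sketch under stated assumptions: tickets are modelled as `(ticket_id, key_version)` tuples, and `global_state` is a plain dict with `krbtgt_version` and `agent_inventories` keys. None of these details come from the PR itself.

```python
def rotate_kerberos(global_state: dict) -> int:
    """Bump the key version and drop every ticket issued under an older one.

    Returns the number of invalidated tickets.
    """
    global_state["krbtgt_version"] += 1
    current = global_state["krbtgt_version"]
    invalidated = 0
    for inventory in global_state["agent_inventories"].values():
        # A ticket is stale once its key version is behind the current one.
        stale = {ticket for ticket in inventory if ticket[1] < current}
        inventory -= stale  # in-place removal from the agent's set
        invalidated += len(stale)
    return invalidated


state = {
    "krbtgt_version": 1,
    "agent_inventories": {"red_0": {("TGT-alice", 1), ("TGT-bob", 1)}},
}
removed = rotate_kerberos(state)
```

After the call, any PassTheTicket attempt that relies on the removed tickets has nothing left to present.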
Adds dual-mode hypervisor system (MockHypervisor for training, DockerHypervisor for evaluation). Includes Sim2RealBridge, HypervisorResult dataclass, and curated payload_library.json with 30+ real Metasploit stdout samples.
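A rough sketch of how the pieces could fit together. The `success` field, the `run` method, the `{target}` placeholder convention, and the dict-shaped payload library are assumptions; the real payload_library.json schema is not shown here, and the stdout string below is a placeholder rather than real Metasploit output.

```python
import random
from dataclasses import dataclass


@dataclass
class HypervisorResult:
    # Fields other than stdout and reward_delta are assumptions.
    success: bool
    stdout: str
    reward_delta: float


class MockHypervisor:
    """Training backend: replays canned stdout instead of running an exploit."""

    def __init__(self, payload_library: dict, seed: int = 0):
        self._library = payload_library
        self._rng = random.Random(seed)

    def run(self, action: str, target: str) -> HypervisorResult:
        samples = self._library.get(action, [])
        if not samples:
            return HypervisorResult(False, "", 0.0)
        stdout = self._rng.choice(samples).replace("{target}", target)
        return HypervisorResult(True, stdout, 1.0)

    def teardown_all(self) -> None:
        pass  # nothing to clean up in simulation


class Sim2RealBridge:
    """Routes dispatch() calls to the backend selected by the mode string."""

    def __init__(self, mode: str, backends: dict):
        if mode not in backends:
            raise ValueError(f"unknown sim2real_mode: {mode!r}")
        self._backend = backends[mode]

    def dispatch(self, action: str, target: str) -> HypervisorResult:
        return self._backend.run(action, target)

    def teardown_all(self) -> None:
        self._backend.teardown_all()


library = {"ExploitBlueKeep": ["[placeholder stdout for {target}]"]}
bridge = Sim2RealBridge("sim", {"sim": MockHypervisor(library)})
result = bridge.dispatch("ExploitBlueKeep", "10.0.0.4")
```

A Docker-backed class exposing the same `run` and `teardown_all` methods could be registered under `"real"` without touching the callers.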
ExploitRemoteService, ExploitBlueKeep, ExploitHTTP_RFI now call bridge.dispatch() when available on global_state. HypervisorResult stdout and reward_delta attached to observation_data for SIEM pipeline.
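The "when available" behaviour amounts to an optional-attribute check. In the sketch below, `execute_exploit` and `StubBridge` are hypothetical helpers, and the result object only needs `stdout` and `reward_delta` attributes.

```python
from types import SimpleNamespace


class StubBridge:
    """Hypothetical bridge used only to make this example self-contained."""

    def dispatch(self, action_name, target):
        return SimpleNamespace(stdout=f"{action_name} -> {target}", reward_delta=0.5)


def execute_exploit(action_name, target, global_state, observation_data):
    """Use the bridge when global_state has one; otherwise leave the data as-is."""
    bridge = getattr(global_state, "bridge", None)
    if bridge is None:
        return observation_data  # pure simulation path, unchanged
    result = bridge.dispatch(action_name, target)
    # Attach hypervisor output so the SIEM pipeline can log it later.
    observation_data["stdout"] = result.stdout
    observation_data["reward_delta"] = result.reward_delta
    return observation_data


with_bridge = execute_exploit(
    "ExploitHTTP_RFI", "10.0.0.7", SimpleNamespace(bridge=StubBridge()), {"success": True}
)
without_bridge = execute_exploit(
    "ExploitHTTP_RFI", "10.0.0.7", SimpleNamespace(), {"success": True}
)
```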
The environment now accepts sim2real_mode='sim'|'real' from scenario_config. It attaches the bridge to global_state on init and on episode reset, and calls teardown_all() between episodes.
New siem/ package. SIEMLogger converts action effects into authentic Windows Event ID + Sysmon XML strings (4624, 4625, 4648, 4688, 4768, 4776, Sysmon 1/3/10/22). Probabilistic detection with background noise injection.
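A simplified sketch of probabilistic detection with noise injection. The event IDs are real Windows Security IDs, but the action-to-event mapping, the constructor parameters, and the trimmed XML rendering are assumptions; real event XML carries many more fields.

```python
import random

# Real Windows Security event IDs; the action-to-event mapping is an assumption.
ACTION_EVENTS = {
    "login_success": 4624,  # an account was successfully logged on
    "login_failure": 4625,  # an account failed to log on
    "process_start": 4688,  # a new process has been created
    "tgt_request": 4768,    # a Kerberos TGT was requested
}
NOISE_EVENTS = [4624, 4688]  # benign background activity


def render(event_id: int, host: str) -> str:
    """Simplified XML-like rendering of one event."""
    return (
        f"<Event><System><EventID>{event_id}</EventID>"
        f"<Computer>{host}</Computer></System></Event>"
    )


class SIEMLogger:
    def __init__(self, detection_prob=0.7, noise_per_tick=2, seed=0):
        self.detection_prob = detection_prob
        self.noise_per_tick = noise_per_tick
        self._rng = random.Random(seed)

    def log_tick(self, effects, hosts):
        """Return detected action events mixed with background noise."""
        lines = [
            render(ACTION_EVENTS[kind], host)
            for kind, host in effects
            if self._rng.random() < self.detection_prob  # missed otherwise
        ]
        lines += [
            render(self._rng.choice(NOISE_EVENTS), self._rng.choice(hosts))
            for _ in range(self.noise_per_tick)
        ]
        self._rng.shuffle(lines)  # do not leak which lines are noise by position
        return lines


hosts = ["WS01", "DC01"]
always = SIEMLogger(detection_prob=1.0).log_tick([("login_failure", "DC01")], hosts)
never = SIEMLogger(detection_prob=0.0).log_tick([("login_failure", "DC01")], hosts)
```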
New nlp/ package. LogEncoder encodes raw SIEM strings to 128-dim float32 vectors. Default: scikit-learn TF-IDF + LSA (zero extra deps, fast training). Optional: sentence-transformers all-MiniLM-L6-v2 + random projection. LRU embedding cache prevents re-encoding.
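The default TF-IDF + LSA path can be approximated with scikit-learn as below. This is a sketch, not the code in nlp/: the constructor arguments, the zero-padding to 128 dimensions when the corpus supports fewer LSA components, and the use of `functools.lru_cache` are all assumptions.

```python
from functools import lru_cache

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


class LogEncoder:
    """TF-IDF + LSA (truncated SVD) text encoder with a zero-padded fixed width."""

    def __init__(self, dim: int = 128, cache_size: int = 4096):
        self.dim = dim
        self._tfidf = TfidfVectorizer(lowercase=True)
        self._svd = None
        # Per-instance LRU cache keyed on the raw log string.
        self._cached = lru_cache(maxsize=cache_size)(self._encode_uncached)

    def fit(self, corpus):
        matrix = self._tfidf.fit_transform(corpus)
        # LSA cannot yield more components than the vocabulary supports.
        k = max(1, min(self.dim, matrix.shape[1] - 1))
        self._svd = TruncatedSVD(n_components=k, random_state=0).fit(matrix)
        self._cached.cache_clear()  # old vectors are invalid after refitting
        return self

    def _encode_uncached(self, text: str) -> np.ndarray:
        reduced = self._svd.transform(self._tfidf.transform([text]))[0]
        out = np.zeros(self.dim, dtype=np.float32)
        out[: reduced.shape[0]] = reduced  # pad the tail with zeros
        return out

    def encode(self, text: str) -> np.ndarray:
        return self._cached(text)


corpus = [
    "EventID 4625 failed logon for user alice on DC01",
    "EventID 4624 successful logon for user bob on WS01",
    "EventID 4688 new process cmd.exe created on WS01",
    "EventID 4768 kerberos ticket requested by alice",
]
encoder = LogEncoder().fit(corpus)
vector = encoder.encode(corpus[0])
```

Because the cache returns the same array object on a hit, callers should treat the result as read-only or copy it before mutating.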
Observation space extended with siem_embedding: Box(128,). Blue agents receive live encoded SIEM vectors; Red agents receive zeros. SIEMLogger logs every resolved action effect + background noise per tick. LogEncoder fits TF-IDF on payload library + event template corpus at env init.
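The Blue/Red split described above can be expressed as a small masking step. The helper name `siem_observation` and the team strings are hypothetical, and how multiple log lines per tick are combined into one vector is not specified here, so the sketch takes an already-encoded vector as input.

```python
import numpy as np

EMBED_DIM = 128


def siem_observation(team: str, embedding: np.ndarray) -> np.ndarray:
    """Blue sees the encoded SIEM vector; Red gets an all-zero placeholder."""
    if embedding.shape != (EMBED_DIM,):
        raise ValueError(f"expected shape ({EMBED_DIM},), got {embedding.shape}")
    if team == "blue":
        return embedding.astype(np.float32, copy=False)
    # Same shape and dtype for Red, so both teams share one observation space.
    return np.zeros(EMBED_DIM, dtype=np.float32)


encoded = np.full(EMBED_DIM, 0.25, dtype=np.float32)
blue_obs = siem_observation("blue", encoded)
red_obs = siem_observation("red", encoded)
```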
elprofesoriqo merged commit 50c975b into main on Apr 1, 2026
1 check passed