14 changes: 7 additions & 7 deletions README.md
@@ -4,7 +4,7 @@
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)
[![CI](https://img.shields.io/github/actions/workflow/status/DeusData/codebase-memory-mcp/dry-run.yml?label=CI)](https://github.com/DeusData/codebase-memory-mcp/actions/workflows/dry-run.yml)
[![Tests](https://img.shields.io/badge/tests-5604_passing-brightgreen)](https://github.com/DeusData/codebase-memory-mcp)
-[![Languages](https://img.shields.io/badge/languages-158-orange)](https://github.com/DeusData/codebase-memory-mcp)
+[![Languages](https://img.shields.io/badge/languages-159-orange)](https://github.com/DeusData/codebase-memory-mcp)
[![Hybrid LSP](https://img.shields.io/badge/Hybrid_LSP-9_languages-blue)](#hybrid-lsp)
[![Agents](https://img.shields.io/badge/agents-11-purple)](https://github.com/DeusData/codebase-memory-mcp)
[![Pure C](https://img.shields.io/badge/pure_C-zero_dependencies-blue)](https://github.com/DeusData/codebase-memory-mcp)
@@ -16,7 +16,7 @@

**The fastest and most efficient code intelligence engine for AI coding agents.** Full-indexes an average repository in milliseconds, the Linux kernel (28M LOC, 75K files) in 3 minutes. Answers structural queries in under 1ms. Ships as a single static binary for macOS, Linux, and Windows — download, run `install`, done.

-High-quality parsing through [tree-sitter](https://tree-sitter.github.io/tree-sitter/) AST analysis across all 158 languages, enhanced with [**Hybrid LSP** semantic type resolution](#hybrid-lsp) for Python, TypeScript / JavaScript / JSX / TSX, PHP, C#, Go, C, C++, Java, Kotlin, and Rust — producing a persistent knowledge graph of functions, classes, call chains, HTTP routes, and cross-service links. 14 MCP tools. Zero dependencies. Plug and play across 11 coding agents.
+High-quality parsing through [tree-sitter](https://tree-sitter.github.io/tree-sitter/) AST analysis across all 159 languages, enhanced with [**Hybrid LSP** semantic type resolution](#hybrid-lsp) for Python, TypeScript / JavaScript / JSX / TSX, PHP, C#, Go, C, C++, Java, Kotlin, and Rust — producing a persistent knowledge graph of functions, classes, call chains, HTTP routes, and cross-service links. 14 MCP tools. Zero dependencies. Plug and play across 11 coding agents.

> **Research** — The design and benchmarks behind this project are described in the preprint [*Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP*](https://arxiv.org/abs/2603.27277) (arXiv:2603.27277). Evaluated across 31 real-world repositories: 83% answer quality, 10× fewer tokens, 2.1× fewer tool calls vs. file-by-file exploration.

@@ -32,7 +32,7 @@ High-quality parsing through [tree-sitter](https://tree-sitter.github.io/tree-si

- **Extreme indexing speed** — Linux kernel (28M LOC, 75K files) in 3 minutes. RAM-first pipeline: LZ4 compression, in-memory SQLite, fused Aho-Corasick pattern matching. Memory released after indexing.
- **Plug and play** — single static binary for macOS (arm64/amd64), Linux (arm64/amd64), and Windows (amd64). No Docker, no runtime dependencies, no API keys. Download → `install` → restart agent → done.
-- **158 languages** — vendored tree-sitter grammars compiled into the binary. Nothing to install, nothing that breaks.
+- **159 languages** — vendored tree-sitter grammars compiled into the binary. Nothing to install, nothing that breaks.
- **120x fewer tokens** — 5 structural queries: ~3,400 tokens vs ~412,000 via file-by-file search. One graph query replaces dozens of grep/read cycles.
- **11 agents, one command** — `install` auto-detects Claude Code, Codex CLI, Gemini CLI, Zed, OpenCode, Antigravity, Aider, KiloCode, VS Code, OpenClaw, and Kiro — configures MCP entries, instruction files, and pre-tool hooks for each.
- **Built-in graph visualization** — 3D interactive UI at `localhost:9749` (optional UI binary variant).
@@ -168,7 +168,7 @@ Removes all agent configs, skills, hooks, and instructions. Does not remove the
- `SEMANTICALLY_RELATED` (vocabulary-mismatch, same-language, score ≥ 0.80)

### Indexing pipeline
-- **158 vendored tree-sitter grammars** compiled into the binary
+- **159 vendored tree-sitter grammars** compiled into the binary
- **Generic package / module resolution** — bare specifiers like `@myorg/pkg`, `github.com/foo/bar`, `use my_crate::foo` resolved via manifest scanning (`package.json`, `go.mod`, `Cargo.toml`, `pyproject.toml`, `composer.json`, `pubspec.yaml`, `pom.xml`, `build.gradle`, `mix.exs`, `*.gemspec`)
- **Infrastructure-as-code indexing** — Dockerfiles, Kubernetes manifests, Kustomize overlays as graph nodes
- **[Hybrid LSP semantic type resolution](#hybrid-lsp)** for Python, TypeScript / JavaScript / JSX / TSX, PHP, C#, Go, C, C++, Java, Kotlin, and Rust — a lightweight C implementation of language type-resolution algorithms, structurally inspired by and compatible with major language servers including tsserver / typescript-go, pyright, gopls, Roslyn, Eclipse JDT, and rust-analyzer (parameter binding, return-type inference, generic substitution, JSX component dispatch, JSDoc inference for plain JS files, namespace + trait + late-static-binding resolution for PHP, file-scoped namespaces + records + LINQ method syntax for C#, class-hierarchy + overload + lambda resolution for Java, extension-function + scope-function resolution for Kotlin, trait-method + UFCS resolution for Rust)
@@ -508,14 +508,14 @@ codebase-memory-mcp ships a **lightweight C implementation of language type-reso

**Two-layer architecture:**

-1. **Tree-sitter pass** — fast, syntactic, runs for every one of the 158 languages. Extracts definitions, calls, imports.
+1. **Tree-sitter pass** — fast, syntactic, runs for every one of the 159 languages. Extracts definitions, calls, imports.
2. **Hybrid LSP pass** — type-aware, runs above the tree-sitter pass per-language. Refines call edges using the import graph plus a per-file or pre-built cross-file definition registry. Languages without a Hybrid LSP pass yet fall back to textual resolution, so you always get *some* answer.

The result is a knowledge graph accurate enough to drive `trace_path` across packages, inheritance hierarchies, and stdlib calls — without paying for a language server process per project.

## Language Support

-158 languages, all parsed via vendored tree-sitter grammars compiled into the binary. Benchmarked against 64 real open-source repositories (78 to 49K nodes):
+159 languages, all parsed via vendored tree-sitter grammars compiled into the binary. Benchmarked against 64 real open-source repositories (78 to 49K nodes):

| Tier | Score | Languages |
|------|-------|-----------|
@@ -540,7 +540,7 @@ src/
traces/ Runtime trace ingestion
ui/ Embedded HTTP server + 3D graph visualization
foundation/ Platform abstractions (threads, filesystem, logging, memory)
-internal/cbm/ Vendored tree-sitter grammars (158 languages) + AST extraction engine
+internal/cbm/ Vendored tree-sitter grammars (159 languages) + AST extraction engine
```

## Security
2 changes: 1 addition & 1 deletion THIRD_PARTY.md
@@ -26,7 +26,7 @@ The core runtime headers in `internal/cbm/vendored/common/tree_sitter/`

## Tree-sitter Grammars

-158 pre-generated parsers are vendored in `internal/cbm/vendored/grammars/<lang>/`
+159 pre-generated parsers are vendored in `internal/cbm/vendored/grammars/<lang>/`
(generated `parser.c` plus `scanner.c` where applicable, compiled statically).
Each grammar is the work of its upstream authors and each grammar directory
contains the upstream `LICENSE` file.
1 change: 1 addition & 0 deletions internal/cbm/cbm.h
@@ -170,6 +170,7 @@ typedef enum {
CBM_LANG_QML, // Qt QML (Qt Modeling Language — declarative UI + embedded JS)
CBM_LANG_CFSCRIPT, // CFML script dialect (.cfc components — Lucee/ColdFusion)
CBM_LANG_CFML, // CFML tag dialect (.cfm templates — Lucee/ColdFusion)
CBM_LANG_MOJO, // Mojo (Modular — Python-superset systems language; .mojo / .🔥)
CBM_LANG_COUNT
} CBMLanguage;
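
Review note: the new enum comment lists two file extensions for Mojo, `.mojo` and `.🔥`. How the project maps extensions to `CBMLanguage` values is not part of this diff, so the following is only a minimal sketch of one way such a lookup could work. `DemoLanguage`, `demo_ext_table`, and `demo_lang_for_path` are hypothetical names, not identifiers from the codebase. The point it illustrates is that the emoji extension is an ordinary UTF-8 byte sequence, so a plain `strcmp` handles it without special casing.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the real CBMLanguage enum in cbm.h. */
typedef enum { DEMO_LANG_UNKNOWN = 0, DEMO_LANG_PYTHON, DEMO_LANG_MOJO } DemoLanguage;

typedef struct {
    const char *ext;   /* extension including the leading dot, UTF-8 encoded */
    DemoLanguage lang;
} DemoExtEntry;

/* Assumed table; the real project may store this mapping differently. */
static const DemoExtEntry demo_ext_table[] = {
    {".py", DEMO_LANG_PYTHON},
    {".mojo", DEMO_LANG_MOJO},
    {".\xF0\x9F\x94\xA5", DEMO_LANG_MOJO}, /* ".🔥" (U+1F525) as UTF-8 bytes */
};

/* Return the language for a path based on the text after its last dot. */
static DemoLanguage demo_lang_for_path(const char *path) {
    const char *dot = strrchr(path, '.');
    if (dot == NULL) {
        return DEMO_LANG_UNKNOWN;
    }
    size_t n = sizeof demo_ext_table / sizeof demo_ext_table[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(dot, demo_ext_table[i].ext) == 0) {
            return demo_ext_table[i].lang;
        }
    }
    return DEMO_LANG_UNKNOWN;
}
```

Writing the emoji as explicit byte escapes keeps the source file ASCII-only, which avoids depending on the compiler's source character set.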

4 changes: 4 additions & 0 deletions internal/cbm/grammar_mojo.c
@@ -0,0 +1,4 @@
// Vendored tree-sitter grammar: mojo
// Each grammar compiled as separate unit (conflicting static symbols).
#include "vendored/grammars/mojo/parser.c"
#include "vendored/grammars/mojo/scanner.c"
19 changes: 19 additions & 0 deletions internal/cbm/lang_specs.c
@@ -164,6 +164,7 @@ extern const TSLanguage *tree_sitter_apex(void);
extern const TSLanguage *tree_sitter_soql(void);
extern const TSLanguage *tree_sitter_sosl(void);
extern const TSLanguage *tree_sitter_pine(void);
extern const TSLanguage *tree_sitter_mojo(void);

// -- Empty sentinel --
static const char *empty_types[] = {NULL};
@@ -205,6 +206,18 @@ static const char *py_var_types[] = {"assignment", "augmented_assignment", NULL}
static const char *py_throw_types[] = {"raise_statement", NULL};
static const char *py_decorator_types[] = {"decorator", NULL};

// ==================== MOJO ====================
// Mojo (Modular) is a Python superset; the lsh/tree-sitter-mojo grammar is
// forked from tree-sitter-python, so every node type mirrors Python exactly
// EXCEPT the class array — Mojo's "struct"/"class" both parse as
// class_definition, but "trait" and the "__extension" form get their own
// nodes. So the spec reuses the py_* arrays and overrides only the class
// types. ("fn"/"def" both parse as function_definition; compile-time
// "alias NAME = value" has no dedicated node and is recovered as an
// `assignment`, so it falls under py_var_types like ordinary `var` fields.)
static const char *mojo_class_types[] = {"class_definition", "trait_definition",
"extension_definition", NULL};

// ==================== JAVASCRIPT ====================
static const char *js_func_types[] = {"function_declaration", "generator_function_declaration",
"function_expression", "arrow_function",
@@ -2005,6 +2018,12 @@ static const CBMLangSpec lang_specs[CBM_LANG_COUNT] = {
empty_types, empty_types, NULL, empty_types, NULL, NULL, tree_sitter_cfml,
NULL},

// CBM_LANG_MOJO (Python-derived; reuses py_* arrays, only class types differ)
[CBM_LANG_MOJO] = {CBM_LANG_MOJO, py_func_types, mojo_class_types, empty_types, py_module_types,
py_call_types, py_import_types, py_import_from_types, py_branch_types,
py_var_types, py_var_types, py_throw_types, NULL, py_decorator_types,
py_env_funcs, py_env_members, tree_sitter_mojo, NULL},

// CBM_LANG_GLEAM
[CBM_LANG_GLEAM] = {CBM_LANG_GLEAM, gleam_func_types, gleam_class_types, gleam_field_types,
gleam_module_types, gleam_call_types, gleam_import_types, empty_types,
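
Review note: the Mojo spec reuses the `py_*` arrays and overrides only the class-type list. A small sketch may help show the practical effect of that override. It assumes that the extractor classifies a node by looking up its tree-sitter type name in a NULL-terminated string array, and that the Python class list contains only `class_definition`. Neither assumption is confirmed by this diff, and `demo_type_in_list` and the `demo_*` arrays are hypothetical names used for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Same shape as the arrays in lang_specs.c: NULL-terminated node-type names.
 * The Python list here is an assumption; the Mojo list mirrors the diff. */
static const char *const demo_py_class_types[] = {"class_definition", NULL};
static const char *const demo_mojo_class_types[] = {"class_definition", "trait_definition",
                                                    "extension_definition", NULL};

/* Hypothetical helper: true if node_type appears in a NULL-terminated list. */
static bool demo_type_in_list(const char *const *types, const char *node_type) {
    for (size_t i = 0; types[i] != NULL; i++) {
        if (strcmp(types[i], node_type) == 0) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions, a `trait_definition` node would be recognized as a class-like definition with the Mojo list but ignored with the Python list, which is why the override is needed even though every other array is shared.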
1 change: 1 addition & 0 deletions internal/cbm/vendored/grammars/MANIFEST.md
@@ -130,6 +130,7 @@ Guarded by the `contract_all_grammars_in_graph` graph-breadth test in
| matlab | 15 | acristoffers/tree-sitter-matlab | `c2390a59016f` | VERIFIED-BOTH | ✅ |
| mermaid | 14 | monaqa/tree-sitter-mermaid | `90ae195b3193` | VERIFIED-BOTH | ✅ |
| meson | 15 | tree-sitter-grammars/tree-sitter-meson | `c84f3540624b` | VERIFIED-BOTH | ✅ |
| mojo | 15 | lsh/tree-sitter-mojo | `33193a99afe6` | UNVERIFIED (community; not in nvim-treesitter/Helix) | ✅ |
| nasm | 14 | naclsn/tree-sitter-nasm | `d1b3638d017f` | VERIFIED-BOTH | ✅ |
| nickel | 15 | nickel-lang/tree-sitter-nickel | `b5b6cc3bc7b9` | VERIFIED-BOTH | ✅ |
| nix | 13 | nix-community/tree-sitter-nix | `eabf96807ea4` | VERIFIED-BOTH | ✅ |
21 changes: 21 additions & 0 deletions internal/cbm/vendored/grammars/mojo/LICENSE
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2016 Max Brunsfeld

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.