Commit 1e519a2

feat: persist embedding metadata
- add schema versioning and migrate vault/cache metadata columns
- store n_tokens and truncated on vault rows and cache entries
- assert metadata persistence in unittest coverage
- document the remote parser refactor plan for later implementation
1 parent 735dd09 commit 1e519a2

4 files changed

Lines changed: 449 additions & 10 deletions

Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
# Remote Embedding Parser Refactor Plan

This note captures the preferred implementation plan for later work. The goal is
to make vectors.space response parsing directly testable without turning the
parser into public API or giving production logic a test-only name.

## Goal

Refactor `src/dbmem-rembed.c` so the JSON response parsing currently embedded in
`dbmem_remote_compute_embedding()` lives in a reusable internal function:

```c
int dbmem_remote_parse_embedding_response(...);
```

The function should be production logic, used by `dbmem_remote_compute_embedding()`
and also callable from `test/unittest.c` via a manual forward declaration.

Do not expose it in `sqlite-memory.h` or another public header.

## Preferred Shape

Use a normal internal production name, not a test-only name:

```c
int dbmem_remote_parse_embedding_response(
    const char *json,
    size_t json_len,
    float **embedding,
    size_t *embedding_capacity,
    jsmntok_t **tokens,
    int *tokens_capacity,
    embedding_result_t *result,
    char *err_msg,
    size_t err_msg_len
);
```

This keeps ownership explicit while avoiding exposure of `dbmem_remote_engine_t`
or a new parser-state struct in test code.

## Production Usage

`dbmem_remote_compute_embedding()` keeps responsibility for:

- request construction
- HTTP transport
- HTTP status handling
- context error propagation
- aggregate remote-engine stats

After receiving a successful HTTP 200 response, it calls:

```c
char err_msg[DBMEM_ERRBUF_SIZE] = {0};
int rc = dbmem_remote_parse_embedding_response(
    engine->data,
    engine->data_size,
    &engine->embedding,
    &engine->embedding_capacity,
    &engine->tokens,
    &engine->tokens_capacity,
    result,
    err_msg,
    sizeof(err_msg)
);

if (rc != 0) {
    dbmem_context_set_error(engine->context, err_msg);
    return -1;
}

engine->total_tokens_processed += result->n_tokens;
engine->total_embeddings_generated++;
return 0;
```

## Parser Responsibility

`dbmem_remote_parse_embedding_response()` should own:

- parsing JSON with `jsmn`
- allocating/growing the token buffer
- locating top-level `output_dimension`
- locating `data[0].embedding`
- allocating/growing the embedding buffer
- parsing embedding floats
- reading `data[0].truncated`
- reading token metadata from `usage`
- filling `embedding_result_t`

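The buffer-growing items in the list above could take roughly the following shape. This is a minimal sketch, not the project's code: `embedding_buffer_reserve` is a hypothetical helper name, and it calls the standard `realloc` where the real parser would presumably use the project's own allocator (the one paired with `dbmemory_free`).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: make sure *buf can hold at least `needed` floats.
 * Returns 0 on success and -1 on failure. On failure the caller's buffer
 * and capacity are left untouched, so ownership stays with the caller. */
int embedding_buffer_reserve(float **buf, size_t *capacity, size_t needed) {
    assert(buf != NULL && capacity != NULL);

    if (needed <= *capacity) return 0;                 /* already large enough */
    if (needed > SIZE_MAX / sizeof(float)) return -1;  /* byte count would overflow */

    /* Assumption: plain realloc stands in for the project allocator. */
    float *grown = realloc(*buf, needed * sizeof(float));
    if (grown == NULL) return -1;

    *buf = grown;
    *capacity = needed;
    return 0;
}
```

Because the pointer and its capacity are passed by address, the same call works for the engine-owned buffers in production and for the local `NULL`/`0` pair a unit test starts with.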
Token count priority should remain:

1. `usage.exact_prompt_tokens`
2. `usage.estimated_prompt_tokens`
3. `usage.prompt_tokens`
4. `0` if none are present

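The priority order above can be sketched as a small pure function. The struct and function names here (`usage_counts_t`, `usage_pick_n_tokens`) are hypothetical and exist only for illustration; the real parser would read these values from the `jsmn` tokens of the `usage` object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical holder for the three optional counters found in `usage`.
 * Each has_* flag records whether the key was present in the response. */
typedef struct {
    bool has_exact;
    int  exact_prompt_tokens;
    bool has_estimated;
    int  estimated_prompt_tokens;
    bool has_prompt;
    int  prompt_tokens;
} usage_counts_t;

/* Apply the documented priority: exact, then estimated, then prompt, else 0. */
int usage_pick_n_tokens(const usage_counts_t *u) {
    assert(u != NULL);
    if (u->has_exact)     return u->exact_prompt_tokens;
    if (u->has_estimated) return u->estimated_prompt_tokens;
    if (u->has_prompt)    return u->prompt_tokens;
    return 0;
}
```

Tracking presence separately from the value lets a response that reports an exact count of `0` still win over the lower-priority fields.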
## Unit Test Usage

`test/unittest.c` can manually forward-declare the function under the relevant
test guards:

```c
#if defined(TEST_SQLITE_EXTENSION) && !defined(DBMEM_OMIT_REMOTE_ENGINE)
int dbmem_remote_parse_embedding_response(
    const char *json,
    size_t json_len,
    float **embedding,
    size_t *embedding_capacity,
    jsmntok_t **tokens,
    int *tokens_capacity,
    embedding_result_t *result,
    char *err_msg,
    size_t err_msg_len
);
#endif
```

Tests create local buffers:

```c
float *embedding = NULL;
size_t embedding_capacity = 0;
jsmntok_t *tokens = NULL;
int tokens_capacity = 0;
embedding_result_t result = {0};
char err_msg[1024] = {0};
```

Then call the parser with static JSON fixtures and free the buffers afterward:

```c
dbmemory_free(embedding);
dbmemory_free(tokens);
```

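A static fixture and a float comparison helper for such tests might look like the sketch below. The JSON shape is an assumption inferred from the field names in this plan (`output_dimension`, `data[0].embedding`, `data[0].truncated`, `usage.*`), not a captured vectors.space response, and `FIXTURE_EXACT_TOKENS`, `fixture_exact_tokens_len`, and `floats_nearly_equal` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixture: all three usage counters are present, so a parser
 * following the plan's priority order should report n_tokens == 5. */
static const char FIXTURE_EXACT_TOKENS[] =
    "{"
      "\"output_dimension\":3,"
      "\"data\":[{\"embedding\":[0.25,-0.5,1.0],\"truncated\":false}],"
      "\"usage\":{\"exact_prompt_tokens\":5,"
                 "\"estimated_prompt_tokens\":6,"
                 "\"prompt_tokens\":7}"
    "}";

/* sizeof counts the terminating NUL; the parser takes the length without it. */
size_t fixture_exact_tokens_len(void) {
    return sizeof(FIXTURE_EXACT_TOKENS) - 1;
}

/* Compare two float arrays element by element within an absolute tolerance. */
bool floats_nearly_equal(const float *a, const float *b, size_t n, float eps) {
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d < 0) d = -d;
        if (d > eps) return false;
    }
    return true;
}
```

The fixture values were chosen to be exactly representable as `float`, which keeps the comparison deterministic even with a very small tolerance.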
## Fixture Tests To Add Later

Recommended deterministic cases:

- exact token count is preferred over estimated and prompt token counts
- estimated token count is used when exact token count is absent
- prompt token count is used when exact and estimated token counts are absent
- missing usage object leaves `result.n_tokens == 0`
- `data[0].truncated: false` maps to `result.truncated == false`
- `data[0].truncated: true` maps to `result.truncated == true`
- embedding float array is parsed correctly
- output dimension is parsed correctly
- missing `data`
- missing `embedding`
- empty embedding array
- invalid top-level response shape

Also decide whether the parser should reject mismatches between
`output_dimension` and the embedding array length. Failing fast is likely safer,
because a dimension mismatch can break later vector initialization/search.

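If the fail-fast option is chosen, the check could be as small as the sketch below. It assumes the parser has already counted the parsed floats; `embedding_check_dimension` is a hypothetical name, and the error texts are illustrative rather than taken from the codebase.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical check: reject an empty embedding or a length that disagrees
 * with the reported output_dimension. Returns 0 when consistent, -1 otherwise,
 * and writes a short reason into err_msg when a buffer is provided. */
int embedding_check_dimension(size_t output_dimension, size_t parsed_values,
                              char *err_msg, size_t err_msg_len) {
    assert(err_msg != NULL || err_msg_len == 0);

    if (parsed_values == 0) {
        if (err_msg_len > 0)
            snprintf(err_msg, err_msg_len, "embedding array is empty");
        return -1;
    }
    if (output_dimension != parsed_values) {
        if (err_msg_len > 0)
            snprintf(err_msg, err_msg_len,
                     "output_dimension is %zu but embedding has %zu values",
                     output_dimension, parsed_values);
        return -1;
    }
    return 0;
}
```

Returning the reason through `err_msg` matches the parser signature proposed earlier in this plan, so the caller can forward it to the context error unchanged.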
## Why This Plan

This approach avoids:

- live network dependence for parser correctness tests
- exposing parser internals as public API
- duplicating parser behavior in test-only code
- coupling tests to `dbmem_remote_engine_t`
- adding a new internal header before it is needed

The e2e test discussion can proceed separately, especially around whether token
metadata should become persisted product state or remain parser-only metadata.

src/sqlite-memory.c

Lines changed: 120 additions & 7 deletions
@@ -60,6 +60,9 @@ SQLITE_EXTENSION_INIT1
 #define DBMEM_SETTINGS_KEY_EMBEDDING_CACHE "embedding_cache"
 #define DBMEM_SETTINGS_KEY_CACHE_MAX_ENTRIES "cache_max_entries"
 #define DBMEM_SETTINGS_KEY_SEARCH_OVERSAMPLE "search_oversample"
+#define DBMEM_SETTINGS_KEY_SCHEMA_VERSION "schema_version"
+
+#define DBMEM_SCHEMA_VERSION 2

 // default values from https://docs.openclaw.ai/concepts/memory
 #define DEFAULT_CHARS_PER_TOKEN 4 // Approximate number of characters per token (GPT ≈ 4, Claude ≈ 3.5)
@@ -358,6 +361,105 @@ void dbmem_settings_load (sqlite3 *db, dbmem_context *ctx) {

 // MARK: - Database -

+static bool dbmem_database_column_exists (sqlite3 *db, const char *table, const char *column, int *out_rc) {
+    char sql[256];
+    snprintf(sql, sizeof(sql), "PRAGMA table_info(%s);", table);
+
+    sqlite3_stmt *vm = NULL;
+    int rc = sqlite3_prepare_v2(db, sql, -1, &vm, NULL);
+    if (rc != SQLITE_OK) {
+        if (out_rc) *out_rc = rc;
+        return false;
+    }
+
+    bool exists = false;
+    while ((rc = sqlite3_step(vm)) == SQLITE_ROW) {
+        const char *name = (const char *)sqlite3_column_text(vm, 1);
+        if (name && strcmp(name, column) == 0) {
+            exists = true;
+            break;
+        }
+    }
+
+    if (rc == SQLITE_DONE || rc == SQLITE_ROW) rc = SQLITE_OK;
+    sqlite3_finalize(vm);
+    if (out_rc) *out_rc = rc;
+    return exists;
+}
+
+static int dbmem_database_add_column_if_missing (sqlite3 *db, const char *table, const char *column, const char *alter_sql) {
+    int rc = SQLITE_OK;
+    if (dbmem_database_column_exists(db, table, column, &rc)) return SQLITE_OK;
+    if (rc != SQLITE_OK) return rc;
+    return sqlite3_exec(db, alter_sql, NULL, NULL, NULL);
+}
+
+static int dbmem_database_schema_version (sqlite3 *db, int *version) {
+    static const char *sql = "SELECT value FROM dbmem_settings WHERE key=?1 LIMIT 1;";
+
+    *version = 0;
+
+    sqlite3_stmt *vm = NULL;
+    int rc = sqlite3_prepare_v2(db, sql, -1, &vm, NULL);
+    if (rc != SQLITE_OK) goto cleanup;
+
+    rc = sqlite3_bind_text(vm, 1, DBMEM_SETTINGS_KEY_SCHEMA_VERSION, -1, SQLITE_STATIC);
+    if (rc != SQLITE_OK) goto cleanup;
+
+    rc = sqlite3_step(vm);
+    if (rc == SQLITE_ROW) {
+        *version = sqlite3_column_int(vm, 0);
+        rc = SQLITE_OK;
+    } else if (rc == SQLITE_DONE) {
+        rc = SQLITE_OK;
+    }
+
+cleanup:
+    if (vm) sqlite3_finalize(vm);
+    return rc;
+}
+
+static int dbmem_database_set_schema_version (sqlite3 *db, int version) {
+    return dbmem_settings_write_int(db, DBMEM_SETTINGS_KEY_SCHEMA_VERSION, version);
+}
+
+static int dbmem_database_migrate_v1_to_v2 (sqlite3 *db) {
+    int rc = dbmem_database_add_column_if_missing(db, "dbmem_vault", "n_tokens",
+        "ALTER TABLE dbmem_vault ADD COLUMN n_tokens INTEGER NOT NULL DEFAULT 0;");
+    if (rc != SQLITE_OK) return rc;
+
+    rc = dbmem_database_add_column_if_missing(db, "dbmem_vault", "truncated",
+        "ALTER TABLE dbmem_vault ADD COLUMN truncated INTEGER NOT NULL DEFAULT 0;");
+    if (rc != SQLITE_OK) return rc;
+
+    rc = dbmem_database_add_column_if_missing(db, "dbmem_cache", "n_tokens",
+        "ALTER TABLE dbmem_cache ADD COLUMN n_tokens INTEGER NOT NULL DEFAULT 0;");
+    if (rc != SQLITE_OK) return rc;
+
+    return dbmem_database_add_column_if_missing(db, "dbmem_cache", "truncated",
+        "ALTER TABLE dbmem_cache ADD COLUMN truncated INTEGER NOT NULL DEFAULT 0;");
+}
+
+static int dbmem_database_migrate (sqlite3 *db) {
+    int version = 0;
+    int rc = dbmem_database_schema_version(db, &version);
+    if (rc != SQLITE_OK) return rc;
+
+    if (version > DBMEM_SCHEMA_VERSION) return SQLITE_MISMATCH;
+    if (version <= 0) version = 1;
+
+    if (version < 2) {
+        rc = dbmem_database_migrate_v1_to_v2(db);
+        if (rc != SQLITE_OK) return rc;
+        version = 2;
+        rc = dbmem_database_set_schema_version(db, version);
+        if (rc != SQLITE_OK) return rc;
+    }
+
+    if (version != DBMEM_SCHEMA_VERSION) return SQLITE_MISMATCH;
+    return SQLITE_OK;
+}
+
 static int dbmem_database_init (sqlite3 *db) {
     const char *sql = "CREATE TABLE IF NOT EXISTS dbmem_settings (key TEXT PRIMARY KEY, value TEXT);";
     int rc = sqlite3_exec(db, sql, NULL, NULL, NULL);
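The version-stepping logic in `dbmem_database_migrate` above can be illustrated without SQLite. The sketch below is a simplified stand-in, not the committed code: it replaces the `if (version < 2)` chain with a table of step callbacks, and every name in it (`SKETCH_SCHEMA_VERSION`, `migration_step_fn`, `schema_migrate`, `sketch_step_v1_to_v2`) is hypothetical. It keeps the same three rules as the diff: a missing or non-positive stored version is treated as 1, a stored version newer than the build is rejected, and the version is advanced only after a step succeeds.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SCHEMA_VERSION 2   /* mirrors DBMEM_SCHEMA_VERSION in the diff */

/* A migration step returns 0 on success; `db` is an opaque handle here. */
typedef int (*migration_step_fn)(void *db);

/* Stand-in for the v1 -> v2 step: it only counts how often it was called. */
int sketch_step_v1_to_v2(void *db) {
    int *calls = db;
    (*calls)++;
    return 0;
}

/* steps[i] upgrades schema version i+1 to i+2. On success *out_version holds
 * the final version; the real code persists it in dbmem_settings instead. */
int schema_migrate(void *db, int stored_version,
                   const migration_step_fn *steps, size_t n_steps,
                   int *out_version) {
    assert(steps != NULL && out_version != NULL);

    if (stored_version > SKETCH_SCHEMA_VERSION) return -1;   /* written by a newer build */
    int version = stored_version <= 0 ? 1 : stored_version;  /* no stored key means v1 */

    while (version < SKETCH_SCHEMA_VERSION) {
        size_t idx = (size_t)(version - 1);
        if (idx >= n_steps) return -1;        /* no step registered for this version */
        int rc = steps[idx](db);
        if (rc != 0) return rc;               /* stop without advancing the version */
        version++;
    }

    *out_version = version;
    return 0;
}
```

A table of steps only pays off once there are several migrations; with a single v1 to v2 step, the explicit `if` chain in the diff is just as clear.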
@@ -367,14 +469,17 @@ static int dbmem_database_init (sqlite3 *db) {
     rc = sqlite3_exec(db, sql, NULL, NULL, NULL);
     if (rc != SQLITE_OK) return rc;

-    sql = "CREATE TABLE IF NOT EXISTS dbmem_vault (hash TEXT NOT NULL, seq INTEGER NOT NULL, embedding BLOB NOT NULL, offset INTEGER NOT NULL, length INTEGER NOT NULL, PRIMARY KEY (hash, seq));";
+    sql = "CREATE TABLE IF NOT EXISTS dbmem_vault (hash TEXT NOT NULL, seq INTEGER NOT NULL, embedding BLOB NOT NULL, offset INTEGER NOT NULL, length INTEGER NOT NULL, n_tokens INTEGER NOT NULL DEFAULT 0, truncated INTEGER NOT NULL DEFAULT 0, PRIMARY KEY (hash, seq));";
     rc = sqlite3_exec(db, sql, NULL, NULL, NULL);
     if (rc != SQLITE_OK) return rc;

-    sql = "CREATE TABLE IF NOT EXISTS dbmem_cache (text_hash TEXT NOT NULL, provider TEXT NOT NULL, model TEXT NOT NULL, embedding BLOB NOT NULL, dimension INTEGER NOT NULL, PRIMARY KEY (text_hash, provider, model));";
+    sql = "CREATE TABLE IF NOT EXISTS dbmem_cache (text_hash TEXT NOT NULL, provider TEXT NOT NULL, model TEXT NOT NULL, embedding BLOB NOT NULL, dimension INTEGER NOT NULL, n_tokens INTEGER NOT NULL DEFAULT 0, truncated INTEGER NOT NULL DEFAULT 0, PRIMARY KEY (text_hash, provider, model));";
     rc = sqlite3_exec(db, sql, NULL, NULL, NULL);
     if (rc != SQLITE_OK) return rc;

+    rc = dbmem_database_migrate(db);
+    if (rc != SQLITE_OK) return rc;
+
     sql = "CREATE VIRTUAL TABLE IF NOT EXISTS dbmem_vault_fts USING fts5 (content, hash UNINDEXED, seq UNINDEXED, context UNINDEXED);";
     rc = sqlite3_exec(db, sql, NULL, NULL, NULL);
     if (rc != SQLITE_OK) {
@@ -495,7 +600,7 @@ static int dbmem_database_add_entry (dbmem_context *ctx, sqlite3 *db, uint64_t h
 }

 static int dbmem_database_add_chunk (dbmem_context *ctx, embedding_result_t *result, size_t offset, size_t length, size_t index) {
-    static const char *sql = "INSERT INTO dbmem_vault (hash, seq, embedding, offset, length) VALUES (?1, ?2, ?3, ?4, ?5);";
+    static const char *sql = "INSERT INTO dbmem_vault (hash, seq, embedding, offset, length, n_tokens, truncated) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7);";

     sqlite3_stmt *vm = NULL;
     int rc = sqlite3_prepare_v2(ctx->db, sql, -1, &vm, NULL);
@@ -515,6 +620,12 @@ static int dbmem_database_add_chunk (dbmem_context *ctx, embedding_result_t *res

     rc = sqlite3_bind_int64(vm, 5, (sqlite3_int64)length);
     if (rc != SQLITE_OK) goto cleanup;
+
+    rc = sqlite3_bind_int(vm, 6, result->n_tokens);
+    if (rc != SQLITE_OK) goto cleanup;
+
+    rc = sqlite3_bind_int(vm, 7, result->truncated ? 1 : 0);
+    if (rc != SQLITE_OK) goto cleanup;

     rc = sqlite3_step(vm);
     if (rc == SQLITE_DONE) rc = SQLITE_OK;
@@ -1267,7 +1378,7 @@ static void dbmem_dump_embeding (const embedding_result_t *result) {
 // MARK: - Embedding Cache -

 static bool dbmem_cache_lookup (dbmem_context *ctx, uint64_t text_hash, embedding_result_t *result) {
-    static const char *sql = "SELECT embedding, dimension FROM dbmem_cache WHERE text_hash=?1 AND provider=?2 AND model=?3 LIMIT 1;";
+    static const char *sql = "SELECT embedding, dimension, n_tokens, truncated FROM dbmem_cache WHERE text_hash=?1 AND provider=?2 AND model=?3 LIMIT 1;";

     if (!ctx->provider || !ctx->model) return false;

@@ -1300,8 +1411,8 @@ static bool dbmem_cache_lookup (dbmem_context *ctx, uint64_t text_hash, embeddin
     memcpy(ctx->cache_buffer, blob, blob_bytes);
     result->embedding = ctx->cache_buffer;
     result->n_embd = dimension;
-    result->n_tokens = 0;
-    result->truncated = false;
+    result->n_tokens = sqlite3_column_int(vm, 2);
+    result->truncated = sqlite3_column_int(vm, 3) != 0;
     found = true;

 cleanup:
@@ -1337,7 +1448,7 @@ static void dbmem_cache_evict (dbmem_context *ctx) {
 }

 static void dbmem_cache_store (dbmem_context *ctx, uint64_t text_hash, const embedding_result_t *result) {
-    static const char *sql = "INSERT OR REPLACE INTO dbmem_cache (text_hash, provider, model, embedding, dimension) VALUES (?1, ?2, ?3, ?4, ?5);";
+    static const char *sql = "INSERT OR REPLACE INTO dbmem_cache (text_hash, provider, model, embedding, dimension, n_tokens, truncated) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7);";

     if (!ctx->provider || !ctx->model) return;

@@ -1350,6 +1461,8 @@ static void dbmem_cache_store (dbmem_context *ctx, uint64_t text_hash, const emb
     sqlite3_bind_text(vm, 3, ctx->model, -1, SQLITE_STATIC);
     sqlite3_bind_blob(vm, 4, result->embedding, result->n_embd * (int)sizeof(float), SQLITE_STATIC);
     sqlite3_bind_int(vm, 5, result->n_embd);
+    sqlite3_bind_int(vm, 6, result->n_tokens);
+    sqlite3_bind_int(vm, 7, result->truncated ? 1 : 0);

     sqlite3_step(vm);

src/sqlite-memory.h

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@
 extern "C" {
 #endif

-#define SQLITE_DBMEMORY_VERSION "1.1.0"
+#define SQLITE_DBMEMORY_VERSION "1.2.0"

 // public API
 SQLITE_DBMEMORY_API int sqlite3_memory_init (sqlite3 *db, char **pzErrMsg, const sqlite3_api_routines *pApi);
