Releases: clojure-finance/datajure
v2.0.9
A post-alpha audit pass reconciling the library with data.table-style semantics, plus a handful of correctness fixes uncovered by REPL verification of the DSL's per-partition execution paths.
Changed
qtile now uses per-partition breakpoints when combined with exact keys in :by. Previously :by [:date (qtile :mktcap 5)] computed breakpoints once from the whole dataset and silently produced wrong answers for per-date cross-sectional sorts (the canonical CRSP / Fama-French size sort). It now partitions by the exact keys first, then resolves qtile against each sub-dataset, matching data.table / dplyr / q:
;; Per-date size quintiles: now works the obvious way
(core/dt stocks :by [:date (core/qtile :mktcap 5)]
  :agg {:mean-ret #dt/e (mn :ret)})
;; Per-date NYSE-style breakpoints applied to all stocks (Fama-French size sort)
(core/dt stocks :by [:date (core/qtile :mktcap 5 :from #dt/e (= :exchcd 1))]
  :agg {:mean-ret #dt/e (mn :ret)})
An audit of every other DSL feature (stat/*, aggregations inside composite #dt/e, cut :from, win/*, row/*, xbar, join :asof/:window) confirmed qtile was the only outlier: every other feature already ran per-partition by virtue of living inside apply-group-*. Pure-qtile :by (no exact keys) still resolves globally since there is nothing to partition by.
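The difference between global and per-partition breakpoints can be sketched in plain Clojure. This is an illustration only: it uses maps rather than datasets, the helper names (breakpoints, bin-of, global-bins, per-date-bins) are hypothetical, and the nearest-rank rule is an assumption that need not match what dfn/percentiles computes.

```clojure
;; Illustrative sketch, not the datajure implementation.
(defn breakpoints
  "Return the n-1 interior cut points splitting xs into n roughly equal-count bins
  (nearest-rank rule; an assumption made for this example)."
  [xs n]
  (let [sorted (vec (sort xs))
        cnt    (count sorted)]
    (mapv (fn [k]
            (let [rank (long (Math/ceil (* cnt (/ k (double n)))))]
              (nth sorted (dec (max 1 rank)))))
          (range 1 n))))

(defn bin-of
  "1-based bin index; a value equal to a breakpoint goes to the lower bin."
  [bps x]
  (inc (count (filter #(< % x) bps))))

(defn global-bins
  "Old behaviour: one set of breakpoints for the whole dataset."
  [rows n]
  (let [bps (breakpoints (map :mktcap rows) n)]
    (mapv #(bin-of bps (:mktcap %)) rows)))

(defn per-date-bins
  "New behaviour: partition by :date first, then compute breakpoints per partition."
  [rows n]
  (let [bps-by-date (into {}
                          (for [[d grp] (group-by :date rows)]
                            [d (breakpoints (map :mktcap grp) n)]))]
    (mapv #(bin-of (bps-by-date (:date %)) (:mktcap %)) rows)))

(def stocks
  [{:date 1 :mktcap 10}   {:date 1 :mktcap 20}
   {:date 2 :mktcap 1000} {:date 2 :mktcap 2000}])

(global-bins stocks 2)   ;; => [1 1 2 2]  every date-1 stock lands in the bottom bin
(per-date-bins stocks 2) ;; => [1 2 1 2]  each date gets its own cross-sectional split
```

With global breakpoints the bins mostly track the date rather than relative size within a date, which is why per-date sorts came out wrong.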
Added
- cast: long→wide reshaping. Complement to melt. For each unique combination of :id column values, pivots the :from column's distinct values into new columns filled from the :value column. New column names are derived from :from values (keywords pass through; strings are converted via keyword). Supports :agg for duplicate cells and :fill for missing cells (default nil). melt/cast round-trip correctly.
;; Reverse a melt
(-> ds
    (melt {:id [:species :year] :measure [:mass :flipper]})
    (cast {:id [:species :year] :from :variable :value :value}))
;; With aggregation for duplicate (id, from) cells
(cast ds {:id [:date :sym] :from :metric :value :val :agg dfn/mean})
- cast accepts a single-keyword :id. Normalised to a one-element vector, matching melt. Previously a single-keyword :id errored with "Don't know how to create ISeq from: clojure.lang.Keyword".
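The pivot described above can be sketched on plain Clojure maps. The function name cast-rows is hypothetical and this is not the library's code; :agg is left out, so a duplicate (id, from) cell simply keeps the last value seen.

```clojure
;; Illustrative sketch of long->wide pivoting on a seq of maps.
(defn cast-rows
  "One output row per distinct :id combination; each distinct value of the
  `from` column becomes a new keyword column filled from the `value` column."
  [rows {:keys [id from value fill]}]
  (let [id-cols  (if (keyword? id) [id] id)              ; single keyword -> one-element vector
        new-cols (distinct (map (comp keyword from) rows))]
    (for [[k grp] (group-by #(select-keys % id-cols) rows)]
      (merge k
             (zipmap new-cols (repeat fill))             ; missing cells default to fill (nil)
             (into {} (map (fn [r] [(keyword (from r)) (value r)]) grp))))))

(def long-rows
  [{:species "A" :variable "mass"    :value 10}
   {:species "A" :variable "flipper" :value 3}
   {:species "B" :variable "mass"    :value 12}])

(set (cast-rows long-rows {:id :species :from :variable :value :value}))
;; => #{{:species "A" :mass 10 :flipper 3}
;;      {:species "B" :mass 12 :flipper nil}}
```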
Fixed
- row/sum, row/mean, row/min, row/max on all-nil rows. The four row-wise aggregators declared :float64 readers but claimed in their docstrings to return nil when every input is nil. Primitive float readers cannot hold nil, so the value was silently coerced to NaN, contradicting the docstring. Readers are now :object, so all-nil rows honestly return nil. row-min/row-max also cast non-nil results to double to preserve the always-numeric-result convention.
- wavg/wsum with mismatched column lengths. Previously silently truncated (when the weight column was shorter) or NPE'd (when the value column was shorter). Now throws a structured :unequal-column-lengths ex-info with :dt/op, :dt/weight-length, and :dt/value-length in ex-data.
- join :asof :tolerance on datetime asof keys. Previously produced a raw java.lang.ClassCastException from an unguarded (double dt-value) coercion inside within-tolerance?. Now validates numerically-compatible asof key types upfront and throws a structured :join-tolerance-non-numeric ex-info with actionable guidance (convert to epoch-milliseconds). Symmetric and asymmetric join key shapes (:on vs :left-on/:right-on) both report the correct column names in ex-data.
- describe on all-missing numeric columns. When every value in a numeric column was missing, dfn/standard-deviation returned -0.0 while mean, min, max, median, and percentiles correctly returned nil, producing an incoherent summary row. describe-column now routes all-missing numeric columns through the nil-filled branch (same as non-numeric columns).
- parse-window-spec now strict-validates window spec shape. Previously the implementation destructured [a b c :as wspec] and silently dropped any trailing elements. Malformed specs now throw a structured :join-invalid-window ex-info: trailing junk, wrong arity, non-numeric endpoints, non-vector specs, and misplaced unit keywords are all rejected upfront. Valid [lo hi], [lo hi unit], and [lo unit hi] shapes behave exactly as before.
- count-distinct now excludes nil. The fn included nil in its distinct count, contradicting its docstring ("non-nil values"). Fixed by filtering some? before distinct.
- qtile and cut now use the same breakpoint algorithm. qtile's percentile-breakpoints previously used a floor-index approximation that produced different breakpoints than cut-bucket (which uses dfn/percentiles). The two now share dfn/percentiles, so qtile :mktcap 5 and #dt/e (cut :mktcap 5) produce identical bins for the same population.
- Breakpoint-at-exact-value semantics unified across qtile and cut. bin-via-breakpoints now uses <= (values equal to a breakpoint go to the lower bin), matching cut-bucket's java.util.Arrays/binarySearch exact-match behaviour. The previously passing qtile-from-basic test's assertion was wrong; corrected to reflect actual semantics.
- xbar/xbar-bucket now use Math/floorDiv. Previously used quot, which truncates toward zero, so negative values bucketed incorrectly relative to q's xbar semantics. E.g. (xbar -3 5) now returns -5 rather than 0.
- validate-expr-cols and validate-select-cols NPE on zero-column datasets. Both helpers computed (->> avail-names (map ...) (sort-by second) first), which returned nil when avail-names was empty. (second nil) yielded nil, and (<= nil 3) threw NullPointerException. Guarded with (and closest ...) in the suggestion-emission branch.
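Two of the fixes above come down to one-line differences that are easy to check at the REPL. The sketch below uses hypothetical standalone helpers (xbar-quot, xbar-floor, count-distinct-non-nil), written here for illustration rather than taken from the library, and restricted to integer inputs.

```clojure
;; Truncating vs flooring bucket assignment (integer inputs only in this sketch).
(defn xbar-quot
  "Old behaviour: quot truncates toward zero."
  [x width]
  (* width (quot x width)))

(defn xbar-floor
  "New behaviour: Math/floorDiv rounds toward negative infinity."
  [x width]
  (* width (Math/floorDiv (long x) (long width))))

(xbar-quot  -3 5) ;; => 0   (-3 wrongly shares a bucket with 0..4)
(xbar-floor -3 5) ;; => -5  (bucket [-5, 0))
(xbar-floor  7 5) ;; => 5   (non-negative inputs are unaffected)

;; Distinct count that ignores nil, as the count-distinct docstring promises.
(defn count-distinct-non-nil [xs]
  (count (distinct (filter some? xs))))

(count-distinct-non-nil [1 nil 2 2 nil]) ;; => 2
```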
Developer experience
win/scan op normalisation mirrors win/each-prior. The parser now prefers sym->op for the canonical keyword, so invalid scan ops like / resolve to :div (consistent with the rest of the codebase) rather than a keyword literally spelled with a slash. Valid scan ops (+, *, max, min) are unchanged in both AST and runtime.
Internal
- Damerau-Levenshtein deduplication. The edit-distance implementation used by typo suggestions was duplicated byte-for-byte in datajure.core and datajure.expr. Extracted to the public datajure.expr/damerau-levenshtein as the single source of truth; both validate-expr-cols/validate-select-cols and suggest-op now call it.
- Dead win-ops set removed from datajure.expr (unused, and stale: missing win/scan and win/each-prior).
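For readers unfamiliar with the edit distance behind the typo suggestions, here is a minimal sketch of the optimal-string-alignment variant of Damerau-Levenshtein, in which one swap of adjacent characters costs 1. The name osa-distance is hypothetical, and whether datajure.expr/damerau-levenshtein uses this restricted variant or the unrestricted one is an assumption.

```clojure
;; Illustrative sketch: row-by-row dynamic programme keeping the last two rows.
(defn osa-distance
  "Edit distance with insert, delete, substitute and adjacent transposition."
  [s t]
  (let [a (vec s)
        b (vec t)
        m (count a)
        n (count b)]
    (loop [i     1
           prev2 nil                       ; row i-2 (nil until i reaches 2)
           prev  (vec (range (inc n)))]    ; row i-1; row 0 is 0..n
      (if (> i m)
        (peek prev)
        (let [row (reduce
                   (fn [row j]
                     (let [cost (if (= (a (dec i)) (b (dec j))) 0 1)
                           best (min (inc (prev j))             ; deletion
                                     (inc (row (dec j)))        ; insertion
                                     (+ (prev (dec j)) cost))   ; substitution or match
                           best (if (and prev2
                                         (> j 1)
                                         (= (a (dec i)) (b (- j 2)))
                                         (= (a (- i 2)) (b (dec j))))
                                  (min best (inc (prev2 (- j 2)))) ; transposition
                                  best)]
                       (conj row best)))
                   [i]                     ; column 0 of row i
                   (range 1 (inc n)))]
          (recur (inc i) prev row))))))

(osa-distance "kitten" "sitting") ;; => 3
(osa-distance "hieght" "height")  ;; => 1 (one adjacent transposition)
```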
Testing
- Test count: 310 → 318 (+8 new deftests, +89 assertions). CI subset: 268/901 → 276/989. All passing.
- New deftests: row-fns-all-nil-returns-nil-not-nan, wavg-wsum-unequal-lengths, asof-tolerance-non-numeric-error-test, describe-all-missing-numeric, wjoin-invalid-window-shape-test, qtile-per-group-breakpoints, qtile-from-with-exact-key. The existing qtile-combined-with-keyword test was strengthened from a column-names-only check to an assertion of per-group bin counts (would fail loudly against the old global-breakpoint implementation).
v2.0.8
What's new in 2.0.8
qtile :from -- Reference-subpopulation breakpoints for :by grouping. Compute quintile boundaries from a filtered subset (e.g. NYSE stocks) and apply them to all rows; the :by-side equivalent of #dt/e (cut :col n :from pred).
win/each-prior -- Generalized adjacent-element operator. Applies any binary operator f as f(x[i], x[i-1]). Supports +, -, *, /, max, min, and comparison operators.
Bounded as-of joins -- :direction (:backward/:forward/:nearest) and :tolerance options on :how :asof.
Window join (:how :window, q's wj) -- For each left row, aggregates all right rows within a time window. Takes :window [-5 0 :minutes] plus an :agg map, with full temporal unit support.
CI fix -- stat_test.clj added to CI; tech.ml.dataset dep corrected to 8.007 in run-tests.sh.
Full details in the CHANGELOG.
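The window-join idea can be sketched on plain maps. Everything here is illustrative: window-join and mean-px are hypothetical names, times are plain integers (no temporal units), and treating both window endpoints as inclusive is an assumption rather than documented behaviour.

```clojure
;; Illustrative sketch: for each left row, aggregate the right rows of the same
;; :sym whose :t falls in [t+lo, t+hi].
(defn window-join
  [left right [lo hi] agg-fn]
  (mapv (fn [l]
          (let [t    (:t l)
                hits (filter (fn [r]
                               (and (= (:sym l) (:sym r))
                                    (<= (+ t lo) (:t r) (+ t hi))))
                             right)]
            (assoc l :agg (agg-fn hits))))
        left))

(defn mean-px
  "Mean of :px over the matched rows, or nil when nothing matched."
  [rows]
  (when (seq rows)
    (/ (reduce + (map :px rows)) (count rows))))

(def trades [{:sym "A" :t 10}])
(def quotes [{:sym "A" :t 4  :px 1.0}
             {:sym "A" :t 6  :px 2.0}
             {:sym "A" :t 10 :px 4.0}
             {:sym "A" :t 11 :px 9.0}
             {:sym "B" :t 8  :px 100.0}])

(window-join trades quotes [-5 0] mean-px)
;; => [{:sym "A", :t 10, :agg 3.0}]   ; quotes at t=6 and t=10 fall in [5, 10]
```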
v2.0.7 -- qtile
qtile: quantile grouping for :by
The :by-friendly companion to #dt/e (cut ...). Produces an equal-count bin assignment from a column distribution, computed once from the dataset before grouping. Use it to group by quantile rather than derive a column of quantile bins.
;; Quintile buckets of market cap
(core/dt stocks :by [(core/qtile :mktcap 5)]
:agg {:n core/nrow :mean-ret #dt/e (mn :ret)})
;; Result column is auto-named :mktcap-q5
;; Per-date size quintiles combined with an exact key
(core/dt stocks :by [:date (core/qtile :mktcap 5)]
:agg {:mean-ret #dt/e (mn :ret)})
Companion to xbar (equal-width bins) with a symmetric API. Result column auto-named <col>-q<n> (e.g. :mktcap-q5); override via :datajure/col metadata. Nil inputs form their own group.
Added
- qtile quantile grouping for :by (phase 52). See above.
Changed
- by->group-fn now receives the dataset. Internal refactor; no user-visible behaviour change. Enables :by markers that require population-level statistics (like qtile) to precompute state before grouping.
Testing
- 273 tests, 913 assertions (CI subset: 206 / 751). +6 deftests, +17 assertions over v2.0.6.
See CHANGELOG.md for details.
v2.0.6
[2.0.6] - 2026-04-17
Added
- :within-order with :agg. :within-order can now be combined with :agg (with or without :by), sorting rows within each partition (or across the whole dataset) before aggregation. This enables order-sensitive aggregations like OHLC bar construction in a single dt call:
(core/dt trades
  :by [:sym]
  :within-order [(core/asc :time)]
  :agg {:open #dt/e (first-val :price)
        :close #dt/e (last-val :price)
        :vol #dt/e (sm :size)
        :n core/N})
Previously this required two dt calls (pre-sort via :order-by, then aggregate). The restriction that :within-order required :set has been relaxed to ":within-order requires :set or :agg."
- nrow alias for N. core/nrow is now exported alongside core/N as a more discoverable full-name alternative for row counting in :agg. Both are equivalent.
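Why ordering matters for first/last aggregations can be seen in a plain-Clojure sketch of the same OHLC computation. This mimics the semantics on a vector of maps and does not use the dt API; ohlc-bars and the sample data are hypothetical.

```clojure
;; Illustrative sketch: sort within each group, then aggregate.
(def trades
  [{:sym "A" :time 2 :price 11.0 :size 100}
   {:sym "A" :time 1 :price 10.0 :size 200}
   {:sym "A" :time 3 :price 12.5 :size 50}
   {:sym "B" :time 1 :price 7.0  :size 10}])

(defn ohlc-bars
  "Open/close/volume/count per symbol, with rows ordered by :time inside each group."
  [rows]
  (into {}
        (for [[sym grp] (group-by :sym rows)
              :let [sorted (sort-by :time grp)]]   ; the :within-order step
          [sym {:open  (:price (first sorted))
                :close (:price (last sorted))
                :vol   (reduce + (map :size sorted))
                :n     (count sorted)}])))

(ohlc-bars trades)
;; => {"A" {:open 10.0, :close 12.5, :vol 350, :n 3}
;;     "B" {:open 7.0,  :close 7.0,  :vol 10,  :n 1}}
```

Without the sort, the open for "A" would be 11.0 (the first row as stored) rather than 10.0 (the earliest trade).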
Changed
- clean-column-names is now Unicode-aware. The regex that strips non-identifier characters no longer removes CJK, Cyrillic, Greek, accented Latin, and other non-ASCII letters and digits. Mixed-script column names like "Some Name (HKD millions)!" combined with CJK or accented-Latin characters are now preserved intact; only punctuation, whitespace, and symbols are replaced. Pure-ASCII column names behave exactly as before. See clean-column-names-unicode in the test suite for coverage of CJK, accented Latin, and mixed-script cases.
Fixed
- Edit-distance algorithm for typo suggestions. The Levenshtein implementation in both core.clj (column-name suggestions) and expr.clj (op-name suggestions) was producing incorrect edit distances (e.g., "kitten" to "sitting" returned 5 instead of 3), causing many legitimate suggestions to be dropped. The algorithm has been replaced with a correct Damerau-Levenshtein implementation that also treats single adjacent transpositions as distance 1. Typos like :hieght now correctly suggest :height.
- win/ratio no longer propagates Infinity on zero denominators. When the previous row's value is zero, win/ratio now returns nil rather than Infinity. This matches the div0 philosophy and gives the correct result for the canonical simple-return idiom (- (win/ratio :price) 1) -- a zero-price observation now yields nil for the next row's return, signalling "exclude" rather than contaminating downstream calculations.
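The zero-denominator rule for ratios can be sketched on a plain vector of prices. ratio-prior and simple-returns are hypothetical helpers written for this note; yielding nil for the first row and for nil inputs is an assumption about the intended semantics.

```clojure
;; Illustrative sketch of nil-on-zero-denominator ratios.
(defn ratio-prior
  "x[i] / x[i-1]; nil for the first row, for nil inputs, and when x[i-1] is zero."
  [xs]
  (if (empty? xs)
    []
    (into [nil]
          (map (fn [prev cur]
                 (when (and (some? prev) (some? cur) (not (zero? prev)))
                   (/ (double cur) (double prev))))
               xs (rest xs)))))

(defn simple-returns
  "Simple return r[i] = x[i]/x[i-1] - 1, propagating nil."
  [prices]
  (mapv #(when % (- % 1.0)) (ratio-prior prices)))

(simple-returns [100.0 110.0 0.0 5.0])
;; first element is nil (no prior row), the second is about 0.1,
;; the third is -1.0, and the fourth is nil because the prior price was zero
```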
Developer experience
- :agg plain-function footgun detection. Plain functions passed to :agg receive the group dataset, so #(:mass %) returns a column vector rather than a scalar -- a common mistake for users coming from :set context. Previously this silently produced nonsensical output (a column inside each result cell); now it throws a structured error with guidance pointing to either #(dfn/mean (:mass %)) or the preferred #dt/e (mn :mass).
- Unknown ops in #dt/e produce structured errors with suggestions. A typo like #dt/e (sqrt :x) previously caused a raw ClassCastException at runtime. It now throws an ex-info at read time with the suggestion: Unknown op `sqrt` in #dt/e expression. Did you mean: `sq`?. Namespaced ops (win/*, row/*, stat/*) get namespace-aware suggestions: win/mvag -> win/mavg, stat/standardise -> stat/standardize.
Testing
- Test count: 193 -> 200 (+7 new deftests, +72 assertions). All passing.
v2.0.5
Bug fixes:
- div0: fix crash on scalar denominator
- apply-group-agg/set: return empty dataset instead of nil on empty input
- melt: return empty dataset instead of nil when measure-cols is empty
- join :asof: :report option now works (was silently ignored)
- xbar-bucket/cut-bucket: replace fragile RoaringBitmap nil detection with idiomatic (nil? (nth rdr idx))
- by->group-fn: plain fn fallback key is now :fn-N instead of misleading :xbar-N; supports :datajure/col metadata for custom key names
v2.0.4
As-of join (:how :asof)
Inspired by q's aj. For each left row, finds the last right row where right-key <= left-key within an exact-match group. All left rows are always preserved.
;; Trade-quote matching
(join trades quotes :on [:sym :time] :how :asof)
;; Asymmetric key names
(join trades quotes :left-on [:sym :trade-time] :right-on [:sym :quote-time] :how :asof)
;; With cardinality validation
(join trades quotes :on [:sym :time] :how :asof :validate :m:1)
Changes
- New datajure.asof namespace: asof-search, asof-indices, asof-match, build-result
- datajure.join: :how :asof dispatch, :validate checks right side only
- 19 new tests (37 assertions), added to CI
- Fixes nil-at-midpoint bug in binary search
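The matching rule (last right row with right-key <= left-key inside an exact-match group, with every left row preserved) can be sketched on plain maps. asof-lookup and asof-join below are hypothetical helpers; they use a linear scan for clarity, whereas the release notes indicate the library uses a binary search.

```clojure
;; Illustrative sketch of a backward as-of join on :sym (exact) and :t (as-of).
(defn asof-lookup
  "Last row in `rights` (sorted ascending by :t) whose :t is <= t, or nil."
  [rights t]
  (last (take-while #(<= (:t %) t) rights)))

(defn asof-join
  [left right]
  (let [by-sym (into {}
                     (map (fn [[k v]] [k (sort-by :t v)]))
                     (group-by :sym right))]
    (mapv (fn [l]
            ;; Left values win on key clashes; unmatched left rows pass through as-is.
            (merge (dissoc (asof-lookup (by-sym (:sym l)) (:t l)) :t) l))
          left)))

(def trades [{:sym "A" :t 5 :px 10.0}
             {:sym "A" :t 1 :px 9.0}
             {:sym "C" :t 3 :px 1.0}])
(def quotes [{:sym "A" :t 2 :bid 9.5}
             {:sym "A" :t 4 :bid 9.8}
             {:sym "A" :t 6 :bid 10.1}])

(asof-join trades quotes)
;; => [{:sym "A", :bid 9.8, :t 5, :px 10.0}   ; latest quote at or before t=5
;;     {:sym "A", :t 1, :px 9.0}              ; no quote yet at t=1
;;     {:sym "C", :t 3, :px 1.0}]             ; no quotes for this symbol
```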
v2.0.3
New
- between column selector: (dt ds :select (core/between :month-01 :month-12)) selects all columns positionally between two endpoints (inclusive, reversed endpoints normalised)
- between re-exported in datajure.concise
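Positional selection with inclusive, order-insensitive endpoints can be sketched over a plain vector of column names. between-cols is a hypothetical helper for illustration, and throwing on an unknown endpoint is an assumption about error handling.

```clojure
;; Illustrative sketch: slice a column-name vector between two endpoints.
(defn between-cols
  [col-names a b]
  (let [cols (vec col-names)
        i    (.indexOf cols a)
        j    (.indexOf cols b)]
    (when (or (neg? i) (neg? j))
      (throw (ex-info "Unknown endpoint column" {:from a :to b})))
    (subvec cols (min i j) (inc (max i j)))))   ; inclusive; reversed endpoints normalised

(def cols [:id :month-01 :month-02 :month-03 :total])

(between-cols cols :month-01 :month-03) ;; => [:month-01 :month-02 :month-03]
(between-cols cols :month-03 :month-01) ;; => [:month-01 :month-02 :month-03]
```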
237 tests, 761 assertions.
v2.0.2
Fixes
- Add SCM connection and developerConnection fields to pom.xml for cljdoc compatibility
- Add ^:no-doc to datajure.nrepl to prevent cljdoc analysis failure (nrepl is an optional dev dependency)
No API changes — purely a publishing fix.
Installation
{:deps {com.github.clojure-finance/datajure {:mvn/version "2.0.2"}}}
Datajure v2.0.0
Ground-up rewrite. Not backwards-compatible with v1.
What's new
- Single opinionated syntax layer directly on tech.v3.dataset
- dt query function with six keywords: :where, :set, :agg, :by, :within-order, :select, :order-by
- #dt/e reader tag: vectorized, nil-safe, pre-validated column expressions
- Window functions (win/*): rank, lag, cumsum, delta, ratio, mavg, ema, fills, scan, and more
- Row-wise functions (row/*): sum, mean, min, max, count-nil, any-nil?
- Joins with cardinality validation and merge diagnostics
- Wide→long reshaping (melt)
- Unified file I/O dispatching on extension (CSV, TSV, Parquet, Arrow, Excel, Nippy)
- Data utilities: describe, clean-column-names, duplicate-rows, drop-constant-columns, coerce-columns
- Notebook integration: Clerk and Clay/Kindly viewers
- nREPL middleware for *dt* auto-binding
- 209 tests, 693 assertions
Prior work
v1 (multi-backend routing layer over tablecloth/clojask/geni) is preserved in the v1 branch. Credit to YANG Ming-Tian for the original v1 implementation.