Install dependencies with opam install --deps-only .
Build with make
esgg takes as input an ES mapping (schema description) and an actual query (with a syntax for variables, described below).
Additional information is often needed to map ES fields to proper OCaml types. This is achieved by attaching a _meta annotation
object to the affected field (ES only supports _meta at the root level, so these annotations make it impossible to store the
extended mapping back into ES, which is a pity), as follows:

"counts": {
  "_meta": {
    "optional": true
  },
  "properties": {
    "hash": {
      "type": "long",
      "_meta": { "repr": "int64" }
    },
    "value": {
      "type": "long"
    }
  }
},

Supported _meta attributes:
- {"list":true} - property is an array (mapped to list)
- {"list":"sometimes"} - property is either an array or a single element (mapped to json with a custom OCaml module wrap that will need to be provided in scope)
- {"optional":<true|false>} - property may be missing (mapped to option)
- {"ignore":true} - skip property altogether
- {"fields_default_optional":true} - any subfield may be missing (can be overridden by per-field optional:false)
- {"repr":"int64"} - override ES type; currently the only possible value is "int64", to ensure no bits are lost (by default long is mapped to OCaml int)
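To illustrate why the repr override exists: OCaml's native int is narrower than 64 bits (63 bits on 64-bit platforms), so a large ES long cannot always round-trip through it. The snippet below is a standalone illustration using only the standard library; it is not code produced by esgg.

```ocaml
(* OCaml's native int has one bit fewer than the machine word,
   so converting the largest int64 to int drops the top bits. *)
let big : int64 = Int64.max_int

(* Round-trip the value through the native int type. *)
let round_tripped : int64 = Int64.of_int (Int64.to_int big)

(* True only if the value survived the round trip unchanged. *)
let lossless = Int64.equal big round_tripped

let () =
  Printf.printf "original:      %Ld\n" big;
  Printf.printf "round-tripped: %Ld\n" round_tripped;
  Printf.printf "lossless:      %b\n" lossless
```

Fields whose values can exceed the native int range (hashes are a typical case) are the ones that likely want {"repr":"int64"}.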
Generated code allows using application types for any field. This is achieved by referencing a specific type for each field in the generated
code, instead of the primitive type from the mapping, so that the consumer of the code can map it onto a custom type. For example, the field
hash in the example above will have type Counts.Hash.t in the generated code. To compile the generated code, this type must be present
in scope and mapped to something useful. A default mapping (which just maps everything to the corresponding primitive types) can be generated
with esgg reflect <mapping name> <mapping.json>, e.g.:
esgg reflect hello_world src/mappings/hello_world.json >> src/mapping.ml

will generate the following, which should be edited manually as needed, e.g. by making Hash a module with an abstract type:

module Counts = struct
  module Hash = Id_(Int64_)
  module Value = Id_(Long_)
end
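As a rough sketch of the manual edit mentioned above, Hash could be replaced with a module exposing an abstract type. The wrap/unwrap names follow the atdgen convention for custom wrapped types; whether esgg's generated code requires exactly this signature is an assumption here, and the Hash_id name is hypothetical.

```ocaml
(* Hypothetical replacement for the generated Id_(Int64_) alias.
   The signature hides the representation, so a hash cannot be
   confused with any other int64 value in application code. *)
module Hash_id : sig
  type t
  val wrap : int64 -> t        (* assumed to be called when decoding *)
  val unwrap : t -> int64      (* assumed to be called when encoding *)
  val to_string : t -> string
end = struct
  type t = int64
  let wrap x = x
  let unwrap x = x
  let to_string = Int64.to_string
end

module Counts = struct
  module Hash = Hash_id
  (* Value is left out of this sketch; in the real file it would keep
     its generated definition. *)
end

let () =
  let h = Counts.Hash.wrap 42L in
  print_endline (Counts.Hash.to_string h)
```

With this in scope, Counts.Hash.t is opaque to the rest of the application, and conversions have to go through wrap and unwrap explicitly.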
Syntax for variables in template json files is as follows:
- $var for regular required variable
- $var? for optional variable (minimal surrounding scope is conditionally expunged)
- full form $(var:hint) where hint can be either list or list? currently
The _esgg field can be added to query templates to configure code generation behavior. This field is automatically filtered out before sending queries to Elasticsearch.
Supported configuration options:
- {"matched_queries": true} - Include matched_queries field in output types even when _name is not explicitly present in the query template. This is useful when _name is defined inside query variables.
- {"inner_hits": [ ... ]} - Declare inner hits to include in output types even if the corresponding nested queries are provided via base/shared queries. Each entry describes one nested path.
Example:
{
  "_esgg": {
    "matched_queries": true
  },
  "query": $query,
  "size": 10
}

When inner hits are defined inside a base/shared query (not visible in this template), declare them explicitly so esgg can generate typed inner_hits in the output:
{
  "_esgg": {
    "inner_hits": [
      {
        "path": "comments",              // required: nested path in the mapping
        "name": "comments",              // optional: key under inner_hits (defaults to path)
        "size": 100,                     // optional
        "from": 0,                       // optional
        "_source": ["fieldA","fieldB"],  // optional: standard ES source filtering for inner hits
        "stored_fields": ["storedA"],    // optional
        "highlight": {                   // optional: ES highlight shape; fields keys are collected
          "fields": { "comments.text": {} }
        }
      }
    ]
  },
  "query": $query
}

To reuse shared definitions using the -shared <file.atd> option, the atd file must have the <esgg from="..."> annotation at the top of the file.
The value of the annotation must correspond to the OCaml module containing the shared definitions.
Example:
# file.atd
<esgg from="Your_ocaml_module_name">
...atd type definitions...
TODO document what is supported
Some notes follow:
The following aggregation types are supported:
- avg - Average of numeric values
- sum - Sum of numeric values
- min - Minimum value
- max - Maximum value
- value_count - Count of values
- cardinality - Approximate count of distinct values
- top_hits - Top matching documents per bucket
- stats - Basic statistics (count, min, max, avg, sum)
- extended_stats - Extended statistics (variance, std deviation, etc.)
- percentiles - Percentile calculations
- percentile_ranks - Percentile ranks
- median_absolute_deviation - MAD calculation
- weighted_avg - Weighted average
- geo_bounds - Geographic bounding box
- scripted_metric - Custom scripted metrics
- terms - Buckets by field values
- histogram - Fixed-size numeric interval buckets
- date_histogram - Fixed calendar/time interval buckets
- range - Numeric range buckets
- date_range - Date range buckets
- filter - Single filter bucket
- filters - Multiple filter buckets
- nested - Nested document aggregation
- reverse_nested - Reverse nested aggregation
- significant_terms - Significant terms analysis
- significant_text - Significant text analysis
- auto_date_histogram - Automatic date histogram interval
- ip_range - IP address range buckets
- geo_distance - Geographic distance buckets
- geohash_grid - Geohash grid buckets
- geotile_grid - Geotile grid buckets
- geohex_grid - Geohex grid buckets
- global - Global aggregation (all documents)
- missing - Missing field values
- children - Child documents aggregation
- parent - Parent documents aggregation
- sampler - Sampler aggregation
- diversified_sampler - Diversified sampler
- composite - Composite aggregation for pagination
- multi_terms - Multi-field terms aggregation
- adjacency_matrix - Adjacency matrix
- categorize_text - Text categorization
- frequent_item_sets - Frequent item sets
- random_sampler - Random sampling
- cumulative_sum - Cumulative sum across buckets
- bucket_sort - Sort and limit buckets
- avg_bucket - Average of bucket values
- max_bucket - Maximum bucket value
- min_bucket - Minimum bucket value
- sum_bucket - Sum of bucket values
- stats_bucket - Statistics on bucket values
- extended_stats_bucket - Extended statistics on buckets
- percentiles_bucket - Percentiles of bucket values
- moving_avg - Moving average (deprecated)
- moving_fn - Moving function
- derivative - Derivative calculation
- bucket_script - Custom bucket calculations
- bucket_selector - Filter buckets by script
- serial_diff - Serial differencing
- matrix_stats - Matrix statistics
For the filters aggregation, the following variants and options are handled:
- named
- anonymous
- dynamic (i.e. a variable)
- partial dynamic (i.e. containing variables)
- other_bucket and other_bucket_key
- other_bucket with anonymous filters (ignored, the user is responsible for treating the last element of the result specially)
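Since the other bucket is not separated out for anonymous filters, the caller has to split off the last element by hand. A minimal sketch of such a helper follows; split_other is a hypothetical name, and the buckets are modelled as a plain list of doc counts purely for illustration.

```ocaml
(* Split a list of buckets into the regular buckets and the trailing
   "other" bucket. Returns None for an empty list. *)
let split_other (buckets : 'a list) : ('a list * 'a) option =
  match List.rev buckets with
  | [] -> None
  | other :: rest_rev -> Some (List.rev rest_rev, other)

let () =
  (* Hypothetical doc counts: two anonymous filters plus the other bucket. *)
  match split_other [ 10; 20; 5 ] with
  | Some (regular, other) ->
    Printf.printf "regular: %d buckets, other: %d\n" (List.length regular) other
  | None -> print_endline "no buckets"
```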
Dynamic (defined at runtime) filters are supported as follows: { "filters": { "filters": $x } }.
In this case the corresponding part of the output will be largely untyped: $x is assumed to be a dictionary, and the result will be
represented with dictionaries. For anonymous filters (i.e. an array of filters), use $(x:list).
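To give an idea of what consuming a dictionary-shaped result can look like, here is a small sketch. The bucket record, the buckets alias, and doc_count_of are all hypothetical; the types that esgg actually generates may differ.

```ocaml
(* Hypothetical shape of one bucket; only doc_count is modelled. *)
type bucket = { doc_count : int }

(* A dictionary of buckets keyed by the filter name chosen at runtime,
   modelled here as an association list. *)
type buckets = (string * bucket) list

(* Look up the doc count for a filter name that is only known at runtime. *)
let doc_count_of (name : string) (b : buckets) : int option =
  Option.map (fun x -> x.doc_count) (List.assoc_opt name b)

let sample : buckets =
  [ "errors", { doc_count = 3 }; "warnings", { doc_count = 7 } ]

let () =
  match doc_count_of "errors" sample with
  | Some n -> Printf.printf "errors: %d\n" n
  | None -> print_endline "errors: no such filter"
```

Because the keys are not known at code generation time, a missing filter name can only be detected at runtime, which is why the lookup returns an option.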
key_as_string is returned in output only when format is explicitly specified, to discourage fragile code.
A keyed aggregation expects an explicit key for each range. The from/to fields in the response are not extracted.
Same as for range aggregation.
Specifying an aggregation as a variable ($var) will lead to untyped json in place of the aggregation output. This can be used as a temporary
workaround for unsupported aggregation types or for truly dynamic use cases (aggregations built at run-time).
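Working with such untyped output means pattern matching on json by hand. The sketch below shows the idea; the json type is a local stand-in loosely modelled on Yojson.Basic.t, and which json representation the generated code actually exposes is an assumption not confirmed here. The member and agg_value helpers are hypothetical.

```ocaml
(* Local stand-in for a json value type. *)
type json =
  [ `Null
  | `Bool of bool
  | `Int of int
  | `Float of float
  | `String of string
  | `Assoc of (string * json) list
  | `List of json list ]

(* Look up a key in a json object, if the value is an object at all. *)
let member (key : string) (j : json) : json option =
  match j with
  | `Assoc fields -> List.assoc_opt key fields
  | _ -> None

(* Pull a numeric "value" field out of an untyped aggregation result. *)
let agg_value (j : json) : float option =
  match member "value" j with
  | Some (`Float f) -> Some f
  | Some (`Int i) -> Some (float_of_int i)
  | _ -> None

let () =
  let raw : json = `Assoc [ "value", `Float 12.5 ] in
  match agg_value raw with
  | Some v -> Printf.printf "value: %g\n" v
  | None -> print_endline "value: missing"
```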
Scripts are opaque, i.e. no type information is extracted and the result is json.
- exclude and include
- wildcards
- dynamic (i.e. a variable) NB not implemented for get and mget
make test runs the regression tests in test/, verifying that the input and output atd generated from each query stay unchanged.
When a change in the generated output is expected, the updated result should be committed.
Tests are easy to add and fast to run.
TODO tests to verify that:
- code generated for applying input variables to a query does actually compile and produces the correct query when run
- atd description of output (generated from query) can indeed deserialize the ES output of that actual query
Copyright (c) 2018 Ahrefs [email protected]
This project is distributed under the terms of GPL Version 2. See LICENSE file for full license text.
NB the output of esgg, i.e. the generated code, is all yours of course :)