Commit 6a29c8f

v0.5.0 (#96)
* Version 0.5.0
* pypgstac loader improvements
* rework pypgstac, linting, bug fixes
* remove srid lookup
* get tests working with temp database
* update poetry lock
* psycopg copy error on ci
* more tests, update readme
* bug fixes
* change typing for file input to allow iterator
* update poetry lock for black issue, run tests with pytest
* add code to change partitions and migrate data
* fix for assets in includes
* add incremental migration
* add migration, fix atexit
1 parent 9fc5027 commit 6a29c8f

41 files changed: +8756 -2280 lines changed

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -3,4 +3,5 @@ pypgstac/dist
 *.pyc
 *.egg-info
 *.eggs
-venv
+venv
+.direnv

CHANGELOG.md

Lines changed: 28 additions & 0 deletions
@@ -1,4 +1,32 @@
 # Changelog
+## [v0.5.0]
+Version 0.5.0 is a major refactor of how data is stored. It is recommended to start a new database from scratch and move data over rather than use the built-in migration, which will be very slow for larger amounts of data.
+
+### Fixed
+
+### Changed
+- The partition layout has been changed from a hardcoded partition per week to nested partitions. The first level is by collection. Each collection has an attribute partition_trunc, which can be set to NULL (no temporal partitions), month, or year.
+
+- CQL1 and Query Code have been refactored to translate to CQL2 to reduce duplicated code in query parsing.
+
+- Unused functions have been stripped from the project.
+
+- Pypgstac has been changed to use Fire rather than Typer.
+
+- Pypgstac has been changed to use Psycopg3 rather than Asyncpg to enable easier use as both sync and async.
+
+- Indexing has been reworked to eliminate indexes that logs showed were not being used. The global json index on properties has been removed. Indexes on individual properties can be added either globally or per collection using the new queryables table.
+
+- Triggers for maintaining partitions have been updated to reduce lock contention and to reflect the new data layout.
+
+- The data pager which optimizes "order by datetime" searches has been updated to get time periods from the new partition layout and partition metadata.
+
+- Tests have been updated to reflect the many changes.
+
+### Added
+
+- On ingest, the content in an item is compared to the metadata available at the collection level and duplicate information is stripped out (this is primarily data in the item_assets property). Logic has been added to merge this data back in when the data is used.
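The strip-on-ingest and merge-on-read idea in the entry above can be illustrated with a minimal Python sketch. This is not PGStac's implementation, which lives in the database. The names `strip_base` and `merge_base`, the `base` dictionary standing in for collection-level metadata, and the sample item are all hypothetical. The round trip shown here only holds when the item carries every key present in the base.

```python
from copy import deepcopy
from typing import Any, Dict


def strip_base(item: Dict[str, Any], base: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of item without the values that are identical in base."""
    stripped: Dict[str, Any] = {}
    for key, value in item.items():
        if key in base and value == base[key]:
            continue  # duplicate of the collection-level value, so drop it
        if key in base and isinstance(value, dict) and isinstance(base[key], dict):
            # Partially different mapping: recurse and keep only the differences.
            stripped[key] = strip_base(value, base[key])
        else:
            stripped[key] = deepcopy(value)
    return stripped


def merge_base(stripped: Dict[str, Any], base: Dict[str, Any]) -> Dict[str, Any]:
    """Rebuild the full item by layering the stored differences over base."""
    merged = deepcopy(base)
    for key, value in stripped.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_base(value, merged[key])
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical collection-level defaults and one item that repeats them.
base = {"assets": {"thumbnail": {"type": "image/png", "roles": ["thumbnail"]}}}
item = {
    "id": "a",
    "assets": {
        "thumbnail": {
            "type": "image/png",
            "roles": ["thumbnail"],
            "href": "https://example.com/a.png",
        }
    },
}

stored = strip_base(item, base)       # what would be kept per item
restored = merge_base(stored, base)   # what a reader would get back
```

Only the `id` and the asset `href` survive in `stored`; everything shared with the base is recovered at read time.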
+
 ## [v0.4.5]

 ### Fixed

Dockerfile

Lines changed: 2 additions & 1 deletion
@@ -33,8 +33,9 @@ RUN \
 python3-setuptools \
 && pip3 install -U pip setuptools packaging \
 && pip3 install -U psycopg2-binary \
+&& pip3 install -U psycopg[binary] \
 && pip3 install -U migra[pg] \
-&& pip3 install poetry==1.1.12 \
+&& pip3 install poetry==1.1.13 \
 && apt-get remove -y apt-transport-https \
 && apt-get -y autoremove \
 && rm -rf /var/lib/apt/lists/*

Dockerfile.dev

Lines changed: 3 additions & 1 deletion
@@ -12,7 +12,9 @@ ENV \

 RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

-RUN pip install poetry==1.1.7
+RUN pip install --upgrade pip && \
+pip install --upgrade poetry==1.1.13 && \
+pip install --upgrade psycopg[binary]

 RUN mkdir -p /opt/src/pypgstac

README.md

Lines changed: 64 additions & 2 deletions
@@ -23,13 +23,53 @@ STAC Client that uses PGStac available in [STAC-FastAPI](https://github.com/stac
 PGStac requires **Postgresql>=13** and **PostGIS>=3**. Best performance will be had using PostGIS>=3.1.

 ### PGStac Settings
-PGStac installs everything into the pgstac schema in the database. You will need to make sure that this schema is set up in the search_path for the database.
+PGStac installs everything into the pgstac schema in the database. This schema must be in the search_path of the PostgreSQL session while using pgstac.

+
+#### PGStac Users
+The pgstac_admin role is the owner of all the objects within pgstac and should be used when running things such as migrations.
+
+The pgstac_ingest role has read/write privileges on all tables and should be used for data ingest or if using the transactions extension with stac-fastapi-pgstac.
+
+The pgstac_read role has read-only access to the items and collections, but will still be able to write to the logging tables.
+
+You can use the roles either directly, by adding a password to them, or by granting them to a role you are already using.
+
+To use directly:
+```sql
+ALTER ROLE pgstac_read LOGIN PASSWORD '<password>';
+```
+
+To grant pgstac permissions to a current postgresql user:
+```sql
+GRANT pgstac_read TO <user>;
+```
+
+#### PGStac Search Path
+The search_path can be set at the database level, at the role level, or within the current session. The search_path is already set if you are directly using one of the pgstac users. If you are not logging in directly as one of the pgstac users, you will need to set the search_path yourself, either by adding it to the role you are using:
+```sql
+ALTER ROLE <user> SET SEARCH_PATH TO pgstac, public;
+```
+or by setting the search_path on the database:
+```sql
+ALTER DATABASE <database> set search_path to pgstac, public;
+```
+
+In psycopg the search_path can be set by passing it as a configuration when creating your connection:
+```python
+kwargs = {
+    "options": "-c search_path=pgstac,public"
+}
+```
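As a slightly fuller sketch of the snippet above, the following shows one way the `options` value might be assembled and handed to `psycopg.connect` (psycopg 3). The helper names `build_connect_kwargs` and `connect` are hypothetical, the extra setting shown in the usage note is only an example, and the sketch assumes that setting values contain no spaces (libpq would require those to be escaped).

```python
from typing import Dict, Optional, Sequence


def build_connect_kwargs(
    search_path: Sequence[str] = ("pgstac", "public"),
    settings: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build the keyword arguments that carry session settings to the server.

    Every setting becomes a "-c name=value" entry in the libpq "options"
    connection parameter. Values are assumed to contain no spaces.
    """
    parts = ["-c search_path=" + ",".join(search_path)]
    for name, value in (settings or {}).items():
        parts.append(f"-c {name}={value}")
    return {"options": " ".join(parts)}


def connect(conninfo: str, settings: Optional[Dict[str, str]] = None):
    """Open a psycopg 3 connection with the pgstac search_path applied."""
    import psycopg  # imported lazily so the helper above works without it

    return psycopg.connect(conninfo, **build_connect_kwargs(settings=settings))
```

For example, `connect("dbname=postgis", {"pgstac.context": "auto"})` would open a session with both the search_path and the context setting applied, where the connection string is a placeholder for your own.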
+
+#### PGStac Settings Variables
 There are additional variables that control the settings used for calculating and displaying context (total row count) for a search, as well as a variable to set the filter language (cql-json or cql2-json).
 The context is "off" by default, and the default filter language is set to "cql2-json".

 Variables can be set either by passing them in via the connection options using your connection library, setting them in the pgstac_settings table or by setting them on the Role that is used to log in to the database.

+Turning "context" on can be **very** expensive on larger databases. Much of what PGStac does is to optimize the search of items sorted by time where fewer than 10,000 records are returned at a time. It does this by searching for the data in chunks and is able to "short circuit" and return as soon as it has the number of records requested. Calculating the context (the total count for a query) requires a scan of all records that match the query parameters and can take a very long time. Setting "context" to auto will use database statistics to estimate the number of rows much more quickly, but for some queries, the estimate may be quite a bit off.
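The chunked, short-circuiting search described above can be pictured with a small Python sketch. It only illustrates the idea and is not PGStac's code: the function name `search_in_chunks`, the in-memory chunks (standing in for time-ordered slices of the data), and the `cloud` field are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Item = Dict[str, int]


def search_in_chunks(
    chunks: Iterable[List[Item]],
    matches: Callable[[Item], bool],
    limit: int,
) -> Tuple[List[Item], int]:
    """Scan chunks in order and stop as soon as `limit` matches are found.

    Returns the matching items and the number of chunks that were scanned,
    so the caller can see how much work the short circuit avoided.
    """
    results: List[Item] = []
    scanned = 0
    if limit <= 0:
        return results, scanned
    for chunk in chunks:
        scanned += 1
        for item in chunk:
            if matches(item):
                results.append(item)
                if len(results) == limit:
                    return results, scanned  # short circuit: later chunks untouched
    return results, scanned


# Three chunks, already ordered the way the results should be returned.
chunks = [
    [{"id": 1, "cloud": 5}, {"id": 2, "cloud": 80}],
    [{"id": 3, "cloud": 10}],
    [{"id": 4, "cloud": 1}],
]
found, scanned = search_in_chunks(chunks, lambda item: item["cloud"] < 20, limit=2)
```

Here the search returns after two chunks and never reads the third. A total count, by contrast, would have to visit every chunk, which is why turning context on is costly.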
+
 Example for updating the pgstac_settings table with a new value:
 ```sql
 INSERT INTO pgstac_settings (name, value)
@@ -41,14 +81,36 @@ ON CONFLICT ON CONSTRAINT pgstac_settings_pkey DO UPDATE SET value = excluded.va
 ```

 Alternatively, update the role:
-```
+```sql
 ALTER ROLE <username> SET SEARCH_PATH to pgstac, public;
 ALTER ROLE <username> SET pgstac.context TO <'on','off','auto'>;
 ALTER ROLE <username> SET pgstac.context_estimated_count TO '<row-count threshold: in auto mode, an estimated count below this number triggers a full count>';
 ALTER ROLE <username> SET pgstac.context_estimated_cost TO '<cost threshold: in auto mode, an estimated query cost (from explain) below this number triggers a full count>';
 ALTER ROLE <username> SET pgstac.context_stats_ttl TO '<an interval string, e.g. "1 day", after which pgstac search will force recalculation of its estimates>';
 ```
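To make the interplay of these settings concrete, here is a rough Python sketch of the decision they describe. It is an assumption-laden illustration rather than PGStac's actual logic: the function name `runs_full_count` is hypothetical, and the text does not say how the two thresholds combine, so the sketch assumes that falling below either one is enough to trigger a full count.

```python
def runs_full_count(
    context: str,
    estimated_count: float,
    estimated_cost: float,
    count_threshold: float,
    cost_threshold: float,
) -> bool:
    """Decide whether a search would pay for an exact total row count."""
    if context == "on":
        return True  # always count, however expensive
    if context == "off":
        return False  # never count
    if context == "auto":
        # ASSUMPTION: either threshold alone is sufficient. The real
        # combination rule may differ.
        return estimated_count < count_threshold or estimated_cost < cost_threshold
    raise ValueError(f"unknown context setting: {context!r}")
```

Under this reading, a query that the planner expects to be small or cheap gets an exact count in auto mode, while large queries fall back to the statistics-based estimate.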

+#### PGStac Partitioning
+By default, PGStac partitions data by collection (note: this is a change starting with version 0.5.0). Each collection can be further partitioned by either year or month. **Partitioning must be set up prior to loading any data!** Partitioning can be configured by setting the partition_trunc flag on a collection in the database.
+```sql
+UPDATE collections set partition_trunc='month' WHERE id='<collection id>';
+```
+
+In general, you should aim to keep each partition under a few hundred thousand rows. Partitioning more finely than needed (e.g. setting everything to 'month' when coarser partitions would already stay below a few hundred thousand rows) can be detrimental.
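The sizing guidance above can be turned into a rough rule of thumb. The Python sketch below is only a heuristic under stated assumptions: the function name `suggest_partition_trunc` is hypothetical, the 300,000-row ceiling is one arbitrary reading of "a few hundred thousand", and items are assumed to be spread evenly over time.

```python
from typing import Optional


def suggest_partition_trunc(
    total_items: int,
    years_covered: float,
    max_rows: int = 300_000,  # assumed reading of "a few hundred thousand"
) -> Optional[str]:
    """Suggest a partition_trunc value: None, "year", or "month".

    Picks the coarsest option that keeps the expected rows per partition
    at or below max_rows, assuming an even spread of items over time.
    """
    if total_items <= max_rows:
        return None  # a single per-collection partition is small enough
    rows_per_year = total_items / max(years_covered, 1)
    if rows_per_year <= max_rows:
        return "year"
    return "month"
```

For example, a collection of 2,000,000 items spanning ten years averages 200,000 rows per year, so the sketch suggests 'year' rather than 'month'.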
+
+#### PGStac Indexes / Queryables
+By default, PGStac includes indexes on the id, datetime, collection, geometry, and the eo:cloud_cover property. Further indexing can be added for additional properties globally or only on particular collections by modifying the queryables table.
+
+Currently, indexing is the only place the queryables table is used, but in future versions it will be extended to provide a queryables backend API.
+
+To add a new global index across all partitions:
+```sql
+INSERT INTO pgstac.queryables (name, property_wrapper, property_index_type)
+VALUES (<property name>, <property wrapper>, <index type>);
+```
+Property wrapper should be one of to_int, to_float, to_tstz, or to_text. The index type should almost always be 'BTREE', but can be any PostgreSQL index type valid for the data type.
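When adding queryables from Python, the wrapper can be validated before the insert is sent. The sketch below builds a parameterized form of the statement above for use with a psycopg cursor. The helper name `queryable_insert` and the example property `platform` are hypothetical, and only the three columns shown in the SQL above are assumed.

```python
from typing import Tuple

# Wrappers listed in the text above.
ALLOWED_WRAPPERS = frozenset({"to_int", "to_float", "to_tstz", "to_text"})

QUERYABLE_INSERT = (
    "INSERT INTO pgstac.queryables (name, property_wrapper, property_index_type) "
    "VALUES (%s, %s, %s);"
)


def queryable_insert(
    name: str, wrapper: str, index_type: str = "BTREE"
) -> Tuple[str, Tuple[str, str, str]]:
    """Return the SQL text and parameters for registering one queryable."""
    if wrapper not in ALLOWED_WRAPPERS:
        raise ValueError(
            f"property wrapper must be one of {sorted(ALLOWED_WRAPPERS)}, got {wrapper!r}"
        )
    return QUERYABLE_INSERT, (name, wrapper, index_type)
```

With an open cursor this could be run as `cur.execute(*queryable_insert("platform", "to_text"))`, letting the driver handle quoting of the values.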
+
+**More indexes are not necessarily better.** You should only index the primary fields that are actively being used to search. Adding too many indexes can be very detrimental to performance and ingest speed. If your primary use case is delivering items sorted by datetime and you do not use the context extension, you likely will not need any further indexes.
+
 ## PyPGStac
 PGStac includes a Python utility for bulk data loading and managing migrations.

pgstac.sql

Lines changed: 3 additions & 1 deletion
@@ -1,12 +1,14 @@
 BEGIN;
 \i sql/001_core.sql
 \i sql/001a_jsonutils.sql
-\i sql/001b_cursorutils.sql
 \i sql/001s_stacutils.sql
 \i sql/002_collections.sql
+\i sql/002a_queryables.sql
+\i sql/002b_cql.sql
 \i sql/003_items.sql
 \i sql/004_search.sql
 \i sql/005_tileutils.sql
 \i sql/006_tilesearch.sql
+\i sql/998_permissions.sql
 \i sql/999_version.sql
 COMMIT;
