* Version 0.5.0
* pypgstac loader improvements
* rework pypgstac, linting, bug fixes
* remove srid lookup
* get tests working with temp database
* update poetry lock
* psycopg copy error on ci
* more tests, update readme
* bug fixes
* change typing for file input to allow iterator
* update poetry lock for black issue, run tests with pytest
* add code to change partitions and migrate data
* fix for assets in includes
* add incremental migration
* add migration, fix atexit

**CHANGELOG.md**: 28 additions, 0 deletions
# Changelog
## [v0.5.0]

Version 0.5.0 is a major refactor of how data is stored. It is recommended to start a new database from scratch and move data over rather than use the built-in migration, which will be very slow for large amounts of data.

### Fixed
### Changed

- The partition layout has been changed from a hardcoded partition per week to nested partitions. The first level is by collection; each collection has an attribute partition_trunc which can be set to NULL (no temporal partitions), month, or year.

- CQL1 and Query code have been refactored to translate to CQL2, reducing duplicated code in query parsing.
- Unused functions have been stripped from the project.

- Pypgstac has been changed to use Fire rather than Typer.

- Pypgstac has been changed to use Psycopg3 rather than Asyncpg to enable easier use in both sync and async contexts.

- Indexing has been reworked to eliminate indexes that, according to logs, were not being used. The global JSON index on properties has been removed. Indexes on individual properties can be added either globally or per collection using the new queryables table.
- Triggers for maintaining partitions have been updated to reduce lock contention and to reflect the new data layout.

- The data pager, which optimizes "order by datetime" searches, has been updated to get time periods from the new partition layout and partition metadata.
- Tests have been updated to reflect the many changes.
### Added

- On ingest, the content of an item is compared to the metadata available at the collection level and duplicate information is stripped out (this is primarily data in the item_assets property). Logic has been added to merge this data back in when the data is used.

**README.md**: 64 additions, 2 deletions
PGStac requires **Postgresql>=13** and **PostGIS>=3**. Best performance will be had using PostGIS>=3.1.
### PGStac Settings
PGStac installs everything into the pgstac schema in the database. This schema must be in the search_path of the PostgreSQL session while using pgstac.
#### PGStac Users
The pgstac_admin role is the owner of all the objects within pgstac and should be used when running things such as migrations.

The pgstac_ingest role has read/write privileges on all tables and should be used for data ingest or if using the transactions extension with stac-fastapi-pgstac.

The pgstac_read role has read-only access to the items and collections, but will still be able to write to the logging tables.

You can use these roles either directly, by adding a password to them, or by granting them to a role you are already using.

To use directly:
```sql
40
+
ALTER ROLE pgstac_read LOGIN PASSWORD '<password>';
```

To grant pgstac permissions to an existing PostgreSQL user:
```sql
GRANT pgstac_read TO <user>;
```
#### PGStac Search Path
The search_path can be set at the database level, at the role level, or within the current session. The search_path is already set if you are logging in directly as one of the pgstac users. If you are not, you will need to add pgstac to the search_path of the user you are using:
```sql
ALTER ROLE <user> SET search_path TO pgstac, public;
```

Setting the search_path on the database:
```sql
ALTER DATABASE <database> SET search_path TO pgstac, public;
```

In psycopg, the search_path can be set by passing it as a configuration option when creating your connection:
```python
kwargs = {
"options": "-c search_path=pgstac,public"
}
```
#### PGStac Settings Variables
There are additional variables that control the settings used for calculating and displaying context (total row count) for a search, as well as a variable to set the filter language (cql-json or cql2-json).
The context is "off" by default, and the default filter language is set to "cql2-json".

Variables can be set either by passing them in via the connection options of your connection library, by setting them in the pgstac_settings table, or by setting them on the role that is used to log in to the database.

Turning "context" on can be **very** expensive on larger databases. Much of what PGStac does is optimize searches of items sorted by time where fewer than 10,000 records are returned at a time. It does this by searching for the data in chunks and is able to "short circuit" and return as soon as it has the number of records requested. Calculating the context (the total count for a query) requires a scan of all records that match the query parameters and can take a very long time. Setting "context" to auto will use database statistics to estimate the number of rows much more quickly, but for some queries the estimate may be quite a bit off.

Example for updating the pgstac_settings table with a new value:
```sql
INSERT INTO pgstac_settings (name, value)
...
ON CONFLICT ON CONSTRAINT pgstac_settings_pkey DO UPDATE SET value = excluded.value;
```
Alternatively, update the role:
```sql
ALTER ROLE <username> SET search_path TO pgstac, public;
ALTER ROLE <username> SET pgstac.context TO <'on','off','auto'>;
ALTER ROLE <username> SET pgstac.context_estimated_count TO '<estimated row count in auto mode below which a full count will be triggered>';
ALTER ROLE <username> SET pgstac.context_estimated_cost TO '<estimated query cost from EXPLAIN in auto mode below which a full count will be triggered>';
ALTER ROLE <username> SET pgstac.context_stats_ttl TO '<an interval string, e.g. "1 day", after which pgstac search will force recalculation of its estimates>';
```
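
The same settings can also be changed for just the current session, which is handy for experimentation. This is a minimal sketch, assuming (as the connection-option and role examples above imply) that pgstac reads these values as ordinary PostgreSQL settings in the pgstac namespace; the values shown are illustrative only:
```sql
-- Session-scoped settings (illustrative values); they apply only to the
-- current connection and are lost when it closes.
SET pgstac.context TO 'auto';
SET pgstac.context_estimated_count TO '100000';
```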
#### PGStac Partitioning
By default PGStac partitions data by collection (note: this is a change starting with version 0.5.0). Each collection can further be partitioned by either year or month. **Partitioning must be set up prior to loading any data!** Partitioning can be configured by setting the partition_trunc flag on a collection in the database.
```sql
UPDATE collections set partition_trunc='month' WHERE id='<collection id>';
```

In general, you should aim to keep each partition under a few hundred thousand rows. Further partitioning (i.e. setting everything to 'month' when it is not needed to keep partitions below a few hundred thousand rows) can be detrimental.
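
To gauge whether further partitioning is warranted, you can check how many rows each partition currently holds. The query below is a sketch only: it assumes the parent partitioned table is pgstac.items (per the layout described above) and reads approximate counts from planner statistics, so run ANALYZE first for up-to-date numbers.
```sql
-- Approximate row count per leaf partition, taken from planner statistics.
SELECT pt.relid::regclass AS partition,
       c.reltuples::bigint AS approx_rows
FROM pg_partition_tree('pgstac.items') pt
JOIN pg_class c ON c.oid = pt.relid
WHERE pt.isleaf
ORDER BY approx_rows DESC;
```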
#### PGStac Indexes / Queryables
By default, PGStac includes indexes on the id, datetime, collection, geometry, and the eo:cloud_cover property. Further indexing can be added for additional properties, either globally or only on particular collections, by modifying the queryables table.

Currently, indexing is the only place the queryables table is used, but in future versions it will be extended to provide a queryables backend API.

To add a new global index across all partitions:
```sql
INSERT INTO pgstac.queryables (name, property_wrapper, property_index_type)
...
```

The property wrapper should be one of to_int, to_float, to_tstz, or to_text. The index type should almost always be 'BTREE', but can be any PostgreSQL index type valid for the data type.
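
As an illustration only (the property name and values below are hypothetical, not taken from this README), a complete statement following the column list above might look like:
```sql
-- Hypothetical example: wrap a numeric property as a float and index it with a BTREE.
INSERT INTO pgstac.queryables (name, property_wrapper, property_index_type)
VALUES ('view:off_nadir', 'to_float', 'BTREE');
```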

**More indexes are not necessarily better.** You should only index the primary fields that are actively used for searching. Adding too many indexes can be very detrimental to performance and ingest speed. If your primary use case is delivering items sorted by datetime and you do not use the context extension, you likely will not need any further indexes.
## PyPGStac
PGStac includes a Python utility for bulk data loading and managing migrations.