
IA Code Review (2019-12-31) #15

@bnewbold


Hi! I'm Internet Archive staff helping out with this project. Hope it is alright for me to leave some direct code review here.

These are comments on the existing state of the repo (commit ad108a0), plus peeking at PR #14, and looking at the uploaded test item: https://archive.org/details/osf-io-p8deu.

IA_consume_logs.py

The current behavior is to fetch JSON log entries via API pagination, and write each page as a separate JSON file. Combining all the log entries into a single JSON file would result in a simpler output file structure. However, this isn't important or blocking.

IA_upload.py

Overall, the upload implementation uses low-level routines in the boto3 S3 library to implement a partially async version of multi-part uploads. At a high level, I would recommend using either the internetarchive python library or boto3's higher-level upload routines to do file transfers. While I don't think either of these is natively async, they can be wrapped in an executor at the item or file level. Implementing chunked uploading yourself (including all the corner cases of retries and partial success) seems fragile and error-prone. Having two upload paths also seems error-prone; the code to set headers is currently duplicated. With boto3, the multipart_threshold config value can control whether any particular upload goes direct or multipart. This is our recommendation, but not a requirement.

The upload function is async. For IA buckets in particular (compared to AWS S3), the boto3 call conn.create_bucket() will probably block for a few seconds. I think it needs to be wrapped in an executor? I'm not very experienced with the python3 async/await patterns, so I may be wrong.
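One way to keep the blocking call off the event loop is `loop.run_in_executor`. A minimal sketch, with a stand-in function in place of the real `conn.create_bucket()` call (so it runs without network access):

```python
import asyncio


def create_bucket_blocking(bucket_name):
    # Stand-in for the blocking boto3 call, e.g. conn.create_bucket(...).
    # For IA buckets this can block for a few seconds.
    return {"bucket": bucket_name}


async def ensure_bucket(bucket_name):
    loop = asyncio.get_running_loop()
    # run_in_executor(None, ...) runs the call on the default
    # ThreadPoolExecutor, so the event loop stays responsive.
    return await loop.run_in_executor(None, create_bucket_blocking, bucket_name)


result = asyncio.run(ensure_bucket("osf-io-p8deu"))
```

The same pattern works for any other synchronous boto3 call the async code needs to make.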

When using the Authentication header, it is especially important to use transport encryption. The current code uses http://, but we strongly recommend using https:// instead.

CHUNK_SIZE defaults to 1000 bytes, which I believe is too small for production use (though it is probably helpful in development). A default of around 15 MB is common; anywhere from 5 MB to 200 MB is probably reasonable.

You are checking for HTTP 503 status and retrying; great!

It looks like .DS_Store OS X files got uploaded as part of the Bag. The directory walking code should probably filter these and any other OS-specific files. I think the internetarchive library may do this automatically.
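A small sketch of how the directory walking code could skip OS junk files (the specific skip lists here are assumptions; extend them as needed):

```python
import os

# Hypothetical skip lists for OS-specific files and directories.
SKIP_FILES = {".DS_Store", "Thumbs.db", "desktop.ini"}
SKIP_DIRS = {"__MACOSX"}


def walk_bag(root):
    """Yield file paths under root, skipping OS-specific junk."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending
        # into the skipped directories at all.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if name not in SKIP_FILES:
                yield os.path.join(dirpath, name)
```

This keeps the filter in one place regardless of which upload path is taken.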

You are setting some upload metadata headers; great! Here are some more you might want to break out as config flags:

  • x-archive-queue-derive (0 or 1): controls whether "derivative" files (like OCR and fulltext search for PDFs) should be run on the files uploaded. IA default is 1.
  • x-archive-keep-old-version (0 or 1): controls whether old copies of files are kept if you re-upload. 1 is safer, but consumes more storage space.
  • x-archive-size-hint (size of file in bytes): hints to our server how large the file you are uploading is. If you upload files of more than 50 GB or so, this is helpful for us to pre-allocate disk space and ensure the upload is successful.
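These could be assembled from config flags in one place, which would also remove the duplicated header-setting code mentioned above. A minimal sketch (the function name and flag names are hypothetical):

```python
def build_ia_headers(queue_derive=False, keep_old_version=True, size_hint=None):
    """Assemble x-archive-* upload headers from config flags.

    Sketch only: flag names and defaults are assumptions, not the
    repo's existing config.
    """
    headers = {
        # 0 skips derivative generation (OCR, fulltext search, etc.)
        "x-archive-queue-derive": "1" if queue_derive else "0",
        # 1 keeps old copies of files on re-upload (safer, more storage)
        "x-archive-keep-old-version": "1" if keep_old_version else "0",
    }
    if size_hint is not None:
        # size in bytes, sent as a string header value
        headers["x-archive-size-hint"] = str(size_hint)
    return headers
```

Both upload paths could then call this one function instead of setting headers independently.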

Chunked Upload Stubs

I noticed that the osf-io-p8deu item had a bunch of chunked upload files left around:

https://archive.org/download/osf-io-p8deu/__ia_spool/
https://ia601401.us.archive.org/21/items/osf-io-p8deu/__ia_spool/2c044e9b-afc0-47f2-a452-d859fbfe83ca_ias3_part_meta.txt

The __ia_spool directory is used by our servers to store partial upload state.

Because these happened a day before the final uploads, and all the files seem to be in the correct place now, I'm guessing that these were due to a bug in an early version of IA_upload.py (for example, the upload-completion call didn't succeed), and our server code kept the chunks around "just in case". Is that safe to assume, or should I check if there was a problem on our end?

BagPack/BagIt Structure

My memory is that in bagpack, metadata files like the datacite.xml should
end up in a top-level metadata/ directory. In the current implementation
they are ending up under data/.

Some of the nesting of directories might also be removable; e.g. instead of
data/p8deu/files/Archive of OSF Storage/originally published e08245.pdf,
it could just be data/files/originally published e08245.pdf.

In both of these cases I'm just commenting. We are mostly agnostic to the
layout and would defer to you or any consulted librarians.
