
IA Code Review (2019-12-31) #15

@bnewbold


Hi! I'm Internet Archive staff helping out with this project. Hope it is alright for me to leave some direct code review here.

These are comments on the existing state of the repo (commit ad108a0), plus peeking at PR #14, and looking at the uploaded test item: https://archive.org/details/osf-io-p8deu.

IA_consume_logs.py

The current behavior is to fetch JSON log entries via API pagination, and write each page as a separate JSON file. Combining all the log entries into a single JSON file would result in a simpler output file structure. However, this isn't important or blocking.

IA_upload.py

Overall, the upload implementation uses low-level routines in the boto3 S3 library to implement a partially async version of multi-part uploads. At a high level, I would recommend using either the internetarchive python library or boto3's higher-level upload routines to do file transfers. While I don't think either of these is natively async, they can be wrapped in an executor at the item or file level. Implementing chunked uploading yourself (including all the corner cases of retries and partial success) seems fragile and error-prone. Having two upload paths also seems error-prone; the code to set headers is currently duplicated. With boto3, the multipart_threshold config value can control whether any particular upload goes direct or multipart. This is our recommendation, but not a requirement.

The upload function is async. For IA buckets in particular (compared to AWS S3), the boto3 call conn.create_bucket() will probably block for a few seconds. I think it needs to be wrapped in an executor? I'm not very experienced with the python3 async/await patterns, so I may be wrong.
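One way to keep the blocking call off the event loop is `loop.run_in_executor`. A minimal sketch, with a stand-in function in place of the real `conn.create_bucket()` call (so it runs without network access):

```python
import asyncio


def create_bucket_blocking(bucket_name):
    # Stand-in for the blocking boto3 call, e.g. conn.create_bucket(...).
    # For IA buckets this can block for a few seconds.
    return {"bucket": bucket_name}


async def ensure_bucket(bucket_name):
    loop = asyncio.get_running_loop()
    # run_in_executor(None, ...) runs the call on the default
    # ThreadPoolExecutor, so the event loop stays responsive.
    return await loop.run_in_executor(None, create_bucket_blocking, bucket_name)


result = asyncio.run(ensure_bucket("osf-io-p8deu"))
```

The same pattern works for any other synchronous boto3 call the async code needs to make.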

When using the Authentication header, it is especially important to use transport encryption. The current code uses http://, but we strongly recommend using https:// instead.

CHUNK_SIZE defaults to 1000 bytes, which I believe is too small for production use (though it is probably helpful in development). A default of around 15 MB is common; anywhere from 5 MB to 200 MB is probably reasonable.

You are checking for HTTP 503 status and retrying; great!

It looks like .DS_Store OS X files got uploaded as part of the Bag. The directory walking code should probably filter these and any other OS-specific files. I think the internetarchive library may do this automatically.
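A small sketch of how the directory walking code could skip OS junk files (the specific skip lists here are assumptions; extend them as needed):

```python
import os

# Hypothetical skip lists for OS-specific files and directories.
SKIP_FILES = {".DS_Store", "Thumbs.db", "desktop.ini"}
SKIP_DIRS = {"__MACOSX"}


def walk_bag(root):
    """Yield file paths under root, skipping OS-specific junk."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending
        # into the skipped directories at all.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if name not in SKIP_FILES:
                yield os.path.join(dirpath, name)
```

This keeps the filter in one place regardless of which upload path is taken.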

You are setting some upload metadata headers; great! Here are some more you might want to break out as config flags:

  • x-archive-queue-derive (0 or 1): controls whether "derivative" files (like OCR and fulltext search for PDFs) should be run on the files uploaded. IA default is 1.
  • x-archive-keep-old-version (0 or 1): controls whether old copies of files are kept if you re-upload. 1 is safer, but consumes more storage space.
  • x-archive-size-hint (size of file in bytes): hints to our server how large the file you are uploading is. If you upload files of more than 50 GB or so, this is helpful for us to pre-allocate disk space and ensure the upload is successful.
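These could be assembled from config flags in one place, which would also remove the duplicated header-setting code mentioned above. A minimal sketch (the function name and flag names are hypothetical):

```python
def build_ia_headers(queue_derive=False, keep_old_version=True, size_hint=None):
    """Assemble x-archive-* upload headers from config flags.

    Sketch only: flag names and defaults are assumptions, not the
    repo's existing config.
    """
    headers = {
        # 0 skips derivative generation (OCR, fulltext search, etc.)
        "x-archive-queue-derive": "1" if queue_derive else "0",
        # 1 keeps old copies of files on re-upload (safer, more storage)
        "x-archive-keep-old-version": "1" if keep_old_version else "0",
    }
    if size_hint is not None:
        # size in bytes, sent as a string header value
        headers["x-archive-size-hint"] = str(size_hint)
    return headers
```

Both upload paths could then call this one function instead of setting headers independently.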

Chunked Upload Stubs

I noticed that the osf-io-p8deu item had a bunch of chunked upload files left around:

https://archive.org/download/osf-io-p8deu/__ia_spool/
https://ia601401.us.archive.org/21/items/osf-io-p8deu/__ia_spool/2c044e9b-afc0-47f2-a452-d859fbfe83ca_ias3_part_meta.txt

The __ia_spool directory is used by our servers to store partial upload state.

Because these happened a day before the final uploads, and all the files seem to be in the correct place now, I'm guessing that these were due to a bug in an early version of IA_upload.py (for example, the upload-completion call didn't succeed), and our server code kept the chunks around "just in case". Is that safe to assume, or should I check if there was a problem on our end?

BagPack/BagIt Structure

My memory is that in bagpack, metadata files like the datacite.xml should
end up in a top-level metadata/ directory. In the current implementation
they are ending up under data/.

Some of the nesting of directories might also be removable; e.g. instead of
data/p8deu/files/Archive of OSF Storage/originally published e08245.pdf,
it could just be data/files/originally published e08245.pdf.

In both of these cases I'm just commenting. We are mostly agnostic to the
layout and would defer to you or any consulted librarians.
