Small PRs to scientific infrastructure

The third time I wrote the same file-by-file download loop against a TRS server, I stopped. GET one file, GET the next, stitch them together locally, repeat. It was tedious, and I was clearly not the first person to write it. Last August I spent a few weekends on trs-filer, the GA4GH Tool Registry Service implementation maintained by the ELIXIR Cloud & AAI group. I'd been around that community since my GSoC days, but back then as a mentor. This was the first time in a while I came back as a contributor. Most of what I shipped is boring. That's the point.

The zip endpoint

TRS stores workflow descriptors, parameter files, and container recipes as individual objects. Before this change, every client that needed the full bundle had to call GET /tools/{id}/versions/{version_id}/{type}/files for each file type, then assemble them locally. So every TRS client reimplemented the loop I got sick of writing. Fixing it once, server-side, felt like the obvious trade.

The implementation builds the zip in memory: Python's zipfile.ZipFile writes into a BytesIO buffer, and the handler returns the finished bytes in one response. It buffers the whole archive rather than truly streaming it. The tricky part is that trs-filer stores files in either MongoDB GridFS or on a local filesystem, depending on configuration. The zip handler had to abstract over both without knowing which one it was talking to:

from io import BytesIO
from zipfile import ZIP_DEFLATED, ZipFile

from flask import Response


def get_tool_files_zip(tool_id: str, version_id: str, file_type: str):
    buffer = BytesIO()
    with ZipFile(buffer, 'w', ZIP_DEFLATED) as zf:
        accessor = get_file_accessor()  # GridFS or LocalFS, from config
        for file_meta in accessor.list_files(tool_id, version_id, file_type):
            data = accessor.read_file(file_meta.id)
            zf.writestr(file_meta.filename, data)
    # The archive is complete once the ZipFile context exits
    return Response(
        buffer.getvalue(),
        mimetype='application/zip',
        headers={
            'Content-Disposition': f'attachment; filename="{tool_id}_{version_id}.zip"'
        }
    )

The FileAccessor interface has two implementations. GridFSAccessor reads from GridFS using gridfs.GridFS(db).get(file_id). LocalFSAccessor reads from a configured directory with pathlib.Path.read_bytes(). The zip handler stays ignorant of both.
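
As a rough sketch of that shape, assuming nothing beyond the two method names the handler calls (list_files and read_file): the FileMeta record, the directory layout inside LocalFSAccessor, and the build_zip helper are hypothetical stand-ins, not trs-filer's actual code. A GridFS-backed class would implement the same two methods.

```python
from dataclasses import dataclass
from io import BytesIO
from pathlib import Path
from typing import Iterable, Protocol
from zipfile import ZIP_DEFLATED, ZipFile


@dataclass(frozen=True)
class FileMeta:
    id: str        # backend-specific identifier (here: a path string)
    filename: str  # name the file gets inside the archive


class FileAccessor(Protocol):
    def list_files(self, tool_id: str, version_id: str, file_type: str) -> Iterable[FileMeta]: ...
    def read_file(self, file_id: str) -> bytes: ...


class LocalFSAccessor:
    """Assumed layout: <root>/<tool_id>/<version_id>/<file_type>/<filename>."""

    def __init__(self, root):
        self.root = Path(root)

    def list_files(self, tool_id, version_id, file_type):
        directory = self.root / tool_id / version_id / file_type
        for path in sorted(directory.iterdir()):
            if path.is_file():
                yield FileMeta(id=str(path), filename=path.name)

    def read_file(self, file_id):
        return Path(file_id).read_bytes()


def build_zip(accessor: FileAccessor, tool_id: str, version_id: str, file_type: str) -> bytes:
    """Zip every file of one type; only the accessor knows where the bytes live."""
    buffer = BytesIO()
    with ZipFile(buffer, "w", ZIP_DEFLATED) as zf:
        for meta in accessor.list_files(tool_id, version_id, file_type):
            zf.writestr(meta.filename, accessor.read_file(meta.id))
    return buffer.getvalue()
```

Because build_zip only depends on the protocol, swapping backends is a matter of passing a different accessor object.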

I wrote the tests before I trusted the abstraction. A pytest fixture seeds a tool version with known files across both backends. The assertion unzips the response body and checks filenames and content against what was seeded:

from io import BytesIO
from zipfile import ZipFile


def test_zip_download(client, seeded_tool):
    resp = client.get(f'/tools/{seeded_tool.id}/versions/{seeded_tool.version}/WDL/files',
                      headers={'Accept': 'application/zip'})
    assert resp.status_code == 200
    # Unzip the response body and compare against the seeded files
    with ZipFile(BytesIO(resp.data)) as zf:
        assert set(zf.namelist()) == {'workflow.wdl', 'inputs.json'}
        assert zf.read('workflow.wdl') == b'version 1.0\n...'

The CI matrix runs this test against both GridFS and local filesystem configurations. Connexion, the OpenAPI framework trs-filer uses, routes on the Accept header: application/zip goes to the new handler, application/json to the existing one. Same URL, two behaviors, no breaking change for anyone already parsing JSON.
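
A minimal sketch of that kind of content negotiation, written as a plain function so it runs without Flask or Connexion. The helper name pick_representation and its simplified Accept parsing (q-values honored, wildcards and unknown types falling back to JSON) are assumptions for illustration; the real service may lean on its framework's own negotiation instead.

```python
def pick_representation(accept_header, default="application/json"):
    """Return 'application/zip' or 'application/json' for an Accept header."""
    supported = ("application/json", "application/zip")
    best_type, best_q = default, 0.0
    for item in (accept_header or "").split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if media_type not in supported:
            continue  # unknown types and wildcards fall through to the default
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if q > best_q:
            best_type, best_q = media_type, q
    return best_type
```

Defaulting to JSON is what keeps existing clients safe: anyone who sends no Accept header, or a generic one, gets exactly what they got before.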

MinIO as a storage backend

The second PR added S3-compatible object storage as a file storage option. The two existing backends both had a ceiling I kept hitting. MongoDB caps a single BSON document at 16 MB. GridFS gets around that by splitting files across chunk documents, but that's GridFS doing it internally, and debugging GridFS cursor issues is its own afternoon. Local filesystem simply doesn't work in multi-replica deployments, because replicas don't share a disk. Neither is a real answer for a 2 GB container image.
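
For a sense of scale, a back-of-the-envelope calculation. It assumes the GridFS default chunk size of 255 KiB (drivers let you change it), and the helper name is made up for this post.

```python
import math

DEFAULT_GRIDFS_CHUNK = 255 * 1024  # bytes; the driver default


def gridfs_chunk_count(file_size, chunk_size=DEFAULT_GRIDFS_CHUNK):
    """Number of chunk documents GridFS writes for a file of file_size bytes."""
    return math.ceil(file_size / chunk_size)


print(gridfs_chunk_count(2 * 1024**3))  # a 2 GiB image -> 8225 chunk documents
```

Thousands of small documents per image is workable, but every one of them passes through the database on the way in and out.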

The S3 backend uses boto3 with a configurable endpoint URL, so it points at MinIO, Ceph, or actual S3 without code changes:

from io import BytesIO

import boto3
from boto3.s3.transfer import TransferConfig


class S3FileAccessor:
    def __init__(self, config):
        self.client = boto3.client(
            's3',
            endpoint_url=config.get('S3_ENDPOINT'),  # MinIO, Ceph, etc.
            aws_access_key_id=config['S3_ACCESS_KEY'],
            aws_secret_access_key=config['S3_SECRET_KEY'],
        )
        self.bucket = config['S3_BUCKET']

    def store_file(self, tool_id, version_id, file_type, filename, data):
        key = f"{tool_id}/{version_id}/{file_type}/{filename}"
        # Objects above the threshold are uploaded as multipart
        self.client.upload_fileobj(
            BytesIO(data),
            self.bucket,
            key,
            Config=TransferConfig(multipart_threshold=8 * 1024 * 1024)
        )

    def read_file_stream(self, tool_id, version_id, file_type, filename):
        key = f"{tool_id}/{version_id}/{file_type}/{filename}"
        response = self.client.get_object(Bucket=self.bucket, Key=key)
        return response['Body']  # StreamingBody, reads on demand

Downloads stream straight from S3 to the HTTP response through the StreamingBody wrapper, so the server never buffers a 2 GB Singularity image in memory. That detail mattered more than I expected. The first version I sketched read the whole object into a bytes buffer, which is fine on a laptop and a disaster on a shared node with other services fighting for RAM. Uploads use multipart with an 8 MB threshold; anything larger gets split into parts and uploaded with configurable concurrency.
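
The difference between the two versions can be sketched with a plain generator. This assumes only that the body is file-like with read(n), which botocore's StreamingBody is; the function name and the chunk sizes are arbitrary choices for illustration. In Flask, a generator like this can be passed as the Response body so bytes go out as they arrive.

```python
from io import BytesIO


def iter_body(body, chunk_size=1024 * 1024):
    """Yield a file-like object's content in fixed-size pieces.

    Peak memory is one chunk, not the whole object.
    """
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Stand-in for response['Body']; any object with read(n) behaves the same.
fake_body = BytesIO(b"x" * 2500)
sizes = [len(chunk) for chunk in iter_body(fake_body, chunk_size=1024)]
print(sizes)  # [1024, 1024, 452]
```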

FOCA's MONGO_URI patch

FOCA is the Flask microservice framework most of these GA4GH services are built on. It assembled MongoDB connection strings from individual config fields: host, port, databaseName. That's fine until you point it at a managed provider. Atlas and Azure Cosmos DB's API for MongoDB hand you a full connection URI with auth credentials, replica set options, tls=true, retryWrites=true, and read preferences already baked in:

mongodb+srv://user:pass@cluster0.abc123.mongodb.net/?retryWrites=true&w=majority&tls=true

You can't decompose that into host and port without losing the SRV lookup, the TLS setting, and the write concern. I tried, briefly, before admitting the assembly approach was the wrong shape for anything managed.
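
To see what falls on the floor, here is a small stdlib-only sketch that splits the example URI the way a host/port config would. It uses urllib.parse rather than pymongo purely so it runs anywhere, and the variable names are illustrative.

```python
from urllib.parse import parse_qs, urlsplit

uri = "mongodb+srv://user:pass@cluster0.abc123.mongodb.net/?retryWrites=true&w=majority&tls=true"
parts = urlsplit(uri)

# What a host/port config can represent
kept = {"host": parts.hostname, "port": parts.port}

# What it has no field for
dropped = {
    "scheme": parts.scheme,  # '+srv' means: resolve hosts via DNS SRV records
    "options": {k: v[0] for k, v in parse_qs(parts.query).items()},
}

print(kept)
print(dropped)
```

Note that there is no port to extract at all: with an SRV URI, the driver discovers hosts and ports from DNS.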

The fix was two commits. First, accept a MONGO_URI environment variable that bypasses assembly entirely:

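import os

from pymongo import MongoClient, uri_parser


def create_mongo_client(config):
    uri = os.environ.get('MONGO_URI')
    if uri:
        client = MongoClient(uri)
        # Explicit databaseName wins; otherwise use the database in the URI path
        db_name = config.get('databaseName') or uri_parser.parse_uri(uri).get('database')
        if not db_name:
            # Neither source named a database; fail with a readable message
            raise ValueError('Set databaseName or include a database in MONGO_URI')
    else:
        # Original host/port assembly path
        client = MongoClient(host=config['host'], port=config['port'])
        db_name = config['databaseName']
    return client[db_name]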

The database name comes from the explicit databaseName config if it's set; otherwise pymongo.uri_parser.parse_uri pulls it out of the URI path component. A URI without a database is valid, so one of the two has to supply it. The old path is completely untouched, so existing deployments don't notice the change at all. That was the whole design constraint: add the new door without moving the old one.

None of this is glamorous. But this software runs federated genomics analysis across European compute centers, maintained by maybe a dozen part-time people. A zip endpoint or a connection string fix has a weirdly high impact-to-effort ratio next to a drive-by PR on some huge project, where your change sits in a queue behind 400 others and nobody's short-handed enough to care. Boring is where the payoff is. I keep coming back for exactly that.