Most of the open source work I'm proud of is boring. Last August I spent a few weekends on trs-filer, a GA4GH Tool Registry Service implementation maintained by the ELIXIR Cloud & AAI group. I'd been around that community since my GSoC days, but this was the first time in a while I'd come back as a contributor rather than a mentor.
The zip endpoint
TRS stores workflow descriptors, parameter files, and container recipes as individual objects. Before this change, every client that needed the full bundle had to call GET /tools/{id}/versions/{version_id}/{type}/files for each file type, then assemble them locally. Every TRS client reimplemented the same download loop.
The implementation builds the archive in memory: Python's zipfile.ZipFile writes into a BytesIO buffer, and the handler returns the buffer's contents in a single response, so it buffers rather than truly streams. The tricky part is that trs-filer stores files in either MongoDB GridFS or on a local filesystem, depending on configuration. The zip handler needed to abstract over both:
from io import BytesIO
from zipfile import ZIP_DEFLATED, ZipFile

from flask import Response


def get_tool_files_zip(tool_id: str, version_id: str, file_type: str):
    buffer = BytesIO()
    with ZipFile(buffer, 'w', ZIP_DEFLATED) as zf:
        accessor = get_file_accessor()  # GridFS or LocalFS, from config
        for file_meta in accessor.list_files(tool_id, version_id, file_type):
            data = accessor.read_file(file_meta.id)
            zf.writestr(file_meta.filename, data)
    # The archive is complete once the with-block exits; getvalue() returns
    # the whole buffer regardless of the current position.
    return Response(
        buffer.getvalue(),
        mimetype='application/zip',
        headers={
            'Content-Disposition': f'attachment; filename={tool_id}_{version_id}.zip'
        }
    )
The FileAccessor interface has two implementations: GridFSAccessor reads from GridFS using gridfs.GridFS(db).get(file_id).read(), and LocalFSAccessor reads from a configured directory with pathlib.Path.read_bytes(). The zip handler doesn't know which backend it's talking to.
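To make the abstraction concrete, here is a minimal sketch of the idea, not the trs-filer code. The method signatures are simplified (no tool or version IDs), InMemoryAccessor is a hypothetical stand-in for the GridFS-backed class so the example runs without a database, and zip_from_accessor and demo are names made up for illustration.

```python
import tempfile
from io import BytesIO
from pathlib import Path
from typing import Dict, Iterable, Protocol
from zipfile import ZIP_DEFLATED, ZipFile


class FileAccessor(Protocol):
    """Hypothetical interface: list stored filenames and read one as bytes."""

    def list_files(self) -> Iterable[str]: ...

    def read_file(self, name: str) -> bytes: ...


class LocalFSAccessor:
    """Reads files from a directory on local disk."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def list_files(self) -> Iterable[str]:
        return sorted(p.name for p in self.root.iterdir() if p.is_file())

    def read_file(self, name: str) -> bytes:
        return (self.root / name).read_bytes()


class InMemoryAccessor:
    """Stand-in for a GridFS-backed accessor so the sketch needs no database."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self.files = files

    def list_files(self) -> Iterable[str]:
        return sorted(self.files)

    def read_file(self, name: str) -> bytes:
        return self.files[name]


def zip_from_accessor(accessor: FileAccessor) -> bytes:
    """Build a zip archive in memory from whatever backend is passed in."""
    buffer = BytesIO()
    with ZipFile(buffer, "w", ZIP_DEFLATED) as zf:
        for name in accessor.list_files():
            zf.writestr(name, accessor.read_file(name))
    return buffer.getvalue()


def demo() -> Dict[str, bytes]:
    """Zip two files from a temporary directory and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "workflow.wdl").write_bytes(b"version 1.0\n")
        (root / "inputs.json").write_bytes(b"{}")
        archive = zip_from_accessor(LocalFSAccessor(root))
    with ZipFile(BytesIO(archive)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}
```

Because zip_from_accessor only relies on the two methods, swapping LocalFSAccessor for InMemoryAccessor (or a real GridFS-backed class) needs no change to the zip code.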
Tests use a pytest fixture that seeds a tool version with known files across both backends. The assertion unzips the response body and checks filenames and content against the seeded data:
from io import BytesIO
from zipfile import ZipFile


def test_zip_download(client, seeded_tool):
    resp = client.get(
        f'/tools/{seeded_tool.id}/versions/{seeded_tool.version}/WDL/files',
        headers={'Accept': 'application/zip'},
    )
    assert resp.status_code == 200
    # Unzip the response body and compare against the seeded files.
    with ZipFile(BytesIO(resp.data)) as zf:
        assert set(zf.namelist()) == {'workflow.wdl', 'inputs.json'}
        assert zf.read('workflow.wdl') == b'version 1.0\n...'
The CI matrix runs this test against both GridFS and local filesystem configurations. Connexion (the OpenAPI framework trs-filer uses) routes the request based on the Accept header: application/zip goes to the new handler, application/json goes to the existing one.
MinIO as a storage backend
The second PR added S3-compatible object storage as a file storage option. MongoDB caps a single BSON document at 16 MB, and GridFS works around that by splitting each file into chunks (255 kB by default). Larger files do fit, but you are then relying on GridFS internals, and debugging GridFS cursor issues is its own thing. Local filesystem doesn't work in multi-replica deployments because replicas don't share disk.
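To put a number on the chunking, here is a small arithmetic sketch. It assumes GridFS's default chunk size of 255 KiB; gridfs_chunk_count is a hypothetical helper, not part of any library.

```python
# GridFS default chunk size: 255 KiB.
DEFAULT_CHUNK_SIZE = 255 * 1024
# MongoDB's maximum BSON document size: 16 MiB.
MAX_BSON_DOCUMENT = 16 * 1024 * 1024


def gridfs_chunk_count(file_size: int, chunk_size: int = DEFAULT_CHUNK_SIZE) -> int:
    """Number of chunk documents needed to hold a file of file_size bytes."""
    # Integer ceiling division: a partial final chunk still needs a document.
    return -(-file_size // chunk_size)
```

With the default chunk size, a 2 GiB file ends up as 8225 separate chunk documents, each far below the 16 MiB document limit.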
The S3 backend uses boto3 with a configurable endpoint URL:
from io import BytesIO

import boto3
from boto3.s3.transfer import TransferConfig


class S3FileAccessor:
    def __init__(self, config):
        self.client = boto3.client(
            's3',
            endpoint_url=config.get('S3_ENDPOINT'),  # MinIO, Ceph, etc.
            aws_access_key_id=config['S3_ACCESS_KEY'],
            aws_secret_access_key=config['S3_SECRET_KEY'],
        )
        self.bucket = config['S3_BUCKET']

    def store_file(self, tool_id, version_id, file_type, filename, data):
        key = f"{tool_id}/{version_id}/{file_type}/{filename}"
        # Payloads above the threshold are uploaded as multipart.
        self.client.upload_fileobj(
            BytesIO(data),
            self.bucket,
            key,
            Config=TransferConfig(multipart_threshold=8 * 1024 * 1024)
        )

    def read_file_stream(self, tool_id, version_id, file_type, filename):
        key = f"{tool_id}/{version_id}/{file_type}/{filename}"
        response = self.client.get_object(Bucket=self.bucket, Key=key)
        return response['Body']  # StreamingBody, reads on demand
Downloads stream directly from S3 to the HTTP response via the StreamingBody wrapper, so the server never buffers a 2 GB Singularity image in memory. Uploads use multipart upload with an 8 MB threshold; anything larger gets split into parts and uploaded with configurable concurrency.
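The streaming side boils down to reading the body in fixed-size pieces and handing them on as they arrive. A minimal sketch of that pattern follows; iter_body and the 64 KiB chunk size are assumptions for illustration, not taken from the PR, and a BytesIO can stand in for the S3 body because both expose read(size).

```python
from io import BytesIO
from typing import BinaryIO, Iterator

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, not taken from the project


def iter_body(body: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield a file-like body piece by piece instead of reading it whole."""
    while True:
        chunk = body.read(chunk_size)
        if not chunk:  # an empty read means the body is exhausted
            break
        yield chunk
```

A Flask Response accepts an iterable of bytes as its body, so a generator like this can be passed straight to it, and only one chunk is held in memory at a time.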
FOCA's MONGO_URI patch
FOCA is the Flask microservice framework most of these GA4GH services are built on. It assembled MongoDB connection strings from individual config fields: host, port, databaseName. The problem is that managed MongoDB providers (Atlas, Azure Cosmos DB's API for MongoDB) hand you a connection URI with auth credentials, replica set options, tls=true, retryWrites=true, and read preferences baked in:
mongodb+srv://user:<password>@<cluster-host>/?retryWrites=true&w=majority&tls=true
You can't decompose that into host and port without losing the SRV lookup, the TLS setting, and the write concern.
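A quick way to see what gets lost is to pull such a URI apart with the standard library. This is only an illustration: the host and credentials below are placeholders, describe_uri is a hypothetical helper, and urllib.parse is used here in place of pymongo's own parser.

```python
from urllib.parse import parse_qs, urlsplit

# Placeholder URI; the host and credentials are made up for this example.
EXAMPLE_URI = (
    "mongodb+srv://user:password@cluster.example.net/"
    "?retryWrites=true&w=majority&tls=true"
)


def describe_uri(uri: str) -> dict:
    """Split a connection URI into the pieces a host/port config can and cannot hold."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,   # "mongodb+srv" is what triggers the SRV lookup
        "host": parts.hostname,
        "port": parts.port,       # None: SRV URIs carry no port
        "options": {k: v[0] for k, v in parse_qs(parts.query).items()},
    }
```

A host/port/databaseName config has a slot for only the host field; the scheme and everything under options have nowhere to go.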
The fix was two commits. First, accept a MONGO_URI environment variable that bypasses assembly entirely:
import os

from pymongo import MongoClient, uri_parser


def create_mongo_client(config):
    uri = os.environ.get('MONGO_URI')
    if uri:
        client = MongoClient(uri)
        # Extract database name from URI if not explicitly configured
        db_name = config.get('databaseName') or uri_parser.parse_uri(uri).get('database')
    else:
        # Original host/port assembly path
        client = MongoClient(host=config['host'], port=config['port'])
        db_name = config['databaseName']
    return client[db_name]
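pymongo.uri_parser.parse_uri extracts the database name from the URI path component. An explicitly configured databaseName takes precedence; the URI's database is the fallback when none is configured. A URI with no database is valid, and in that case parse_uri reports None, so one of the two sources has to supply a name. The old path is completely untouched for existing deployments.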
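The name resolution can be sketched in isolation. This is an illustration under assumptions, not FOCA's code: resolve_db_name is a hypothetical helper, urllib.parse stands in for pymongo's parser so the example runs without pymongo, and the explicit error for a missing name is an addition of this sketch.

```python
from typing import Optional
from urllib.parse import urlsplit


def resolve_db_name(configured: Optional[str], uri: str) -> str:
    """Pick the database name: explicit config first, then the URI path."""
    # The path of "mongodb://host/mydb?opt=1" is "/mydb"; a bare "/" means no database.
    from_uri = urlsplit(uri).path.lstrip("/") or None
    name = configured or from_uri
    if not name:
        raise ValueError("no database name in config or URI")
    return name
```

Failing loudly when neither source provides a name is a design choice of this sketch; it surfaces the misconfiguration at startup rather than at the first query.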
None of this is glamorous. But this software runs federated genomics analysis across European compute centers, maintained by maybe a dozen part-time people. A zip endpoint or a connection string fix has a weirdly high impact-to-effort ratio compared to drive-by PRs on some huge project where your change sits in a queue behind 400 others.