The month we self-hosted everything

In December 2025 I went on a self-hosting spree at Farmako. We moved off Typesense Cloud first, then Metabase Cloud, then stood up an observability stack to watch the first two. The whole thing started with a pricing page, and pricing pages make people do rash things. I was one of those people.

Typesense on GKE

Typesense powers our medicine search. Someone types "dolo" and we have to surface the right paracetamol brand in milliseconds, with typo tolerance, because people type drug names from memory of a doctor's handwriting. Typesense Cloud worked fine. It was just expensive for what is, underneath, a single stateful binary with a data directory. Once I said that sentence out loud I could not unsay it.

The manifest is a StatefulSet with one replica. We do not need HA here, because the index rebuilds from Postgres in under a minute. The PVC uses an SSD-backed storage class (pd-ssd on GKE) because Typesense mmaps its index files and read latency feeds straight into p99 query latency:

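apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: typesense
spec:
  serviceName: typesense
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: typesense
  template:
    metadata:
      labels:
        app: typesense
    spec:
      containers:
        - name: typesense
          image: typesense/typesense:27.1
          # $(TYPESENSE_API_KEY) in args only expands if the variable is defined here.
          # The Secret name and key below are placeholders.
          env:
            - name: TYPESENSE_API_KEY
              valueFrom:
                secretKeyRef:
                  name: typesense-secrets
                  key: api-key
          args:
            - --data-dir=/data
            - --api-key=$(TYPESENSE_API_KEY)
            - --memory-limit-mb=768
          resources:
            requests:
              memory: 512Mi
              cpu: 250m
            limits:
              memory: 1Gi
              cpu: 1000m
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        # "ssd" is our own StorageClass name, backed by pd-ssd
        storageClassName: ssd
        resources:
          requests:
            storage: 10Gi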

The --memory-limit-mb=768 flag is the one that matters, and I learned that the hard way (more on that below). Without it, Typesense mmaps aggressively and the resident set climbs until the kernel OOM-kills it. With it, Typesense runs internal compaction before it hits the ceiling. The container limit sits at 1Gi to leave 256MB of headroom above the Typesense limit for the OS page cache and process overhead.

The re-index job is a nightly CronJob. The strategy is alias-based atomic swaps:

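import time

# `client` (a typesense.Client), `pg_conn` (a Postgres connection whose
# execute() yields rows), and `format_for_typesense` are defined elsewhere.

# 1. Create new collection with timestamp suffix
new_name = f"medicines_{int(time.time())}"
client.collections.create({
    "name": new_name,
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "generic", "type": "string", "optional": True},
        {"name": "manufacturer", "type": "string", "facet": True},
        {"name": "mrp", "type": "float"},
        {"name": "in_stock", "type": "bool", "facet": True},
    ]
})

# 2. Bulk import from Postgres
rows = pg_conn.execute("SELECT * FROM medicines WHERE active = true")
documents = [format_for_typesense(row) for row in rows]
results = client.collections[new_name].documents.import_(documents, {"action": "create"})

# A bulk import can report per-document failures in its return value
# without raising, so check them before touching the alias
failed = [r for r in results if not r.get("success")]
if failed:
    raise RuntimeError(f"{len(failed)} documents failed to import into {new_name}")

# 3. Atomic alias swap
client.aliases.upsert("medicines", {"collection_name": new_name})

# 4. Drop old collections (everything with the prefix except the new one)
old_collections = [c for c in client.collections.retrieve()
                   if c["name"].startswith("medicines_") and c["name"] != new_name]
for old in old_collections:
    client.collections[old["name"]].delete()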

Search always queries the medicines alias. During a re-index the alias keeps pointing at the old collection until the swap, and the swap is atomic from the query path's point of view. If the import dies halfway, the alias never moved and nobody notices. That property is the whole reason I did it this way instead of reindexing in place.
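The cleanup step is the one place where a mistake would delete the live index, so it is worth isolating. Here is a minimal sketch of the selection logic as a pure function. The name stale_collections and its parameters are mine, for illustration only; nothing here talks to Typesense, it just works on a list of collection names.

```python
from typing import Iterable, List


def stale_collections(names: Iterable[str], live: str,
                      prefix: str = "medicines_") -> List[str]:
    """Return the timestamped collections that are safe to drop after a swap.

    `names` is every collection name on the server and `live` is the
    collection the alias now points at. Collections without the prefix
    are never returned, and neither is the live one.
    """
    return sorted(n for n in names if n.startswith(prefix) and n != live)


if __name__ == "__main__":
    # Illustrative names only
    existing = ["medicines_1733000000", "medicines_1733086400", "pharmacies"]
    print(stale_collections(existing, live="medicines_1733086400"))
```

Keeping this as a function of plain strings means it can be unit-tested without a running Typesense node, which matters for the one piece of the job that deletes data.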

Metabase self-hosted

Metabase prices per seat. A pharmacy operations team is a lot of people who each need to glance at one dashboard once a day. Paying per seat for that is painful. Self-hosted Metabase runs in a Deployment, not a StatefulSet, because its state lives in its own Postgres database rather than on local disk.

The migration off Metabase Cloud was mostly boring in the good way: pg_dump of the application database, pg_restore into the self-hosted instance, update the connection strings. Then I hit SAML SSO with Keycloak, and boring ended. Metabase's SAML integration expects specific attribute names: http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress for email, http://schemas.xmlsoap.org/ws/2005/05/identity/claims/givenname for first name. Keycloak's default SAML mappers emit different URIs.

The fix lives in Keycloak client mapper config. For each attribute you create a "SAML User Attribute" mapper, set SAML Attribute NameFormat to URI Reference, and set SAML Attribute Name to the exact URI Metabase wants. Finding that out was miserable, because Metabase silently ignores attributes whose URIs do not match. No error. No warning. Just an empty user profile staring back at you. I gave up guessing and intercepted the SAML response with a browser extension to read the attribute names straight out of the XML. That is the only reason I found it.
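If you would rather not squint at XML in a browser extension, the same check can be scripted. What follows is a rough sketch, assuming the HTTP-POST binding (where the SAMLResponse form field is plain base64) and an unencrypted assertion. The function names and the EXPECTED set are hypothetical helpers of mine, not part of Metabase or Keycloak.

```python
import base64
import xml.etree.ElementTree as ET

# Standard SAML 2.0 assertion namespace
SAML_NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}

# The attribute URIs named above; extend with whatever else you map
EXPECTED = {
    "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress",
    "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/givenname",
}


def attribute_names(saml_response_b64: str) -> set:
    """Decode a base64 SAMLResponse and return the Name of every Attribute."""
    root = ET.fromstring(base64.b64decode(saml_response_b64))
    return {a.get("Name") for a in root.iterfind(".//saml:Attribute", SAML_NS)}


def missing_attributes(saml_response_b64: str, expected=EXPECTED) -> set:
    """Return the expected attribute URIs that the response does not carry."""
    return set(expected) - attribute_names(saml_response_b64)
```

Paste the SAMLResponse value from the browser's network tab into missing_attributes and anything it returns is a mapper that still emits the wrong name.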

Observability

The stack is Grafana, Prometheus, Loki, and Alertmanager, deployed with the kube-prometheus-stack Helm chart. Prometheus scrapes Typesense's /metrics endpoint (request latency histograms, index size, memory RSS, active connections) and Metabase's JMX metrics through a jmx_exporter sidecar.

Here is the alert that paid for the entire stack. Three weeks after the migration, a Prometheus rule on container_memory_working_set_bytes{container="typesense"} / on(namespace, pod, container) kube_pod_container_resource_limits{resource="memory", container="typesense"} > 0.8 fired. Typesense was creeping toward its 1Gi limit because I had not set the --memory-limit-mb flag yet. Grafana showed a clean upward ramp over 12 hours, the kind of slope that ends in an OOM kill at 3am. We added the flag, deployed, and watched the memory curve flatten into a sawtooth as compaction started doing its job. No customer ever saw it. That is exactly the outcome the stack was supposed to buy, and it bought it inside a month.
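The numbers line up in a way worth spelling out. A small sketch of the same ratio check, assuming the flag's 768 is read as MiB and that the working set settles near that ceiling (in practice page cache can push it higher). The 900 figure is made up for illustration, and the function names are mine.

```python
MIB = 1024 * 1024


def memory_ratio(working_set_bytes: int, limit_bytes: int) -> float:
    """Working set as a fraction of the container memory limit."""
    return working_set_bytes / limit_bytes


def should_alert(working_set_bytes: int, limit_bytes: int,
                 threshold: float = 0.8) -> bool:
    """Mirror the alert condition: fire when the ratio exceeds the threshold."""
    return memory_ratio(working_set_bytes, limit_bytes) > threshold


if __name__ == "__main__":
    limit = 1024 * MIB  # the 1Gi container limit
    # At the 768 ceiling the ratio is 0.75, below the 0.8 threshold
    print(memory_ratio(768 * MIB, limit), should_alert(768 * MIB, limit))
    # An illustrative runaway value well past the threshold
    print(should_alert(900 * MIB, limit))
```

So with the flag set, a healthy steady state sits just under the alert line, and the rule only fires when something pushes memory past what Typesense itself is supposed to allow.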

Would I recommend self-hosting in general? It comes down to team size, and I will not pretend otherwise. The cloud versions exist so you do not need a person who thinks about disks. We had one, me, so the math worked. If your infra person is also your only backend engineer and your only on-call, pay the SaaS bill and go build your actual product.