Using Gemini to list gemstones on Etsy for my mom

My mom sells gemstones and crystals. She is very good at sourcing them and genuinely terrible at writing product listings. Every stone needs a title, a description, SEO keywords, healing properties (yes, buyers expect those), sizing, a quality grade, and six clean photos in an Etsy-friendly grid. I watched her spend an hour on a single stone. She has hundreds of stones. The math was depressing.

So I built her a tool. She photographs a stone, the image goes to Gemini on Vertex AI, and a full structured listing comes back. A second pass turns the one photo into a six-image product set.

Structured output with JSON schema

The interesting problem here is getting reliable structured data out of a multimodal model. Gemini's API takes a response_schema parameter that constrains output to valid JSON matching a schema you provide. This is not "please format as JSON" prompt engineering, which I tried first and which lies to you about one listing in ten. The decoding itself is constrained by the schema during token generation, so the model physically cannot emit invalid structure.

schema = {
    "type": "object",
    "required": ["title", "gemstone_type", "color", "transparency",
                  "cut_shape", "is_dyed", "quality_grade", "chakra",
                  "description", "seo_title", "seo_description"],
    "properties": {
        "title": {"type": "string", "maxLength": 140},
        "gemstone_type": {"type": "string"},
        "color": {"type": "string"},
        "transparency": {
            "type": "string",
            "enum": ["transparent", "translucent", "opaque"]
        },
        "cut_shape": {"type": "string"},
        "size_estimate": {"type": "string"},
        "is_dyed": {"type": "boolean"},
        "quality_grade": {
            "type": "string",
            "enum": ["AAA", "AA", "A", "B", "C"]
        },
        "chakra": {
            "type": "array",
            "items": {
                "type": "string",
                "enum": ["root", "sacral", "solar_plexus",
                         "heart", "throat", "third_eye", "crown"]
            }
        },
        "healing_properties": {"type": "string"},
        "description": {"type": "string"},
        "care_tip": {"type": "string"},
        "origin_story": {"type": "string"},
        "seo_title": {"type": "string", "maxLength": 70},
        "seo_description": {"type": "string", "maxLength": 160}
    }
}

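import json

from vertexai.generative_models import GenerationConfig

# model, image_part, and prompt are defined elsewhere in the script.
response = model.generate_content(
    [image_part, prompt],
    generation_config=GenerationConfig(
        response_mime_type="application/json",
        response_schema=schema,
    ),
)
listing = json.loads(response.text)  # parse the schema-constrained JSON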

The enum constraints are the part that earns its keep. quality_grade can only be one of five values. chakra can only hold valid chakra names. transparency is one of three. Drop those constraints and the model happily invents categories like "semi-translucent" or "near-opaque" that map onto none of Etsy's attribute filters. I know because my early version did exactly that, and my mom's listings quietly stopped showing up under the filters buyers actually use.
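Even with the schema doing the enforcement, a cheap check after parsing is a useful tripwire if the schema ever gets loosened or a listing gets edited by hand. A minimal sketch, assuming the allowed values are copied from the schema; check_enums and the two lookup tables are hypothetical names for illustration, not code from the script:

```python
# Allowed values copied from the schema; names here are illustrative only.
ENUMS = {
    "transparency": ("transparent", "translucent", "opaque"),
    "quality_grade": ("AAA", "AA", "A", "B", "C"),
}
CHAKRAS = ("root", "sacral", "solar_plexus",
           "heart", "throat", "third_eye", "crown")


def check_enums(listing: dict) -> list[str]:
    """Return one readable problem per out-of-vocabulary value."""
    problems = []
    for field, allowed in ENUMS.items():
        value = listing.get(field)
        if value not in allowed:
            problems.append(f"{field}: {value!r} is not one of {list(allowed)}")
    for chakra in listing.get("chakra", []):
        if chakra not in CHAKRAS:
            problems.append(f"chakra: {chakra!r} is not a known chakra")
    return problems
```

An empty list means every constrained field holds a value that Etsy's filters can match. Anything else is worth logging before the listing goes live.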

The is_dyed boolean exists because my mom insists on honesty, and she is right. The surprise was how good the model is at spotting treated stones from a photo. Dyed agate has a color saturation pattern that natural coloring does not. Heat-treated amethyst sold as citrine has a specific orange-yellow that real citrine never quite hits. I tested it against 50 known-treated and 50 natural stones. It catches the treatment about 80% of the time. Not perfect, but better than a human eyeballing a phone photo.
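The 80% figure is a catch rate on the treated stones. The same labeled set can also give a false-alarm rate on the natural ones, which matters because a wrong is_dyed flag undersells a natural stone. A small sketch of that arithmetic, assuming the ground truth and the model's is_dyed answers sit in two parallel lists; detection_rates is a hypothetical helper and the sample values are made up for illustration:

```python
def detection_rates(is_treated: list[bool],
                    predicted_dyed: list[bool]) -> tuple[float, float]:
    """Return (catch rate on treated stones, false-alarm rate on natural stones)."""
    if len(is_treated) != len(predicted_dyed):
        raise ValueError("labels and predictions must be the same length")
    treated = [p for t, p in zip(is_treated, predicted_dyed) if t]
    natural = [p for t, p in zip(is_treated, predicted_dyed) if not t]
    if not treated or not natural:
        raise ValueError("need at least one treated and one natural stone")
    return sum(treated) / len(treated), sum(natural) / len(natural)


# Toy data, not the real test set: five treated stones, then five natural ones.
labels = [True] * 5 + [False] * 5
answers = [True, True, True, True, False, False, False, False, True, False]
catch_rate, false_alarm_rate = detection_rates(labels, answers)
```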

Marketplace-safe description generation

Healing properties are a minefield. Etsy will flag a listing that makes medical claims ("amethyst cures headaches"), but the same buyers expect metaphysical descriptions and filter by them. You cannot skip it and you cannot overdo it. The prompt tells the model to hedge every healing reference: "traditionally associated with," "believed by crystal practitioners to," "historically used for." The schema cannot enforce that, since it is free text, so I added a post-processing check that rejects anything containing banned phrases:

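import re

BANNED = ["cures", "heals", "treats", "medical", "therapy",
          "prescription", "diagnosis", "clinical"]
# Match whole words only, so "obscures" or "retreats" do not trip the check.
BANNED_RE = re.compile(r"\b(?:" + "|".join(BANNED) + r")\b", re.IGNORECASE)

def validate_description(text: str) -> bool:
    return BANNED_RE.search(text) is None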

When validation fails, the tool calls Gemini again with the rejection reason appended to the prompt. In practice that fires about 5% of the time, usually when the model gets too specific about a supposed benefit. The retry almost always clears it.
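The retry can be sketched as a small loop. This is an illustration under assumptions, not the script's actual code: generate_with_retry, the wording of the rejection note, and the cap of three attempts are all hypothetical, and the generate callable stands in for the Gemini call so the loop can be exercised without Vertex AI.

```python
from typing import Callable, Optional


def generate_with_retry(
    generate: Callable[[str], dict],
    validate: Callable[[str], bool],
    prompt: str,
    max_attempts: int = 3,  # assumed cap; the post does not state one
) -> Optional[dict]:
    """Call generate until the description validates or attempts run out."""
    current_prompt = prompt
    for _ in range(max_attempts):
        listing = generate(current_prompt)
        if validate(listing["description"]):
            return listing
        # Append the rejection reason so the next attempt knows what to avoid.
        current_prompt = (
            prompt
            + "\n\nThe previous description was rejected for making a medical claim. "
            + "Rewrite it using only hedged, traditional-association language."
        )
    return None  # caller decides what to do, e.g. flag for manual editing
```

Returning None instead of raising keeps a batch run going when one stone refuses to produce a clean description.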

Image grid generation

The image pipeline is Pillow, with OpenCV and SciPy for the detail detection. Etsy's gallery does best with a set layout: first image is a clean product shot on a neutral background, second is a scale reference (a hand holding the stone), third through fifth are detail angles, sixth is a lifestyle shot. We only start with one photo, so the tool fabricates the rest:

Hero crop: center-crop to 1:1, normalize white balance, bump contrast +10%
Detail crops: four quadrant crops at 2x zoom, each centered on the highest-detail region found via Laplacian variance
Grid composite: arrange the six into a 2x3 grid at 2000x3000 pixels (the short side matches the 2000 pixels Etsy recommends)
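The hero crop and grid composite steps can be sketched with Pillow alone. Some assumptions to flag: hero_crop and compose_grid are hypothetical names, "+10% contrast" is read as an enhancement factor of 1.10, the white-balance step is left out because the post does not say how it is done, and the canvas uses the same off-white as the listing backgrounds.

```python
from PIL import Image, ImageEnhance, ImageOps

BACKGROUND = "#F5F5F5"  # off-white canvas color


def hero_crop(img: Image.Image, side: int = 1000) -> Image.Image:
    """Center-crop to a square and raise contrast by 10%."""
    square = ImageOps.fit(img.convert("RGB"), (side, side), centering=(0.5, 0.5))
    return ImageEnhance.Contrast(square).enhance(1.10)


def compose_grid(tiles: list[Image.Image], cols: int = 2, rows: int = 3,
                 size: tuple[int, int] = (2000, 3000)) -> Image.Image:
    """Paste up to cols*rows tiles, row by row, onto an off-white canvas."""
    canvas = Image.new("RGB", size, BACKGROUND)
    cell_w, cell_h = size[0] // cols, size[1] // rows
    for i, tile in enumerate(tiles[: cols * rows]):
        fitted = ImageOps.fit(tile.convert("RGB"), (cell_w, cell_h))
        canvas.paste(fitted, ((i % cols) * cell_w, (i // cols) * cell_h))
    return canvas
```

With the defaults each cell is 1000x1000 pixels, so a hero crop drops into the grid without resampling. Any cell without a tile stays off-white.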

The detail-region detection is not clever, and I like that it is not. Compute the Laplacian of the grayscale image, threshold it, find connected components, sort by area, crop around the largest ones. It reliably lands on the most textured part of the stone instead of cropping dead center on nothing.

import cv2
import numpy as np
from scipy import ndimage

gray = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2GRAY)  # img: RGB PIL image
laplacian = cv2.Laplacian(gray, cv2.CV_64F)
variance_map = ndimage.uniform_filter(laplacian**2, size=50)  # local texture energy
# Find top-4 peaks in variance map for detail crops
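The peak-finding step that the last comment refers to could look something like this. It is a sketch using greedy non-maximum suppression in NumPy, not necessarily what the script does; top_peaks and the min_distance default are hypothetical.

```python
import numpy as np


def top_peaks(variance_map: np.ndarray, k: int = 4,
              min_distance: int = 100) -> list[tuple[int, int]]:
    """Pick the k highest points of the map, at least min_distance pixels apart."""
    work = variance_map.astype(np.float64, copy=True)
    peaks = []
    for _ in range(k):
        row, col = np.unravel_index(np.argmax(work), work.shape)
        if not np.isfinite(work[row, col]):
            break  # everything left has already been suppressed
        peaks.append((int(row), int(col)))
        # Blank out a square neighborhood so the next peak lands elsewhere.
        r0, r1 = max(0, row - min_distance), row + min_distance + 1
        c0, c1 = max(0, col - min_distance), col + min_distance + 1
        work[r0:r1, c0:c1] = -np.inf
    return peaks
```

Each returned (row, col) pair is a candidate center for one detail crop. The suppression window keeps the four crops from all landing on the same patch of texture.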

The backgrounds are a neutral off-white (#F5F5F5) on purpose. It makes a real difference in how a small seller's shop reads next to the operations with studio lighting. Buyers judge fast, and a gray-tinged photo loses before they read a word.

Is the code elegant? No. It is a 400-line Python script that calls Vertex AI and writes files to disk. But it took my mom from an hour per stone down to about five minutes, and she never has to open the code to use it. I will take an ugly script that a non-programmer uses every day over an elegant one that only I can run.