Training a Flux LoRA of My Face

I spent two days trying to generate professional-looking headshots of myself by training a personal Flux LoRA and running it through a ComfyUI inference pipeline on RunPod. The intent was practical: a LinkedIn photo without booking a photographer. The actual experience was a tour through every failure mode of the open-source generative-image stack circa May 2026.

This is the log. What worked, what didn’t, what I’d do differently. Saving it so the next attempt picks up from here instead of from scratch.

The shape of the pipeline

Three stages, all driven by a small hsg CLI:

hsg prepare runs locally. Resizes my photos, generates captions with Florence-2-large, strips identity descriptors from the captions so the LoRA learns identity from pixels rather than from words, and prepends a trigger token (ohwx_jh) to each caption.
hsg train rents a RunPod GPU, rsyncs the dataset up, runs ai-toolkit against Flux.1-dev for 2800 steps at rank 32, then pulls all the checkpoints down.
hsg generate rents a different GPU, runs ComfyUI with my LoRA, generates images with FaceDetailer and a 2x upscale, downloads them.

Each stage launches a fresh pod, does its work, destroys the pod. Nothing persistent between runs. The local machine is just the controller.
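The caption step in hsg prepare can be sketched roughly as follows. This is a minimal illustration, not the actual hsg code: the descriptor patterns and the comma separator after the trigger token are assumptions, and a real pattern list would be tuned against what Florence-2-large actually emits for the dataset.

```python
import re

TRIGGER = "ohwx_jh"

# Hypothetical descriptor patterns. Anything that describes the person
# (age, hair, facial hair) is removed so identity is learned from pixels.
IDENTITY_PATTERNS = [
    r"\b(?:young|middle-aged|older|elderly)\b",
    r"\b(?:bearded|clean-shaven|bald|balding)\b",
    r"\bwith (?:short|long|curly|dark|brown|blond|gray)(?: \w+)? hair\b",
    r"\bwith a (?:beard|mustache|goatee)\b",
]


def prepare_caption(raw: str, trigger: str = TRIGGER) -> str:
    """Strip identity descriptors from a caption and prepend the trigger token."""
    text = raw
    for pattern in IDENTITY_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text)          # collapse gaps left by removals
    text = re.sub(r"\s+([,.])", r"\1", text)  # no stray space before punctuation
    text = text.strip(" ,")
    return f"{trigger}, {text}"
```

For example, prepare_caption("A young man with short brown hair sitting in an office.") returns "ohwx_jh, A man sitting in an office.", which keeps the scene description and drops the words that would otherwise compete with the trigger token for identity.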

That was the plan.

What actually happened, stage by stage

Stage 0: every pod silently failed

The first three pod launches sat in “pending” for 25 minutes each, charging me for hardware that was never actually doing anything. The RunPod API reported desiredStatus: RUNNING while runtime: null and dockerId: None, which I read as “still booting.” It wasn’t. The container was failing to start because the Docker image name in my config was wrong: ostris/ai-toolkit:latest doesn’t exist on Docker Hub. The real image is ostris/aitoolkit (no hyphen). My code had been written speculatively and never validated end-to-end.

The failure was silent because RunPod’s API doesn’t surface image-pull errors. The console’s per-pod Logs tab does, but you have to know to look there.

Lesson: when a pod sits at runtime: null for more than 2-3 minutes, the container is almost certainly not booting. Don’t wait. Kill it and check the console’s per-pod Logs tab.
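That lesson can be turned into a hard deadline in the launcher. The sketch below assumes a hypothetical fetch_pod callable that returns the pod record as a dict with a "runtime" key, mirroring the fields described above. It deliberately does not call any real RunPod client function, so wiring it to the actual API is left open.

```python
import time
from typing import Callable, Dict


class PodBootTimeout(RuntimeError):
    """Raised when a pod never reports a runtime within the deadline."""


def wait_for_runtime(
    fetch_pod: Callable[[], Dict],          # hypothetical: returns the pod record
    deadline_s: float = 180.0,              # 3 minutes, per the lesson above
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll until the pod has a non-null runtime, or fail loudly."""
    start = clock()
    while True:
        pod = fetch_pod()
        if pod.get("runtime") is not None:
            return pod
        if clock() - start >= deadline_s:
            raise PodBootTimeout(
                f"runtime still null after {deadline_s:.0f}s; "
                "check the console Logs tab for an image-pull error"
            )
        sleep(poll_s)
```

The caller would catch PodBootTimeout, terminate the pod, and stop, rather than paying for 25 minutes of a container that was never going to start.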

Stage 1: the training image path was wrong

Fixed the image name. Pod came up. SSH worked. Then training failed because my code tried to run python /ai-toolkit/run.py and the actual path inside the image was /app/ai-toolkit/run.py. A trivial fix, but it took two more cold starts to discover and confirm.

Stage 2: macOS ships an ancient rsync

The rsync_up call used --info=progress2, a flag added in rsync 3.1.0. macOS ships rsync 2.6.9 from 2006. Changed it to --progress, which works on both old and new rsync.
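An alternative to always falling back to --progress is to pick the flag from the local rsync version. This is a sketch under the assumption that --info=progress2 is available from rsync 3.1.0 onward. The function names are illustrative and not part of hsg.

```python
import re
import subprocess
from typing import Tuple


def parse_rsync_version(banner: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from the output of `rsync --version`."""
    match = re.search(r"version\s+(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        raise ValueError("could not find an rsync version in the banner")
    return (int(match[1]), int(match[2]), int(match[3] or 0))


def progress_flag(version: Tuple[int, int, int]) -> str:
    """Use the whole-transfer progress display only where rsync supports it."""
    return "--info=progress2" if version >= (3, 1, 0) else "--progress"


def detect_progress_flag(rsync_bin: str = "rsync") -> str:
    """Ask the local rsync binary for its version and choose a flag."""
    result = subprocess.run(
        [rsync_bin, "--version"], capture_output=True, text=True, check=True
    )
    return progress_flag(parse_rsync_version(result.stdout))
```

On a stock macOS rsync 2.6.9 this resolves to --progress, and on a Homebrew or Linux rsync 3.1+ it resolves to --info=progress2.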

Stage 3: HF auth failed silently

Training proceeded to the model-download step and crashed with “model not found.” The actual error was authentication: Flux.1-dev is a gated repository on Hugging Face, and my code was calling huggingface-cli login --token $HF_TOKEN || true which silently no-opped in the non-interactive SSH shell. Then ai-toolkit tried to download the gated model, got a 404 (HF’s generic “no auth” response), and the user-facing error said “model not found” rather than “your token didn’t work.”

The fix was to write the token directly into ~/.cache/huggingface/token and add an explicit whoami() check that crashes loudly if auth fails. Better failure mode.
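A minimal sketch of that fix, assuming the huggingface_hub package is installed on the pod. The function names and the injectable verify argument are illustrative, not the actual hsg code. whoami(token=...) is the real huggingface_hub call and returns a dict that includes the account name.

```python
from pathlib import Path
from typing import Callable, Optional


def default_verify(token: str) -> str:
    """Ask the Hub who this token belongs to; raises if the token is rejected."""
    from huggingface_hub import whoami  # third-party, imported lazily
    return whoami(token=token)["name"]


def install_hf_token(
    token: str,
    home: Optional[Path] = None,
    verify: Callable[[str], str] = default_verify,
) -> str:
    """Write the token where huggingface_hub looks for it, then verify it."""
    token = (token or "").strip()
    if not token:
        raise SystemExit("HF_TOKEN is empty; refusing to continue")
    token_path = (home or Path.home()) / ".cache" / "huggingface" / "token"
    token_path.parent.mkdir(parents=True, exist_ok=True)
    token_path.write_text(token)
    token_path.chmod(0o600)  # keep the secret readable by the owner only
    try:
        return verify(token)
    except Exception as exc:  # any failure here should stop the run
        raise SystemExit(f"Hugging Face auth failed: {exc}") from exc
```

Note that a successful whoami only proves the token is valid. Access to a gated repository such as Flux.1-dev also depends on the account having accepted the model’s license, so a separate check against the repository itself would close that gap.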

Stage 4: training actually worked

Once auth held, training ran clean. 2800 steps in about 2 hours 15 minutes on an A100, loss descended cleanly from ~0.5 to ~0.38, no instability. Sample images every 250 steps showed identity locking in around step 500 and continuing to sharpen through the end. The final LoRA at step 2750 produced samples that looked unmistakably like me, with the moles on my cheek in the right places and my actual hair pattern visible. The training side of the pipeline was the easiest part of the whole exercise.

Stage 5: inference image was broken

Everything up to here was a warmup. The inference stage was where the real difficulty lived.

The first image I tried, hearmeman/comfyui-flux-template:v4, exists on Docker Hub and gets 13,000 pulls per month. It also doesn’t work right now: ComfyUI crashes on startup with ModuleNotFoundError: No module named 'comfy_aimdo'. This comes from an upstream regression in ComfyUI, committed Jan 31, 2026, that affects any image whose boot script runs git pull against the ComfyUI repo. The hearmeman image was built before the regression but pulls fresh code at boot, so it hits the broken state.

I burned an hour trying to diagnose this before discovering the actual error by SSHing into the pod and reading /comfyui_<podid>_nohup.log. RunPod’s API didn’t surface this. The pod showed as RUNNING. ComfyUI was failing silently.
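Since the pod status says nothing about whether ComfyUI is alive, an application-level probe is a more useful readiness signal. The sketch below requests ComfyUI’s /system_stats route and treats anything other than a 200 response with a JSON body as "not up". The base URL is whatever address the pod exposes for ComfyUI, which is an assumption left to the caller.

```python
import http.client
import json
import urllib.error
import urllib.request


def comfyui_is_up(base_url: str, timeout: float = 5.0) -> bool:
    """Return True only if ComfyUI itself answers, not merely the pod."""
    url = base_url.rstrip("/") + "/system_stats"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            json.loads(resp.read().decode("utf-8"))  # must be valid JSON
            return True
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error, bad URL, or non-JSON body.
        return False
```

Polling this for a few minutes after boot, and dumping the nohup log over SSH when it never turns true, would have shortened the hour of diagnosis considerably.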

Stage 6: the second image had different problems

Switched to valyriantech/comfyui-with-flux:05042026, which has dependencies baked into the image layer rather than fetched at boot. ComfyUI started cleanly. But:

The image’s start.sh ignored my HF_TOKEN and downloaded flux1-kontext-dev.safetensors (Kontext, not Krea) into the wrong directory.
rsync wasn’t installed in the image, so the LoRA upload failed.
ComfyUI rejected my workflow JSON with three different validation errors: it didn’t have t5xxl_fp16.safetensors (only fp8), didn’t have ae.safetensors (only ae.sft), and the UltimateSDUpscale node required a batch_size parameter that wasn’t in my workflow.

Each of these was a one-line fix. Each cost a new cold start to discover. Cold start for this image is 13-15 minutes because the image is 35GB and the pod has to pull it from Docker Hub every time. The cumulative wait was real.
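All three validation errors are detectable before a workflow is submitted. The sketch below checks an API-format workflow (node id mapped to class_type and inputs) against a set of filenames known to exist on the pod and a table of required inputs per node type. The FILE_INPUTS set and the required-inputs table are assumptions for illustration. In practice both could be derived from what the running ComfyUI instance reports about its nodes.

```python
from typing import Dict, Iterable, List, Set

# Assumed names of inputs that refer to model files on disk.
FILE_INPUTS: Set[str] = {
    "vae_name", "clip_name1", "clip_name2", "unet_name", "lora_name", "ckpt_name",
}


def find_workflow_problems(
    workflow: Dict[str, dict],
    available_files: Set[str],
    required_inputs: Dict[str, Iterable[str]],
) -> List[str]:
    """List missing files and missing required inputs, one message each."""
    problems: List[str] = []
    for node_id, node in workflow.items():
        cls = node.get("class_type", "?")
        inputs = node.get("inputs", {})
        for name in required_inputs.get(cls, ()):
            if name not in inputs:
                problems.append(f"node {node_id} ({cls}): missing input '{name}'")
        for name, value in inputs.items():
            if name in FILE_INPUTS and isinstance(value, str) and value not in available_files:
                problems.append(f"node {node_id} ({cls}): file '{value}' not on pod")
    return problems
```

Run against the original workflow, this would report ae.safetensors, t5xxl_fp16.safetensors, and the missing batch_size in one pass instead of one cold start each.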

I eventually wrote a separate script that connected to an existing warm pod and ran generation against it, so I could iterate at 30 seconds per workflow attempt instead of 15 minutes.
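The core of such a warm-pod script is small. The sketch below queues an API-format workflow on an already-running ComfyUI instance through its POST /prompt route. The base URL, the function names, and the split between building and sending the request are illustrative choices, not the script I actually wrote.

```python
import json
import urllib.request
import uuid
from typing import Optional


def build_prompt_request(
    base_url: str, workflow: dict, client_id: Optional[str] = None
) -> urllib.request.Request:
    """Build the POST /prompt request without sending it."""
    payload = {"prompt": workflow, "client_id": client_id or uuid.uuid4().hex}
    return urllib.request.Request(
        base_url.rstrip("/") + "/prompt",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(base_url: str, workflow: dict, timeout: float = 30.0) -> str:
    """Queue the workflow on a warm pod and return ComfyUI's prompt_id."""
    request = build_prompt_request(base_url, workflow)
    # A workflow that fails validation is expected to surface here as an
    # HTTPError whose body describes the offending nodes.
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["prompt_id"]
```

Because the pod and its models are already loaded, each attempt costs one HTTP round trip plus generation time, which is where the 30-second loop comes from.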

Stage 7: first images

Once everything connected, the first batch of 8 images took about 12 minutes to generate at ~90 seconds per image on A100. The first image to land was unmistakably me, in the right office setting, in athletic clothing. The face was correct enough that someone who didn’t know me well would not pause. People who knew me well would catch a slight smoothness and a vaguely posed quality that made it read as AI-generated.

The remaining 7 images of the batch showed the same pattern: identity was solid, framing was tighter and more “studio” than I wanted, skin had a subtle gloss.

Stage 8: the iteration loop, briefly

I rewrote the prompts for wider framing, layered clothing (open quarter-zip over a t-shirt), and explicitly anti-bokeh (“deep depth of field, background mostly in focus”) to fight the DSLR-portrait default. Dropped the FaceDetailer denoise from 0.3 to 0.18 to stop the post-pass from re-smoothing skin. Dropped guidance from 3.5 to 2.8.

I also wrote a post-processing script that takes a clean AI-generated image and degrades it toward “candid iPhone snapshot” with light blur, sensor noise, and JPEG compression. The hypothesis was that the most common AI tells (over-rendered skin, too-clean backgrounds, too-sharp focus) hide behind the compression and noise that real phone photos always have. The first version of the script over-shot and produced security-camera footage. The dialed-back version was closer.

Batch 2 came out and identity fidelity dropped noticeably. The wider framing meant the face occupied fewer pixels, so the LoRA had less surface area to imprint, and the result looked less like me than batch 1. The skin gloss was reduced, but at the cost of likeness.

This is where I stopped for the day.

What I learned

Some of this is generic to the field. Some is specific to the open-source stack as it stands in May 2026.

Generic:

LoRA strength is the highest-leverage knob. The default 0.9 favors strong identity at the cost of variation. Dropping to 0.85 gives more pose and expression variety. Dropping below 0.8 starts losing the face. Training is binary in a sense: you either have a working LoRA or you don’t. At inference time, though, the strength dial is continuous and you have to actively tune it per shot.
A character LoRA learns everything you didn’t strip from the data, not just what you intended. My training set was 39 selfies of me in a hoodie under fluorescent office light. The LoRA learned my face. It also learned “iPhone front-camera selfie aesthetic in an office.” That secondary learning is what produces the gluey, posed quality that’s hard to prompt away. The dataset is the model. More variety in the dataset (different cameras, different lighting, different distances, different clothing) would have produced a more flexible LoRA.
Sample images during training are not the real output. ai-toolkit’s sample prompts are generic placeholder text and don’t reflect what your generation pipeline will actually do. Judge the LoRA on inference output, not on training samples.
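Tuning the strength dial per shot is easier as a sweep than as one-off edits. The sketch below takes an API-format ComfyUI workflow and produces one copy per strength value by rewriting strength_model on the LoRA loader nodes. The set of node types is an assumption about which loaders the workflow uses, and the default values are the ones discussed above.

```python
import copy
from typing import Dict, Iterable

# Assumed LoRA loader node types; both expose a strength_model input.
LORA_NODE_TYPES = {"LoraLoader", "LoraLoaderModelOnly"}


def with_lora_strength(workflow: Dict[str, dict], strength: float) -> Dict[str, dict]:
    """Return a deep copy of the workflow with the LoRA model strength replaced."""
    variant = copy.deepcopy(workflow)
    touched = 0
    for node in variant.values():
        if node.get("class_type") in LORA_NODE_TYPES:
            node["inputs"]["strength_model"] = strength
            touched += 1
    if touched == 0:
        raise ValueError("workflow has no LoRA loader node")
    return variant


def strength_sweep(
    workflow: Dict[str, dict], strengths: Iterable[float] = (0.80, 0.85, 0.90)
) -> Dict[float, Dict[str, dict]]:
    """Map each strength value to its own workflow variant."""
    return {s: with_lora_strength(workflow, s) for s in strengths}
```

Keeping the seed and prompt fixed across the sweep isolates the effect of strength, so the point where the face starts to slip is visible side by side.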

Specific to the 2026 stack:

Flux is a guidance-distilled model. CFG runs at 1.0 internally. Negative prompts have almost no effect at the default distilled-guidance settings. Hours of received SDXL wisdom about loading negatives with “plastic, waxy, oversmoothed, 3d render” is essentially decorative here. Move the intent into positive prompts as descriptions of what you want.
The community-maintained ComfyUI images on Docker Hub are not production-grade. They break, the breakage is silent, and the breakage is rarely the image’s own fault. More often it’s a regression in some dependency the image pulls at boot. Pin everything you can pin, including the image tag itself.
Cold-starting a fresh inference pod for every generation batch is the wrong design. RunPod offers a “network volume” feature that gives the pod a persistent disk. Models live on the disk, the pod attaches the disk at boot, no re-download. First boot 30 minutes, every subsequent boot under 2 minutes. I didn’t set this up because I didn’t expect to need it. I do.
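The network-volume trade-off is simple arithmetic. The sketch below uses the figures from this log as rough inputs: about 15 minutes per cold start without a volume, and 30 minutes for the first boot followed by about 2 minutes for each later boot with one. Treating "under 2 minutes" as exactly 2 is an assumption that slightly favors the cold-start side.

```python
from typing import Optional


def total_boot_minutes(n_boots: int, first_boot_min: float, later_boot_min: float) -> float:
    """Total minutes spent waiting on boots across n_boots pod launches."""
    if n_boots <= 0:
        return 0.0
    return first_boot_min + (n_boots - 1) * later_boot_min


def break_even_boots(
    cold_min: float, volume_first_min: float, volume_later_min: float, limit: int = 1000
) -> Optional[int]:
    """Smallest number of boots at which the volume setup wins, if any."""
    for n in range(1, limit + 1):
        with_volume = total_boot_minutes(n, volume_first_min, volume_later_min)
        without_volume = total_boot_minutes(n, cold_min, cold_min)
        if with_volume < without_volume:
            return n
    return None


# Ten launches: 150 minutes of cold starts versus 48 minutes with a volume.
print(total_boot_minutes(10, 15, 15), total_boot_minutes(10, 30, 2))
# The volume pays for itself by the third boot.
print(break_even_boots(15, 30, 2))
```

Under these numbers the setup cost is recovered within a single debugging session.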

Specific to “make the photo look like a real photo”:

The 2026 AI tells, in order of how much they trigger the “this is AI” reaction for a casual viewer: glossy skin, symmetric direct-camera stare, professional studio bokeh, hands in frame, drawn-looking hair, generic over-rendered features.
Most of these are fightable in the prompt, except hands, which are still partially broken in 2026.
The fights you have to pick consciously: Flux defaults toward polished, your face-LoRA defaults toward the dataset’s most common framing, the post-processing pipeline defaults toward smoother, the upscaler defaults toward sharper. Every default in the stack pulls toward “AI portrait.” You’re not making a photograph, you’re swimming upstream against a current.

The unexpected lesson:

If I’d known going in that the dominant failure mode would be “the AI tells in skin and framing,” I’d have spent more time on the data layer and less on the generation layer. A more diverse dataset, 60-80 photos with real cameras and varied contexts, probably matters more than any post-hoc prompt engineering.

What I’d do differently next time

In order of expected impact:

Better dataset. Reshoot. 50-70 photos with at least three different cameras (real DSLR, friend’s phone, candid taken-by-someone-else), three different lighting setups (window, outdoor, fluorescent), at least three different outfits, at least one with motion or a real laugh. The LoRA can only generalize as far as the data supports.
Set up a RunPod network volume before doing inference. The cold-start cost dominates everything else right now and it’s avoidable.
Pin image tags and version every dependency. Both the training image and the inference image should be specific tags with verified dates, not :latest.
Write the validation script first. Before any real run, a smoke test that boots the pod, runs the smallest possible workflow, confirms each integration point. Would have surfaced the rsync issue, the ae.sft issue, and the ComfyUI workflow validation issues in 5 minutes instead of 5 hours.
Iterate on the warm pod. The pattern I eventually arrived at (keep the pod alive on error, re-run generation against it) saved me hours once I had it. Should have built it in from the start.
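The tag-pinning item is cheap to enforce in the config loader. The sketch below rejects image references that have no tag (which Docker resolves to :latest) or that name :latest explicitly, and accepts specific tags and sha256 digests. The function name is illustrative.

```python
def is_pinned(image_ref: str) -> bool:
    """True if the image reference names a specific tag or digest."""
    if "@sha256:" in image_ref:
        return True  # digest pins are the strictest form
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means an implicit :latest
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


for ref in (
    "ostris/aitoolkit",
    "ostris/aitoolkit:latest",
    "valyriantech/comfyui-with-flux:05042026",
):
    print(ref, "pinned" if is_pinned(ref) else "NOT pinned")
```

A pinned tag is necessary but not sufficient. hearmeman/comfyui-flux-template:v4 passes this check and still broke, because its boot script pulls fresh ComfyUI code at start, which is what the smoke test is for.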

Where it stands

I have a trained LoRA, a working end-to-end pipeline, a post-processing script, and a clear list of dials to turn on the next iteration. The LoRA is good. The infra is precarious but functional. The output is somewhere between “passable for a stranger on LinkedIn” and “obviously AI to anyone looking closely,” and the gap between those is one or two more iteration rounds of prompt and dataset work.

I’ll come back to it. Not today.