From Audio Interviews to Structured Data

Combining Whisper, speaker diarization, and a local LLM into a one-command pipeline

Ruby · WhisperX · pyannote · LM Studio

The Problem

A researcher has 40+ hours of interview audio sitting on a hard drive. Each recording is a conversation between an interviewer and one narrator. The goal is a CSV where every row is one utterance — either a question or a response — tagged with the narrator's ID and a topic code. From there, the researcher can group, count, quote, and compare across the corpus.

Doing this by hand is the bottleneck. Off-the-shelf transcription gets you part of the way: it produces a wall of text. Diarization gets you a different part of the way: it labels who was talking, but not very reliably. Neither output is "research-ready" on its own.

The fabricated example used throughout this write-up is an oral history archive: interviewers sitting down with narrators to record stories about migration, family, work, faith, and place. The actual project this pipeline was built for is private; the technique is what's worth sharing.

The Goal

One command: ruby pipeline.rb ./interviews. In: a folder of MP3s. Out: a validated CSV where every row is one utterance, attributed to either the interviewer or the narrator, tagged with a topic code from a closed list.

Pipeline Overview

Four stages, each handling one job:

MP3 folder
   │
   ▼
WhisperX service  ──►  raw.txt   (verbatim, no labels)
                  ──►  diarized.txt   (with speaker labels)
   │
   ▼
LLM (Gemma via LM Studio)
   - aligns raw vs diarized
   - decides true speaker from context
   - emits CSV rows with topic codes
   │
   ▼
Per-row validation  ──►  master_output.csv

The Ruby script (pipeline.rb) is the glue. It calls a local HTTP transcription service, calls a local LLM, validates the output, and appends to one master CSV. Everything runs on a LAN, with no data leaving the network.

Step 1: Naming as Metadata

The filenames themselves carry the metadata the LLM needs later. The convention is [narrator_id]_[theme_tag][anything].mp3:

1042_Migration.mp3
1051_Family.mp3
1062_War_session2.mp3
1078_Faith.mp3

One regex pulls out both fields and the rest of the filename is ignored, so the researcher can add notes to the filename without breaking the parser:

FILENAME_PATTERN = /\A(\d+)_([A-Za-z]+)/

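# Returns the metadata encoded in the filename, or nil if it doesn't match.
def parse_filename(filepath)
  basename = File.basename(filepath, File.extname(filepath))
  match = basename.match(FILENAME_PATTERN)
  return nil unless match
  {
    narrator_id: match[1],
    theme: match[2].upcase,
    filepath: filepath,  # the main loop reads entry[:filepath]
    base_path: File.join(File.dirname(filepath), basename)  # path without extension
  }
end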

This is a small choice that pays off later: the narrator ID and the theme tag flow all the way through the pipeline into every CSV row, with no separate metadata file to keep in sync.
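
The main loop later in this write-up iterates over a list of parsed entries. Here is a minimal sketch of how that list could be built. The helper name collect_entries, the flat-folder assumption, and the extensions parameter are mine, not part of the original script. The pattern is passed in as a default argument so the sketch runs on its own.

```ruby
# Hypothetical helper: builds the entries list the main loop iterates over.
# Assumes a flat folder. Files with other extensions (including the transcript
# files written next to the audio) and files that don't match the naming
# convention are skipped.
def collect_entries(dir, extensions: %w[.mp3], pattern: /\A(\d+)_([A-Za-z]+)/)
  Dir.children(dir).sort.filter_map do |name|
    next unless extensions.include?(File.extname(name).downcase)

    basename = File.basename(name, File.extname(name))
    match = basename.match(pattern)
    next unless match  # filter_map drops the nil

    {
      narrator_id: match[1],
      theme: match[2].upcase,
      filepath: File.join(dir, name),
      base_path: File.join(dir, basename)
    }
  end
end
```

Filtering on extension matters because the transcripts land in the same folder: 1042_Migration.raw.txt would otherwise match the pattern too.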

Step 2: A Self-Hosted WhisperX Service

WhisperX wraps OpenAI Whisper with forced alignment and speaker diarization (via pyannote). Running it as a CLI works fine for one file. Running it as a small HTTP service makes more sense once you have dozens, because:

The model only loads into GPU memory once.
The Ruby pipeline doesn't need Python in its environment.
Other tools on the LAN can use the same transcriber.

The service exposes four endpoints: POST /jobs to submit a file, GET /jobs/:id to poll status, and GET /jobs/:id/result/{txt,json} to fetch results. The Ruby client is small:

# Needs require "net/http" and require "json" at the top of the script.
def submit_transcription_job(filepath)
  uri = URI("#{TRANSCRIBE_HOST}/jobs")
  boundary = "----RubyFormBoundary#{rand(1_000_000)}"
  filename = File.basename(filepath)
  file_data = File.binread(filepath)

  # Join the parts as binary strings so the audio bytes and a non-ASCII
  # filename concatenate without an Encoding::CompatibilityError.
  body = [
    "--#{boundary}\r\n",
    "Content-Disposition: form-data; name=\"file\"; filename=\"#{filename}\"\r\n",
    "Content-Type: application/octet-stream\r\n\r\n",
    file_data,
    "\r\n--#{boundary}--\r\n"
  ].map(&:b).join

  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "multipart/form-data; boundary=#{boundary}"
  request.body = body

  http = Net::HTTP.new(uri.host, uri.port)
  http.read_timeout = 300  # seconds
  response = http.request(request)
  return nil unless response.is_a?(Net::HTTPSuccess)  # no job was created

  JSON.parse(response.body)["job_id"]
end

def poll_job(job_id)
  uri = URI("#{TRANSCRIBE_HOST}/jobs/#{job_id}")
  loop do
    result = JSON.parse(Net::HTTP.get(uri))
    case result["status"]
    when "done"  then return true
    when "error" then return false
    else sleep POLL_SECS
    end
  end
end

Two output files matter. The .diarized.txt has speaker labels, in the format pyannote produces. The .raw.txt is derived from the JSON segments — a clean verbatim transcript with no labels, suitable for quoting:

json_text = Net::HTTP.get(URI("#{TRANSCRIBE_HOST}/jobs/#{job_id}/result/json"))
File.write("#{base_path}.result.json", json_text)

raw_text = JSON.parse(json_text)["segments"]
  .map { |s| s["text"] }
  .join(" ")
  .gsub(/  +/, " ")
  .strip
File.write("#{base_path}.raw.txt", raw_text)

Transcription is the slowest step, so the pipeline skips it when both transcript files already exist and are non-empty. That makes re-runs cheap when iterating on the LLM prompt.
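
A minimal sketch of that skip check, assuming the .raw.txt and .diarized.txt suffixes used in this write-up. The helper name transcripts_ready? is hypothetical.

```ruby
# Hypothetical helper: true only when both transcript files exist and are
# non-empty. File.size? returns nil for a missing or zero-byte file.
def transcripts_ready?(base_path)
  %w[.raw.txt .diarized.txt].all? do |suffix|
    File.size?("#{base_path}#{suffix}")
  end
end
```

The transcription step can then return early when transcripts_ready?(base_path) is true, before anything is submitted to the service.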

Step 3: Why Diarization Alone Isn't Enough

Diarization is the part everyone underestimates. It is confident, and it is often wrong in ways that break downstream coding. Two failure modes show up over and over:

Failure mode A: identity drift

pyannote assigns SPEAKER_00 and SPEAKER_01 at the start of the recording. Halfway through, after a long pause, it sometimes flips the assignment. Now the interviewer is labelled SPEAKER_01 for the rest of the file.

Failure mode B: role confusion

When the narrator speaks for two minutes uninterrupted — telling a long story about leaving home — and then the interviewer asks a question, the diarizer can split the narrator's monologue across both labels. Or, when the interviewer reads a long preamble at the start, the diarizer labels them as the answerer because they spoke first and longest.

Here is a fabricated snippet that shows the kind of mess you have to clean up:

[SPEAKER_00] So tell me about the journey across.
[SPEAKER_01] We left in the spring of '52. My father had cousins in
             Saskatchewan, so we went there first. I remember
             the boat — the Beaverbrae. I can still see it.
[SPEAKER_01] Mm-hm.                                  ← interviewer
[SPEAKER_00] My mother was seasick the whole crossing.    ← narrator
[SPEAKER_00] What did you bring with you?
[SPEAKER_01] One trunk. My mother's sewing machine took up half of it.

A naive script that trusts the speaker labels will assign the narrator's seasick line to the interviewer and the interviewer's "Mm-hm" to the narrator. By itself, diarization is not enough. The verbatim text from Whisper is also not enough, because it has no speakers at all. Both together are enough, if you have something that can reason about them.
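
To make the failure concrete, here is a sketch of such a naive script. It assumes the "[SPEAKER_NN] text" line format of the fabricated snippet above and assumes SPEAKER_00 is always the interviewer. The names NAIVE_ROLES and naive_rows are hypothetical.

```ruby
# Sketch of the naive approach: trust the diarization label outright.
# Assumes SPEAKER_00 is always the interviewer, which is exactly the
# assumption that identity drift breaks.
NAIVE_ROLES = { "SPEAKER_00" => "question", "SPEAKER_01" => "response" }.freeze

def naive_rows(diarized_text)
  diarized_text.each_line.filter_map do |line|
    match = line.match(/\A\[(SPEAKER_\d+)\]\s*(.+)/)
    next unless match  # continuation lines are ignored in this sketch

    { type: NAIVE_ROLES.fetch(match[1], "unknown"), text: match[2].strip }
  end
end
```

Fed the snippet above, this marks "Mm-hm." as a response and "My mother was seasick the whole crossing." as a question. Both are wrong, and nothing in the labels alone can detect it.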

The Trick

Send both the raw and the diarized transcript to an LLM. Tell it the diarization labels are a hint, not the truth. Tell it which role asks questions and which role answers. Let it override the labels when context contradicts them.

Step 4: The LLM as Aligner and Coder

The LLM does three things in one shot: align the raw transcript against the diarized transcript line-by-line, decide the true speaker for each line, and assign each question a topic code from a closed list. It runs locally: Gemma is loaded in LM Studio and called through LM Studio's OpenAI-compatible endpoint.

The system prompt

The prompt is the load-bearing piece. It names the two inputs, names the two roles, explicitly downgrades the diarization labels to a hint, and pins the output format:

You are an expert qualitative data analyst processing oral history transcripts.

You will receive two inputs:
1. A RAW transcript (verbatim, no speaker labels).
2. A DIARIZED transcript (with speaker labels from an AI diarization
   tool — labels may contain errors).

YOUR TASK:
1. Align the RAW transcript lines with the DIARIZED transcript lines by
   matching the text content.
2. Determine the TRUE speaker for each line using CONTEXT:
   - Use the DIARIZATION speaker labels as a strong starting hint.
   - OVERRIDE the diarization if the context clearly contradicts it
     (e.g., the "Interviewer" is sharing a personal story, or the
     "Narrator" is asking interview questions).
   - The Interviewer asks questions and gives prompts. The Narrator
     answers and shares experiences.
3. Output ONLY valid Q&A rows as CSV.

OUTPUT FORMAT (CSV, no header row):
Narrator ID,Order,Type of Text,Text,Theme,Root Question

- Type of Text: "question" if the true speaker is the Interviewer;
  "response" if the true speaker is the Narrator.
- Text: VERBATIM from the RAW transcript (quote if it contains commas).
- Root Question: classify into one of these themes:
    "Family"     — family relationships and home life
    "Work"       — occupations, trades, daily labour
    "Migration"  — moving, arriving, leaving
    "Place"      — landscapes, buildings, neighbourhoods
    "Faith"      — religious practice and belief
    "War"        — military service, displacement, loss
    "Education"  — schooling, mentors, learning
    "Community"  — neighbours, clubs, mutual aid
- For response rows: carry forward the Root Question from the most
  recent preceding question.
- For follow-up questions ("Say more", "Tell me about that"): carry
  forward the Root Question from the question they follow up on.

Two rules quietly do most of the work. The first is "use the labels as a hint, override on context." The second is "responses inherit the most recent question's Root Question." That second rule is what makes the output usable for analysis. Without it, the response rows would have a blank Root Question.
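
The pipeline relies on the model to apply the carry-forward rule. For illustration, the same rule can be written as a few lines of Ruby over already-parsed rows, which could also serve as a fallback when the model leaves a Root Question blank. This is a sketch, not part of the original script. The helper name is hypothetical and the column positions follow the six-column format above.

```ruby
# Sketch: fill a blank Root Question (column 5) from the most recent question
# that had one. Column 2 is "question" or "response". Rows are copied, so the
# input is left untouched.
def carry_forward_root_question(rows)
  current = nil
  rows.map do |row|
    row = row.dup
    code = row[5].to_s.strip
    if row[2] == "question" && !code.empty?
      current = code       # a coded question sets the running topic
    elsif code.empty? && current
      row[5] = current     # responses and bare follow-ups inherit it
    end
    row
  end
end
```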

The Ruby call

Standard ruby-openai against the LM Studio endpoint, with a generous timeout because local inference on a long interview is not fast:

CLIENT = OpenAI::Client.new(
  access_token: "lm-studio",
  uri_base: LM_STUDIO_URL,
  request_timeout: 1800  # 30 minutes
)

response = CLIENT.chat(
  parameters: {
    model: MODEL_NAME,
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user",   content: user_prompt }
    ],
    temperature: 0.1,
    max_tokens: 8192
  }
)

Temperature is low because the job is alignment and classification, not generation. The user prompt is just the narrator ID, the theme tag, the raw transcript, and the diarized transcript, with a closing reminder: "Output the CSV rows now. No headers, no markdown, no explanation."
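
A minimal sketch of how that user prompt could be assembled. The helper name build_user_prompt and the section labels are assumptions. Only the closing reminder is quoted from the actual prompt.

```ruby
# Hypothetical helper: assembles the user message from the four inputs.
# The labels ("RAW TRANSCRIPT:", "DIARIZED TRANSCRIPT:") are illustrative.
def build_user_prompt(raw, diarized, narrator_id, theme)
  <<~PROMPT
    Narrator ID: #{narrator_id}
    Theme: #{theme}

    RAW TRANSCRIPT:
    #{raw}

    DIARIZED TRANSCRIPT:
    #{diarized}

    Output the CSV rows now. No headers, no markdown, no explanation.
  PROMPT
end
```

Putting the reminder last keeps the format instruction closest to where generation starts, after two long transcripts.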

Step 5: Validation Before Trust

Even at temperature: 0.1, LLMs occasionally produce a malformed line. They miss a closing quote, drop a column, or invent a topic code that wasn't in the list. The fix isn't more prompt engineering. It's validation per row, so one bad line doesn't poison the whole file:

VALID_THEMES = %w[Family Work Migration Place Faith War Education Community].freeze

def validate_and_parse_rows(csv_text, narrator_id, theme)
  valid_rows = []
  rejected = 0

  csv_text.each_line.with_index(1) do |line, line_num|
    line = line.strip
    next if line.empty?

    begin
      parsed = CSV.parse_line(line)
    rescue CSV::MalformedCSVError
      log("  REJECTED line #{line_num} (malformed): #{line[0..80]}...")
      rejected += 1
      next
    end

    if parsed.nil? || parsed.length != 6
      log("  REJECTED line #{line_num} (expected 6 cols, got #{parsed&.length})")
      rejected += 1
      next
    end

    type_of_text = parsed[2]&.downcase&.strip
    unless %w[question response].include?(type_of_text)
      log("  REJECTED line #{line_num} (invalid type '#{type_of_text}')")
      rejected += 1
      next
    end
    parsed[2] = type_of_text

    rq = parsed[5]&.strip || ""
    parsed[5] = rq
    if !rq.empty? && !VALID_THEMES.include?(rq)
      log("  WARNING line #{line_num}: unexpected theme '#{rq}'")
    end

    valid_rows << parsed
  end

  log("  Validated: #{valid_rows.length} accepted, #{rejected} rejected")
  valid_rows
end

Two design notes. Malformed lines are rejected (skipped, with a log entry pointing at the offending text). Unexpected theme values are warned — kept, but flagged. The asymmetry matters: a row with a slightly off theme is still data the researcher can review, but a row that can't even be parsed as CSV is just noise.

Step 6: Putting It Together

The main loop is short. For each parsed filename: transcribe, read transcripts, send to LLM, dump the raw LLM output to a debug folder, validate, append valid rows to the master CSV.

entries.each do |entry|
  sid       = entry[:narrator_id]
  theme     = entry[:theme]
  filepath  = entry[:filepath]
  base_path = entry[:base_path]

  log("Processing: #{File.basename(filepath)}  narrator=#{sid}  theme=#{theme}")

  next unless transcribe(filepath, base_path)

  raw = File.read("#{base_path}.raw.txt", encoding: "utf-8").strip
  dia = File.read("#{base_path}.diarized.txt", encoding: "utf-8").strip
  next if raw.empty?

  csv_text = process_with_llm(raw, dia, sid, theme)
  next if csv_text.nil? || csv_text.empty?

  # Keep the raw LLM output for debugging.
  File.write("#{DEBUG_DIR}/llm_output_#{sid}.csv", csv_text)

  rows = validate_and_parse_rows(csv_text, sid, theme)
  next if rows.empty?

  CSV.open(OUTPUT_FILE, "a") { |csv| rows.each { |row| csv << row } }
  log("  SUCCESS: #{rows.length} rows for narrator #{sid}")
end

Three things make the loop forgiving:

Per-file next on any failure — a single bad recording never blocks the rest of the batch.
Idempotent transcription — re-running the pipeline doesn't re-transcribe files that already have transcripts.
Raw LLM output dumped to debug_output/llm_output_<id>.csv — when validation rejects half the rows for one file, you can open the raw output and see what the LLM produced.

A typical log excerpt for one file looks like:

[2026-05-12 14:03:17] Processing: 1042_Migration.mp3  narrator=1042  theme=MIGRATION
[2026-05-12 14:03:17]   Submitting for transcription...
[2026-05-12 14:03:18]   Job ID: 8f3e2a91
[2026-05-12 14:11:42]   Transcription complete.
[2026-05-12 14:11:43]   Sending to LLM (raw: 18421 chars, diarized: 21770 chars)
[2026-05-12 14:18:09]   LLM returned 184 lines
[2026-05-12 14:18:09]   Validated: 182 accepted, 2 rejected
[2026-05-12 14:18:09]   SUCCESS: 182 rows for narrator 1042

Sample Output

A snippet of the master CSV (fabricated, from the oral history example):

Narrator ID,Order,Type of Text,Text,Theme,Root Question
1042,1,question,"Can you tell me about leaving home?",MIGRATION,Migration
1042,2,response,"It was 1952. The papers came through in March.",MIGRATION,Migration
1042,3,question,"And who travelled with you?",MIGRATION,Migration
1042,4,response,"My mother, my two sisters, and me. My father had gone ahead.",MIGRATION,Migration
1042,5,question,"What did you bring?",MIGRATION,Migration
1042,6,response,"One trunk. My mother's sewing machine took up half of it.",MIGRATION,Migration
1042,7,question,"Say more about the sewing machine.",MIGRATION,Migration
1042,8,response,"It was a Singer treadle. She'd had it since she was married.",MIGRATION,Migration
1051,1,question,"Tell me about your mother.",FAMILY,Family
1051,2,response,"She kept the house and worked the garden. Eight of us, three rooms.",FAMILY,Family

The shape is what the analysis tool wants. Every row stands alone: who said it, in what order, what kind of utterance, the verbatim text, the theme tag from the filename, and the topic code carried forward from the most recent question. From here, the researcher can group by theme and read every response on migration across every narrator in the corpus.

What I'd Do Differently

Stream the LLM call

Right now the script waits for the full CSV response. Streaming would let validation start as soon as the first row arrives, and surface a hung generation sooner.
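
One piece of that change is independent of the client library: turning arbitrary streamed chunks back into complete lines. A sketch, assuming chunks arrive as plain strings. LineBuffer is a hypothetical class, not part of the original script.

```ruby
# Sketch: buffers streamed text and hands each complete line to a callback
# as soon as its newline arrives.
class LineBuffer
  def initialize(&on_line)
    @pending = +""
    @on_line = on_line
  end

  # Feed one chunk; it may contain zero, one, or several newlines.
  def <<(chunk)
    @pending << chunk.to_s
    while (idx = @pending.index("\n"))
      line = @pending.slice!(0..idx).chomp
      @on_line.call(line) unless line.strip.empty?
    end
    self
  end

  # Call once the stream ends to emit a last line with no trailing newline.
  def flush
    line = @pending.strip
    @pending = +""
    @on_line.call(line) unless line.empty?
  end
end
```

As far as I can tell from the ruby-openai README, streaming is enabled by passing a stream: proc in the chat parameters, and each chunk's text sits at chunk.dig("choices", 0, "delta", "content"). Check that against the installed gem version before relying on it. Like the per-line validator, this buffer would split a quoted CSV field that contains a newline.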

Externalize the taxonomy

The list of valid themes is hard-coded in the prompt and in the validator. A JSON file loaded by both would make it one edit instead of two.
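
A sketch of what that could look like. The file layout (a JSON object mapping each theme name to its description) and the helper names are assumptions.

```ruby
require "json"

# Sketch: one JSON file feeds both the prompt and the validator.
# Assumed layout: { "Family": "family relationships and home life", ... }
def load_taxonomy(path)
  JSON.parse(File.read(path, encoding: "utf-8"))
end

# Renders the theme list for the system prompt, one indented line per theme,
# with the quoted names padded to a common width.
def taxonomy_prompt_lines(taxonomy)
  width = taxonomy.keys.map(&:length).max.to_i + 2  # +2 for the quotes
  taxonomy.map do |name, desc|
    format("    %-#{width}s - %s", "\"#{name}\"", desc)
  end.join("\n")
end
```

The validator's list then becomes load_taxonomy(path).keys, and the prompt interpolates taxonomy_prompt_lines in place of the hand-written block.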

Add a review pass

A second LLM call that re-reads its own CSV against the raw transcript and flags rows where the speaker assignment looks wrong, before the human ever sees them.

SQLite instead of CSV append

Appending to a master CSV means re-running a file duplicates its rows. A SQLite table keyed on narrator ID and row order, with re-runs replacing the existing rows, would make re-runs idempotent end-to-end.

Links

WhisperX (transcription)
pyannote-audio (diarization)
LM Studio (local LLM host)
Gemma (model)