Interruption Support
1. Standard sequence diagram: one normal playback turn
This is the ideal path.
User starts speaking
|
v
Frontend startMic -> capture 16k PCM -> ws.send(binary)
|
v
Backend bridge.sendPcm16k(...)
|
v
Gemini real-time recognition / real-time generation
|
+---- inputTranscription ----------> frontend: {"type":"transcript_in","text":"..."}
|
+---- first chunk of assistant output appears
|        |
|        +--> backend generates a turnId
|        +--> frontend: {"type":"assistant_turn_start","turnId":"asst_1"}
|
+---- outputTranscription ---------> frontend: {"type":"transcript_out","text":"...","turnId":"asst_1"}
|
+---- inlineData(audio pcm 24k) ---> frontend: binary audio chunk #1
+---- inlineData(audio pcm 24k) ---> frontend: binary audio chunk #2
+---- inlineData(audio pcm 24k) ---> frontend: binary audio chunk #3
|
+---- outputTranscription ---------> frontend: {"type":"transcript_out","text":"...","turnId":"asst_1"}
|
+---- turnComplete ----------------> frontend: {"type":"assistant_turn_complete","turnId":"asst_1"}
|                                    frontend: {"type":"turn_complete"}
v
Frontend plays out the audio it has already queued
The log order the frontend should see in the normal case
Ideally something like this:
[in ] 你好
[turn] assistant_turn_start turnId=asst_1
[out] 好的
[out] 我来帮你
[turn] assistant_turn_complete turnId=asst_1
[turn] complete
Two things to note:
1. assistant_turn_start must arrive before the first audio packet
If audio arrives first, the frontend has no activeAssistantTurnId yet and will drop it outright.
2. assistant_turn_complete does not necessarily mean "the audio has finished playing"
It means the server has finished producing this turn; the frontend may still be playing the last few frames it already scheduled, which is normal.
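Both points reduce to one gating predicate on the frontend. Condensed from the full app.js further down (the names match that code):

// Condensed from the app.js below: audio for a turn is playable only while
// that turn is the active one and has been neither interrupted nor completed.
let activeAssistantTurnId = null;       // set on assistant_turn_start
const interruptedTurnIds = new Set();   // filled on assistant_turn_interrupt
const completedTurnIds = new Set();     // filled on assistant_turn_complete

function canAcceptAudioForTurn(turnId) {
  if (!turnId) return false;                          // audio with no owner: drop
  if (turnId !== activeAssistantTurnId) return false; // not the current turn: drop
  if (interruptedTurnIds.has(turnId)) return false;   // turn was interrupted: drop
  if (completedTurnIds.has(turnId)) return false;     // turn already completed: drop
  return true;
}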
2. Standard sequence diagram: user interrupts the assistant
This is the scenario you care about most.
Assistant is speaking
|
+---- frontend is playing audio chunks of asst_1
|
User starts speaking
|
v
Frontend startMic keeps streaming PCM upstream
|
v
Backend / Gemini detects new speech activity
|
+---- frontend: {"type":"speech_started"}   (optional; some pipelines emit it)
|
+---- frontend: {"type":"assistant_turn_interrupt","turnId":"asst_1"}
+---- frontend: {"type":"interrupted"}
|
v
The frontend then:
- stops all activeSources
- sets activeAssistantTurnId = null
- interruptedTurnIds.add("asst_1")
- drops every straggling binary packet that still belongs to asst_1
|
v
Gemini starts generating the next reply
|
+---- frontend: {"type":"assistant_turn_start","turnId":"asst_2"}
+---- frontend: binary audio chunk (belongs to asst_2)
+---- frontend: {"type":"assistant_turn_complete","turnId":"asst_2"}
Ideal log order in the interrupt scenario
[turn] assistant_turn_start turnId=asst_1
[out] 砖
[out] 房
[out] 却
[turn] assistant_turn_interrupt turnId=asst_1
[event] interrupted
[interrupt] playback cleared: assistant_turn_interrupt, turnId=asst_1
[in ] 拿多啊
[turn] assistant_turn_start turnId=asst_2
[out] 好的
[out] 第一
If you get this far, protocol-level isolation is in effect.
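What "clearing playback" means concretely is also visible in the code below. Condensed from interruptPlayback plus the assistant_turn_interrupt branch of app.js (the wrapper function name here is illustrative; the real code inlines this in the ws.onmessage handler):

// Condensed from the app.js below.
function onAssistantTurnInterrupt(turnId) {
  if (turnId) interruptedTurnIds.add(turnId);   // stragglers for this turn get dropped
  for (const src of activeSources) {
    try { src.stop(0); } catch {}               // stop everything already scheduled
  }
  activeSources.clear();
  if (playCtx) nextPlayTime = playCtx.currentTime + 0.02; // reset the schedule cursor
  if (activeAssistantTurnId === turnId) activeAssistantTurnId = null;
}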
3. Interruption parameters for the realtime model
1. First, what these parameters actually do
Your current configuration:
AutomaticActivityDetection vad = AutomaticActivityDetection.builder()
.disabled(false)
.startOfSpeechSensitivity(StartSensitivity.Known.START_SENSITIVITY_HIGH)
.endOfSpeechSensitivity(EndSensitivity.Known.END_SENSITIVITY_LOW)
.prefixPaddingMs(100)
.silenceDurationMs(500)
.build();
Let's break down what each one does to interruption behavior:
startOfSpeechSensitivity (start-of-speech sensitivity)
START_SENSITIVITY_HIGH
What it controls: how quickly "the user has started speaking" is decided
- LOW → more conservative, less prone to false triggers
- HIGH → detects speech onset more readily (more sensitive)
You already have this set to HIGH, which is correct.
silenceDurationMs (end-of-speech decision time)
silenceDurationMs(500)
What it controls: how long the input must stay silent before "the user is done speaking"
⚠️ This parameter does not affect when an interruption starts, but it does affect:
- how quickly turns switch
- when the next assistant turn begins
prefixPaddingMs (prefix padding)
prefixPaddingMs(100)
What it controls: how much audio is back-filled once speech onset is detected
This mainly affects:
- ASR completeness
- it does not affect interruption speed
endOfSpeechSensitivity (end-of-speech sensitivity)
END_SENSITIVITY_LOW
What it controls: how quickly "the user has stopped speaking" is decided
- LOW → ends later (more stable)
- HIGH → ends sooner
2. What actually determines interruption speed
The core chain that triggers an interruption is:
User speaks
↓
VAD detects start_of_speech
↓
Gemini fires the interrupt
↓
bridge receives interrupted
↓
Frontend stops playback
Only two knobs really determine whether interruption feels fast:
startOfSpeechSensitivity (the most critical)
You already have it at HIGH ✔️
activityHandling (in your code, this one matters even more)
You currently have:
.activityHandling(ActivityHandling.Known.START_OF_ACTIVITY_INTERRUPTS)
This is already the strongest interruption mode.
Meaning:
as soon as user speech onset is detected, the current output is interrupted immediately
You have this set correctly too. To verify end to end how fast the whole chain is, see the latency probe sketched below.
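A rough client-side measurement: timestamp the first energetic mic chunk after silence and compare it with the arrival of the interrupted event. A minimal sketch; the 0.02 energy threshold and both function names are illustrative and not part of the app.js below:

// Illustrative latency probe, not part of app.js below.
let speechStartTs = null;

// Feed every outgoing mic chunk (Float32 samples, before PCM16 encoding) through this.
function onMicChunk(f32) {
  let sum = 0;
  for (let i = 0; i < f32.length; i++) sum += f32[i] * f32[i];
  const rms = Math.sqrt(sum / f32.length);
  if (speechStartTs === null && rms > 0.02) speechStartTs = performance.now();
}

// Call this when {"type":"interrupted"} arrives on the WebSocket.
function onInterruptedEvent() {
  if (speechStartTs !== null) {
    console.log(`interrupt latency ~ ${(performance.now() - speechStartTs).toFixed(0)} ms`);
    speechStartTs = null;
  }
}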
Recommended "more aggressive interruption" configuration
If your goal is:
interrupt the instant the user opens their mouth (similar to ChatGPT voice)
tune it like this:
Recommended configuration (more sensitive)
AutomaticActivityDetection vad = AutomaticActivityDetection.builder()
.disabled(false)
.startOfSpeechSensitivity(StartSensitivity.Known.START_SENSITIVITY_HIGH) // already the maximum
.endOfSpeechSensitivity(EndSensitivity.Known.END_SENSITIVITY_HIGH) // ends faster
.prefixPaddingMs(20) // less padding
.silenceDurationMs(150) // key change: 500 → 150
.build();
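One trade-off worth stating: END_SENSITIVITY_HIGH plus silenceDurationMs(150) also makes end-of-turn detection aggressive, so a user who pauses briefly mid-sentence may get cut off and trigger the next assistant turn early. If that shows up in practice, raise silenceDurationMs back toward 300-500 ms while keeping START_SENSITIVITY_HIGH and START_OF_ACTIVITY_INTERRUPTS; as noted above, interruption speed depends on the start-of-speech side, not the end-of-speech side.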
Backend code
package com.litongjava.voice.agent.bridge;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import com.google.genai.AsyncSession;
import com.google.genai.Client;
import com.google.genai.types.ActivityHandling;
import com.google.genai.types.AudioTranscriptionConfig;
import com.google.genai.types.AutomaticActivityDetection;
import com.google.genai.types.Blob;
import com.google.genai.types.ClientOptions;
import com.google.genai.types.Content;
import com.google.genai.types.EndSensitivity;
import com.google.genai.types.LiveConnectConfig;
import com.google.genai.types.LiveSendClientContentParameters;
import com.google.genai.types.LiveSendRealtimeInputParameters;
import com.google.genai.types.LiveServerContent;
import com.google.genai.types.LiveServerMessage;
import com.google.genai.types.Modality;
import com.google.genai.types.Part;
import com.google.genai.types.PrebuiltVoiceConfig;
import com.google.genai.types.RealtimeInputConfig;
import com.google.genai.types.SpeechConfig;
import com.google.genai.types.StartSensitivity;
import com.google.genai.types.ThinkingConfig;
import com.google.genai.types.TurnCoverage;
import com.google.genai.types.VoiceConfig;
import com.litongjava.gemini.GeminiClient;
import com.litongjava.tio.utils.hutool.StrUtil;
import com.litongjava.voice.agent.model.WsVoiceAgentResponseMessage;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class GoogleGeminiRealtimeBridge implements RealtimeModelBridge {
private static final String INPUT_MIME = "audio/pcm;rate=16000";
private static final String OUTPUT_MIME_PREFIX = "audio/pcm";
private String model = "models/gemini-2.5-flash-native-audio-preview-12-2025";
private String voiceName = "Puck";
private final Object transcriptLock = new Object();
private final StringBuilder turnUserTranscript = new StringBuilder();
private final StringBuilder turnAssistantTranscript = new StringBuilder();
/**
* Protocol-level turn control
*/
private final Object assistantTurnLock = new Object();
private volatile String currentAssistantTurnId;
private volatile boolean assistantTurnOpen = false;
private final Client client;
private volatile AsyncSession session;
private final RealtimeBridgeCallback callback;
public GoogleGeminiRealtimeBridge(RealtimeBridgeCallback sender, String url, String model, String voiceName) {
this.callback = sender;
Client.Builder b = Client.builder().apiKey(GeminiClient.GEMINI_API_KEY);
ClientOptions clientOptions = ClientOptions.builder().build();
b.clientOptions(clientOptions);
this.client = b.build();
if (model != null) {
this.model = model;
}
if (voiceName != null) {
this.voiceName = voiceName;
}
}
public GoogleGeminiRealtimeBridge(RealtimeBridgeCallback sender) {
this(sender, null, null, null);
}
@Override
public CompletableFuture<Void> connect(RealtimeSetup realtimeSetup) {
LiveConnectConfig config = buildLiveConfig();
return client.async.live.connect(model, config).thenCompose(sess -> {
this.session = sess;
String sessionId = sess.sessionId();
callback.session(sessionId);
send(new WsVoiceAgentResponseMessage("gemini_connected", sessionId));
try {
sendPromptsIfAny(sess, realtimeSetup);
} catch (Exception ex) {
log.error("send setup prompts error(connect)", ex);
send(new WsVoiceAgentResponseMessage("error", safe(ex.getMessage())));
}
CompletableFuture<Void> receiveFuture = sess.receive(this::onGeminiMessage);
receiveFuture.whenComplete((v, ex) -> {
log.info("gemini receive completed, v:{}, ex:{}", v, ex);
if (ex != null) {
log.error("gemini receive error", ex);
send(new WsVoiceAgentResponseMessage("error", safe(ex.getMessage())));
}
});
return receiveFuture;
}).exceptionally(ex -> {
log.error("Gemini live connect failed", ex);
send(new WsVoiceAgentResponseMessage("error", safe(ex.getMessage())));
callback.close("gemini connect failed");
return null;
});
}
@Override
public CompletableFuture<Void> close() {
try {
AsyncSession s = this.session;
if (s != null) {
return s.close().exceptionally(ex -> null);
}
} finally {
closeAssistantTurnSilently();
try {
client.close();
} catch (Exception ignore) {
}
try {
callback.close("close");
} catch (Exception ignore) {
}
}
return CompletableFuture.completedFuture(null);
}
/**
* Raw 16 kHz PCM stream pushed up by the frontend
*/
@Override
public CompletableFuture<Void> sendPcm16k(byte[] pcm16k) {
AsyncSession s = this.session;
if (s == null) {
return CompletableFuture.completedFuture(null);
}
Blob audioBlob = Blob.builder().mimeType(INPUT_MIME).data(pcm16k).build();
LiveSendRealtimeInputParameters params = LiveSendRealtimeInputParameters.builder().audio(audioBlob).build();
return s.sendRealtimeInput(params).exceptionally(ex -> {
String message = ex.getMessage();
log.error("sendPcm16k error: {}", message, ex);
send(new WsVoiceAgentResponseMessage("error", safe(message)));
if ("org.java_websocket.exceptions.WebsocketNotConnectedException".equals(message)) {
close();
}
return null;
});
}
/**
* Text input sent by the frontend
*/
@Override
public CompletableFuture<Void> sendText(String text) {
AsyncSession s = this.session;
if (s == null) {
return CompletableFuture.completedFuture(null);
}
Content userMessage = Content.fromParts(Part.fromText(text));
LiveSendClientContentParameters cc = LiveSendClientContentParameters.builder().turns(List.of(userMessage))
.turnComplete(true).build();
return s.sendClientContent(cc).exceptionally(ex -> {
log.error("sendText error: {}", ex.getMessage(), ex);
send(new WsVoiceAgentResponseMessage("error", safe(ex.getMessage())));
return null;
});
}
private void sendPromptsIfAny(AsyncSession s, RealtimeSetup realtimeSetup) {
if (realtimeSetup == null) {
return;
}
String systemPrompt = realtimeSetup.getSystem_prompt();
String jobDescription = realtimeSetup.getJob_description();
String resume = realtimeSetup.getResume();
String questions = realtimeSetup.getQuestions();
String greeting = realtimeSetup.getGreeting();
List<Content> initialTurns = new ArrayList<>();
if (StrUtil.notBlank(systemPrompt)) {
initialTurns.add(Content.fromParts(Part.fromText(systemPrompt)));
}
if (StrUtil.notBlank(jobDescription)) {
initialTurns.add(Content.fromParts(Part.fromText(jobDescription)));
}
if (StrUtil.notBlank(resume)) {
initialTurns.add(Content.fromParts(Part.fromText(resume)));
}
if (StrUtil.notBlank(questions) || StrUtil.notBlank(greeting)) {
initialTurns.add(Content.fromParts(
Part.fromText((greeting == null ? "" : greeting) + "\n\n" + (questions == null ? "" : questions))));
}
if (!initialTurns.isEmpty()) {
LiveSendClientContentParameters cc = LiveSendClientContentParameters.builder().turns(initialTurns)
.turnComplete(true).build();
s.sendClientContent(cc).exceptionally(ex -> {
log.error("sendPromptsIfAny error: {}", ex.getMessage(), ex);
send(new WsVoiceAgentResponseMessage("error", safe(ex.getMessage())));
return null;
});
send(new WsVoiceAgentResponseMessage("setup_sent_to_model"));
}
}
private LiveConnectConfig buildLiveConfig() {
AutomaticActivityDetection vad = AutomaticActivityDetection.builder().disabled(false)
.startOfSpeechSensitivity(StartSensitivity.Known.START_SENSITIVITY_HIGH)
//
.endOfSpeechSensitivity(EndSensitivity.Known.END_SENSITIVITY_HIGH)
//
.prefixPaddingMs(20).silenceDurationMs(150)
//
.build();
RealtimeInputConfig realtimeInput = RealtimeInputConfig.builder().automaticActivityDetection(vad)
.activityHandling(ActivityHandling.Known.START_OF_ACTIVITY_INTERRUPTS)
.turnCoverage(TurnCoverage.Known.TURN_INCLUDES_ONLY_ACTIVITY).build();
PrebuiltVoiceConfig prebuiltVoiceConfig = PrebuiltVoiceConfig.builder().voiceName(voiceName).build();
VoiceConfig voiceConfig = VoiceConfig.builder().prebuiltVoiceConfig(prebuiltVoiceConfig).build();
SpeechConfig speech = SpeechConfig.builder().voiceConfig(voiceConfig).build();
ThinkingConfig thinkingConfig = ThinkingConfig.builder().thinkingBudget(0).build();
AudioTranscriptionConfig audioTranscriptionConfig = AudioTranscriptionConfig.builder().build();
return LiveConnectConfig.builder().responseModalities(List.of(new Modality(Modality.Known.AUDIO)))
.speechConfig(speech).thinkingConfig(thinkingConfig).realtimeInputConfig(realtimeInput)
.inputAudioTranscription(audioTranscriptionConfig).outputAudioTranscription(audioTranscriptionConfig).build();
}
/**
* Gemini -> frontend
*/
private void onGeminiMessage(LiveServerMessage msg) {
try {
if (msg == null) {
return;
}
msg.serverContent().ifPresent(this::handleServerContent);
msg.usageMetadata().ifPresent(usage -> {
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("usage");
m.setPromptTokenCount(usage.promptTokenCount());
m.setResponseTokenCount(usage.responseTokenCount());
m.setTotalTokenCount(usage.totalTokenCount());
send(m);
});
msg.goAway().ifPresent(goAway -> {
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("go_away");
Optional<Duration> timeLeft = goAway.timeLeft();
m.setTimeLeft(timeLeft.orElse(null));
send(m);
});
msg.toolCall().ifPresent(toolCall -> {
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("tool_call");
m.setText(toolCall.toString());
send(m);
});
msg.toolCallCancellation().ifPresent(cancel -> {
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("tool_call_cancellation");
m.setText(cancel.toString());
send(m);
});
} catch (Exception e) {
log.error("onGeminiMessage error", e);
send(new WsVoiceAgentResponseMessage("error", safe(e.getMessage())));
}
}
private void handleServerContent(LiveServerContent sc) {
if (sc == null) {
return;
}
sc.inputTranscription().ifPresent(t -> {
String text = t.text().orElse("");
if (StrUtil.isNotBlank(text)) {
appendUserTranscript(text);
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("transcript_in");
m.setText(text);
send(m);
}
});
sc.outputTranscription().ifPresent(t -> {
String text = t.text().orElse("");
if (StrUtil.isNotBlank(text)) {
ensureAssistantTurnStarted();
appendAssistantTranscript(text);
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("transcript_out");
m.setText(text);
m.setTurnId(currentAssistantTurnId);
send(m);
}
});
sc.modelTurn().ifPresent(modelTurn -> {
List<Part> parts = modelTurn.parts().orElse(List.of());
for (Part p : parts) {
if (p == null) {
continue;
}
p.text().ifPresent(text -> {
if (StrUtil.isNotBlank(text)) {
ensureAssistantTurnStarted();
appendAssistantTranscript(text);
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("text");
m.setText(text);
m.setTurnId(currentAssistantTurnId);
send(m);
}
});
p.inlineData().ifPresent(blob -> {
String mt = blob.mimeType().orElse("");
byte[] data = blob.data().orElse(null);
if (data != null && mt.startsWith(OUTPUT_MIME_PREFIX)) {
ensureAssistantTurnStarted();
callback.sendBinary(data);
}
});
}
});
sc.interrupted().ifPresent(v -> {
if (Boolean.TRUE.equals(v)) {
String turnId = currentAssistantTurnId;
log.info("interrupted:{}", turnId);
if (turnId != null) {
WsVoiceAgentResponseMessage turnInterrupt = new WsVoiceAgentResponseMessage("assistant_turn_interrupt");
turnInterrupt.setTurnId(turnId);
send(turnInterrupt);
}
send(new WsVoiceAgentResponseMessage("interrupted"));
closeAssistantTurnSilently();
}
});
if (sc.turnComplete().orElse(false)) {
String turnId = currentAssistantTurnId;
if (turnId != null) {
WsVoiceAgentResponseMessage turnComplete = new WsVoiceAgentResponseMessage("assistant_turn_complete");
turnComplete.setTurnId(turnId);
send(turnComplete);
}
send(new WsVoiceAgentResponseMessage("turn_complete"));
flushTurnTranscriptOnComplete();
closeAssistantTurnSilently();
}
}
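/**
* Lazily opens an assistant turn on the first piece of output (transcript,
* text, or inline audio). Because this runs before callback.sendBinary(...),
* assistant_turn_start always reaches the frontend before the first audio
* chunk of the turn, which the frontend's routing depends on.
*/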
private String ensureAssistantTurnStarted() {
synchronized (assistantTurnLock) {
if (assistantTurnOpen && currentAssistantTurnId != null) {
return currentAssistantTurnId;
}
currentAssistantTurnId = newAssistantTurnId();
assistantTurnOpen = true;
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("assistant_turn_start");
m.setTurnId(currentAssistantTurnId);
send(m);
return currentAssistantTurnId;
}
}
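/**
* Resets turn state without emitting any message. Called on both interrupt
* and turnComplete, so the next piece of output lazily opens a fresh turn
* (asst_2, asst_3, ...) via ensureAssistantTurnStarted().
*/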
private void closeAssistantTurnSilently() {
synchronized (assistantTurnLock) {
assistantTurnOpen = false;
currentAssistantTurnId = null;
}
}
private String newAssistantTurnId() {
return "asst_" + System.currentTimeMillis() + "_" + UUID.randomUUID().toString().replace("-", "");
}
private void appendUserTranscript(String text) {
synchronized (transcriptLock) {
if (turnUserTranscript.length() > 0) {
turnUserTranscript.append(' ');
}
turnUserTranscript.append(text);
}
}
private void appendAssistantTranscript(String text) {
synchronized (transcriptLock) {
if (turnAssistantTranscript.length() > 0) {
turnAssistantTranscript.append(' ');
}
turnAssistantTranscript.append(text);
}
}
private void flushTurnTranscriptOnComplete() {
synchronized (transcriptLock) {
String userText = turnUserTranscript.toString().trim();
String assistantText = turnAssistantTranscript.toString().trim();
if (StrUtil.isNotBlank(userText) || StrUtil.isNotBlank(assistantText)) {
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("turn_transcript");
m.setInputText(userText);
m.setOutputText(assistantText);
send(m);
}
turnUserTranscript.setLength(0);
turnAssistantTranscript.setLength(0);
}
}
private void send(WsVoiceAgentResponseMessage msg) {
try {
callback.sendText(msg.toJson());
} catch (Exception e) {
log.error("send ws message error: {}", msg, e);
}
}
private String safe(String s) {
if (s == null) {
return "";
}
return s.length() > 1000 ? s.substring(0, 1000) : s;
}
/**
* The frontend signals "audio input has ended"
*/
@Override
public CompletableFuture<Void> endAudioInput() {
AsyncSession s = this.session;
if (s == null) {
return CompletableFuture.completedFuture(null);
}
LiveSendRealtimeInputParameters params = LiveSendRealtimeInputParameters.builder().audioStreamEnd(true).build();
return s.sendRealtimeInput(params).exceptionally(ex -> {
String message = ex.getMessage();
log.error("sendAudioStreamEnd error: {}", message, ex);
send(new WsVoiceAgentResponseMessage("error", safe(message)));
if ("org.java_websocket.exceptions.WebsocketNotConnectedException".equals(message)) {
close();
}
return null;
});
}
}
Frontend code
app.js
const el = (id) => document.getElementById(id);
const wsUrlInput = el("wsUrl");
const btnConnect = el("btnConnect");
const btnDisconnect = el("btnDisconnect");
const btnStartMic = el("btnStartMic");
const btnStopMic = el("btnStopMic");
const btnAudioEnd = el("btnAudioEnd");
const textInput = el("textInput");
const btnSendText = el("btnSendText");
const logEl = el("log");
const playStateEl = el("playState");
const btnClearLog = el("btnClearLog");
// context field elements
const systemPromptEl = el("systemPrompt");
const jobDescriptionEl = el("jobDescription");
const resumeEl = el("resume");
const questionsEl = el("questions");
const greetingEl = el("greeting");
// sessionId display element
const sessionIdEl = el("sessionId");
function logLine(s) {
logEl.textContent += s + "\n";
logEl.scrollTop = logEl.scrollHeight;
}
function setPlayState(obj) {
playStateEl.textContent = JSON.stringify(obj, null, 2);
}
function setSessionId(sessionId) {
sessionIdEl.textContent = sessionId || "-";
}
function setSessionDisconnected(disconnected) {
sessionIdEl.classList.toggle("disconnected", !!disconnected);
}
function defaultWsUrl() {
const loc = window.location;
const proto = loc.protocol === "https:" ? "wss:" : "ws:";
return `${proto}//${loc.host}/api/v1/voice/agent`;
}
wsUrlInput.value = defaultWsUrl();
/** ---------- WebSocket ---------- */
let ws = null;
/** ---------- Audio (Mic) ---------- */
let micStream = null;
let micCtx = null;
let micNode = null; // AudioWorkletNode or ScriptProcessorNode
let micEnabled = false;
/** ---------- Audio (Playback) ---------- */
let playCtx = null;
let masterGain = null;
let nextPlayTime = 0;
let playedChunks = 0;
let droppedChunks = 0;
// the assistant turn currently allowed to receive and play audio
let activeAssistantTurnId = null;
// turns that have been interrupted / completed; any straggling binary audio for them is dropped outright
const interruptedTurnIds = new Set();
const completedTurnIds = new Set();
// sources that have been created and may still be playing or waiting to play
const activeSources = new Set();
const INPUT_RATE = 16000;
const OUTPUT_RATE = 24000;
function pcm16ToFloat32(int16) {
const f32 = new Float32Array(int16.length);
for (let i = 0; i < int16.length; i++) {
f32[i] = int16[i] / 32768;
}
return f32;
}
function float32ToInt16PCM(f32) {
const out = new Int16Array(f32.length);
for (let i = 0; i < f32.length; i++) {
let s = Math.max(-1, Math.min(1, f32[i]));
out[i] = s < 0 ? s * 32768 : s * 32767;
}
return out;
}
// resample 24k Float32 to playCtx.sampleRate (usually 48k)
function resampleLinear(input, inRate, outRate) {
if (inRate === outRate) return input;
const ratio = inRate / outRate;
const outLen = Math.floor(input.length / ratio);
const out = new Float32Array(outLen);
for (let i = 0; i < outLen; i++) {
const t = i * ratio;
const i0 = Math.floor(t);
const i1 = Math.min(i0 + 1, input.length - 1);
const frac = t - i0;
out[i] = input[i0] * (1 - frac) + input[i1] * frac;
}
return out;
}
function getPlaybackState() {
return {
playSampleRate: playCtx?.sampleRate || null,
nextPlayTime,
playedChunks,
droppedChunks,
activeSources: activeSources.size,
activeAssistantTurnId,
interruptedTurnIds: Array.from(interruptedTurnIds),
completedTurnIds: Array.from(completedTurnIds),
playCtxState: playCtx?.state || "none"
};
}
function updatePlayState() {
setPlayState(getPlaybackState());
}
async function ensurePlaybackContext() {
if (playCtx) return;
playCtx = new (window.AudioContext || window.webkitAudioContext)();
masterGain = playCtx.createGain();
masterGain.gain.value = 1;
masterGain.connect(playCtx.destination);
nextPlayTime = playCtx.currentTime + 0.05;
updatePlayState();
}
function canAcceptAudioForTurn(turnId) {
if (!turnId) return false;
if (!activeAssistantTurnId) return false;
if (turnId !== activeAssistantTurnId) return false;
if (interruptedTurnIds.has(turnId)) return false;
if (completedTurnIds.has(turnId)) return false;
return true;
}
function clearAllScheduledSources() {
for (const src of activeSources) {
try {
src.stop(0);
} catch {}
}
activeSources.clear();
}
function interruptPlayback(reason = "", turnId = null) {
clearAllScheduledSources();
if (playCtx) {
const now = playCtx.currentTime;
nextPlayTime = now + 0.02;
}
if (turnId && activeAssistantTurnId === turnId) {
activeAssistantTurnId = null;
}
logLine(`[interrupt] playback cleared: ${reason}${turnId ? `, turnId=${turnId}` : ""}`);
updatePlayState();
}
function resetPlaybackRouting() {
activeAssistantTurnId = null;
interruptedTurnIds.clear();
completedTurnIds.clear();
clearAllScheduledSources();
if (playCtx) {
nextPlayTime = playCtx.currentTime + 0.02;
} else {
nextPlayTime = 0;
}
updatePlayState();
}
function schedulePcmPlayback(pcmInt16_24k, turnId) {
if (!playCtx || !masterGain) return;
// audio not belonging to the current turn: drop it
if (!canAcceptAudioForTurn(turnId)) {
droppedChunks++;
updatePlayState();
return;
}
const f32_24k = pcm16ToFloat32(pcmInt16_24k);
const f32 = resampleLinear(f32_24k, OUTPUT_RATE, playCtx.sampleRate);
const buffer = playCtx.createBuffer(1, f32.length, playCtx.sampleRate);
buffer.copyToChannel(f32, 0);
const src = playCtx.createBufferSource();
src.buffer = buffer;
src.connect(masterGain);
const now = playCtx.currentTime;
if (nextPlayTime < now) {
nextPlayTime = now + 0.01;
droppedChunks++;
}
const startAt = nextPlayTime;
nextPlayTime += buffer.duration;
activeSources.add(src);
src.onended = () => {
activeSources.delete(src);
updatePlayState();
};
// re-check the turn right before start, in case an interrupt/complete arrived just now
if (!canAcceptAudioForTurn(turnId)) {
activeSources.delete(src);
droppedChunks++;
updatePlayState();
return;
}
src.start(startAt);
playedChunks++;
updatePlayState();
}
/** ---------- WS Handlers ---------- */
function setUiConnected(connected) {
btnConnect.disabled = connected;
btnDisconnect.disabled = !connected;
btnStartMic.disabled = !connected;
btnStopMic.disabled = !connected;
btnAudioEnd.disabled = !connected;
btnSendText.disabled = !connected;
}
function connectWs() {
const url = wsUrlInput.value.trim();
ws = new WebSocket(url);
ws.binaryType = "arraybuffer";
ws.onopen = async () => {
logLine(`[ws] open: ${url}`);
setUiConnected(true);
setSessionDisconnected(false);
resetPlaybackRouting();
await ensurePlaybackContext();
try {
const systemPrompt = systemPromptEl.value?.trim() || "";
const jobDescription = jobDescriptionEl.value?.trim() || "";
const resume = resumeEl.value?.trim() || "";
const questions = questionsEl.value?.trim() || "";
const greeting = greetingEl.value?.trim() || "";
const setupMsg = {
type: "setup",
system_prompt: systemPrompt,
job_description: jobDescription,
resume: resume,
questions: questions,
greeting: greeting
};
ws.send(JSON.stringify(setupMsg));
logLine(`[send] setup: ${JSON.stringify({
system_prompt: systemPrompt,
job_description: jobDescription,
resume: resume,
questions: questions,
greeting: greeting
})}`);
} catch (e) {
logLine("[send] setup error: " + (e?.message || e));
}
};
ws.onclose = (e) => {
logLine(`[ws] close: code=${e.code} reason=${e.reason || ""}`);
setUiConnected(false);
setSessionDisconnected(true);
resetPlaybackRouting();
stopMic().catch(() => {});
};
ws.onerror = () => {
logLine("[ws] error");
};
ws.onmessage = async (evt) => {
if (typeof evt.data === "string") {
try {
const obj = JSON.parse(evt.data);
if (obj.type === "SETUP_RECEIVED") {
setSessionId(obj.sessionId || "-");
setSessionDisconnected(false);
logLine(`[setup_received] sessionId=${obj.sessionId || ""}`);
} else if (obj.type === "assistant_turn_start") {
const turnId = obj.turnId || null;
// when a new turn starts, fully interrupt the old playback first to avoid two turns overlapping
if (activeAssistantTurnId && activeAssistantTurnId !== turnId) {
interruptPlayback("new assistant_turn_start replaces previous turn", activeAssistantTurnId);
}
activeAssistantTurnId = turnId;
// when a new turn arrives, clear any same-named turn from the stale sets
if (turnId) {
interruptedTurnIds.delete(turnId);
completedTurnIds.delete(turnId);
}
logLine(`[turn] assistant_turn_start turnId=${turnId || ""}`);
updatePlayState();
} else if (obj.type === "assistant_turn_interrupt") {
const turnId = obj.turnId || null;
if (turnId) interruptedTurnIds.add(turnId);
logLine(`[turn] assistant_turn_interrupt turnId=${turnId || ""}`);
interruptPlayback("assistant_turn_interrupt", turnId);
} else if (obj.type === "assistant_turn_complete") {
const turnId = obj.turnId || null;
if (turnId) completedTurnIds.add(turnId);
logLine(`[turn] assistant_turn_complete turnId=${turnId || ""}`);
// complete only closes routing; it does not stop() anything.
// The last short segment already scheduled plays out normally; any straggler packets after complete are dropped.
if (turnId && activeAssistantTurnId === turnId) {
activeAssistantTurnId = null;
}
updatePlayState();
} else if (obj.type === "speech_started") {
logLine("[event] speech_started");
// the user started speaking; treat it as an interrupt signal
if (activeAssistantTurnId) {
interruptedTurnIds.add(activeAssistantTurnId);
}
interruptPlayback("speech_started", activeAssistantTurnId);
} else if (obj.type === "interrupted") {
logLine("[event] interrupted");
if (activeAssistantTurnId) {
interruptedTurnIds.add(activeAssistantTurnId);
}
interruptPlayback("interrupted", activeAssistantTurnId);
} else if (obj.type === "transcript_in") {
logLine(`[in ] ${obj.text || ""}`);
} else if (obj.type === "transcript_out") {
logLine(`[out] ${obj.text || ""}`);
} else if (obj.type === "text") {
logLine(`[txt] ${obj.text || ""}`);
} else if (obj.type === "turn_transcript") {
logLine(`[turn_transcript] in=${obj.inputText || ""} | out=${obj.outputText || ""}`);
} else if (obj.type === "turn_complete") {
logLine("[turn] complete");
} else if (obj.type === "setup_complete") {
logLine("[setup] complete");
} else if (obj.type === "setup_sent_to_model") {
logLine("[setup] sent_to_model");
} else if (obj.type === "gemini_connected") {
if (obj.sessionId) {
setSessionId(obj.sessionId);
setSessionDisconnected(false);
}
logLine(`[gemini] connected sessionId=${obj.sessionId || ""}`);
} else if (obj.type === "usage") {
logLine(
`[usage] prompt=${obj.promptTokenCount} response=${obj.responseTokenCount} total=${obj.totalTokenCount}`
);
} else if (obj.type === "go_away") {
logLine(`[goAway] timeLeft=${obj.timeLeft}`);
} else if (obj.type === "error") {
logLine(`[err] ${obj.where || ""}: ${obj.message || obj.text || ""}`);
} else {
logLine(`[evt] ${evt.data}`);
}
} catch {
logLine(`[text] ${evt.data}`);
}
return;
}
// binary: 24k 16-bit PCM mono
if (evt.data instanceof ArrayBuffer) {
// no active assistant turn means this audio packet has no legitimate owner: drop it
const turnId = activeAssistantTurnId;
if (!turnId) {
droppedChunks++;
updatePlayState();
return;
}
const bytes = new Uint8Array(evt.data);
const i16 = new Int16Array(bytes.buffer, bytes.byteOffset, Math.floor(bytes.byteLength / 2));
if (playCtx && playCtx.state === "suspended") {
await playCtx.resume();
}
schedulePcmPlayback(i16, turnId);
}
};
}
/** ---------- Mic Capture ---------- */
async function startMic() {
if (!ws || ws.readyState !== WebSocket.OPEN) {
logLine("WS 未连接");
return;
}
if (micEnabled) return;
await ensurePlaybackContext();
if (playCtx.state === "suspended") await playCtx.resume();
micStream = await navigator.mediaDevices.getUserMedia({
audio: {
channelCount: 1,
echoCancellation: true,
noiseSuppression: true,
autoGainControl: true
}
});
micCtx = new (window.AudioContext || window.webkitAudioContext)();
const source = micCtx.createMediaStreamSource(micStream);
try {
await micCtx.audioWorklet.addModule("./mic-worklet.js");
micNode = new AudioWorkletNode(micCtx, "mic-processor");
micNode.port.onmessage = (e) => {
const msg = e.data || {};
if (msg.type === "pcm_f32_16k") {
const f32 = new Float32Array(msg.data);
const i16 = float32ToInt16PCM(f32);
if (ws && ws.readyState === WebSocket.OPEN) {
ws.send(i16.buffer);
}
}
};
source.connect(micNode);
micNode.connect(micCtx.destination);
micNode.port.postMessage({ type: "enable" });
micEnabled = true;
logLine("[mic] started (AudioWorklet)");
} catch (err) {
logLine("[mic] AudioWorklet 不可用,回退到 ScriptProcessor");
const bufferSize = 4096;
const sp = micCtx.createScriptProcessor(bufferSize, 1, 1);
micNode = sp;
sp.onaudioprocess = (e) => {
const input = e.inputBuffer.getChannelData(0);
const resampled = resampleLinear(input, micCtx.sampleRate, INPUT_RATE);
const i16 = float32ToInt16PCM(resampled);
if (ws && ws.readyState === WebSocket.OPEN) {
ws.send(i16.buffer);
}
};
source.connect(sp);
sp.connect(micCtx.destination);
micEnabled = true;
logLine("[mic] started (ScriptProcessor)");
}
btnStartMic.disabled = true;
btnStopMic.disabled = false;
}
async function stopMic() {
micEnabled = false;
try {
if (micNode && micNode.port) {
micNode.port.postMessage({ type: "disable" });
}
} catch {}
if (micStream) {
micStream.getTracks().forEach((t) => t.stop());
micStream = null;
}
if (micCtx) {
try {
await micCtx.close();
} catch {}
micCtx = null;
}
micNode = null;
btnStartMic.disabled = !ws || ws.readyState !== WebSocket.OPEN;
btnStopMic.disabled = true;
logLine("[mic] stopped");
}
/** ---------- UI actions ---------- */
btnConnect.onclick = () => {
if (ws && ws.readyState === WebSocket.OPEN) return;
connectWs();
};
btnDisconnect.onclick = () => {
if (ws) ws.close(1000, "client close");
};
btnStartMic.onclick = () => startMic().catch((e) => logLine("[mic] start error: " + (e?.message || e)));
btnStopMic.onclick = () => stopMic().catch(() => {});
btnAudioEnd.onclick = () => {
if (!ws || ws.readyState !== WebSocket.OPEN) return;
ws.send(JSON.stringify({ type: "audio_end" }));
logLine("[send] audio_end");
};
btnSendText.onclick = () => {
if (!ws || ws.readyState !== WebSocket.OPEN) return;
const t = textInput.value.trim();
if (!t) return;
ws.send(JSON.stringify({
type: "text",
text: t
}));
logLine("[send] text: " + t);
textInput.value = "";
};
btnClearLog.onclick = () => {
logEl.textContent = "";
};
setUiConnected(false);
btnStopMic.disabled = true;
setSessionId("-");
setSessionDisconnected(true);
updatePlayState();
