片段汇总
项目目标
本项目旨在构建一个功能完备的 RAG(Retrieval-Augmented Generation)系统,主要目标包括:
- 知识库管理:支持创建、更新和删除知识库,便于用户高效维护内容。
- 文档处理:包括文档的拆分、片段的向量化处理,以提升检索效率和准确性。
- 问答系统:提供高效的向量检索和实时生成回答的能力,支持复杂汇总类问题的处理。
- 系统优化:通过统计分析和推理问答调试,不断优化系统性能和用户体验。
系统核心概念
在 RAG 系统中,以下是几个核心概念:
- 应用:知识库的集合。每个应用可以自定义提示词,以满足不同的个性化需求。
- 知识库:由多个文档组成,便于用户对内容进行分类和管理。
- 文档:系统中对应的真实文档内容。
- 片段:文档经过拆分后的最小内容单元,用于更高效的处理和检索。
功能实现步骤
数据库设计 查看 01.md
设计并实现项目所需的数据表结构与数据库方案,为后续的数据操作打下坚实基础。用户登录 查看 02.md
实现了安全可靠的用户认证系统,保护用户数据并限制未经授权的访问。模型管理 查看 03.md
支持针对不同平台的模型(如 OpenAI、Google Gemini、Claude)进行管理与配置。知识库管理 查看 04.md
提供创建、更新及删除知识库的功能,方便用户维护与管理文档内容。文档拆分 查看 05.md
可将文档拆分为多个片段,便于后续向量化和检索操作。片段向量 查看 06.md
将文本片段进行向量化处理,以便进行语义相似度计算及高效检索。命中率测试 查看 07.md
通过语义相似度和 Top-N 算法,检索并返回与用户问题最相关的文档片段,用于评估检索的准确性。文档管理 查看 08.md
提供上传和管理文档的功能,上传后可自动拆分为片段便于进一步处理。片段管理 查看 09.md
允许对已拆分的片段进行增、删、改、查等操作,确保内容更新灵活可控。问题管理 查看 10.md
为片段指定相关问题,以提升检索时的准确性与关联度。应用管理 查看 11.md
提供创建和配置应用(智能体)的功能,并可关联指定模型和知识库。向量检索 查看 12.md
基于语义相似度,在知识库中高效检索与用户问题最匹配的片段。推理问答调试 查看 13.md
提供检索与问答性能的评估工具,帮助开发者进行系统优化与调试。对话问答 查看 14.md
为用户提供友好的人机交互界面,结合检索到的片段与用户问题实时生成回答。统计分析 查看 15.md
对用户的提问与系统回答进行数据化分析,并以可视化图表的形式呈现系统使用情况。用户管理 查看 16.md
提供多用户管理功能,包括用户的增删改查及权限控制。API 管理 查看 17.md
对外提供标准化 API,便于外部系统集成和调用本系统的功能。存储文件到 S3 查看 18.md
将用户上传的文件存储至 S3 等对象存储平台,提升文件管理的灵活性与可扩展性。文档解析优化 查看 19.md
介绍与对比常见的文档解析方案,并提供提升文档解析速度和准确性的优化建议。片段汇总 查看 20.md
对片段内容进行汇总,以提升总结类问题的查询与回答效率。文档多分块与检索 查看 21.md
将片段进一步拆分为句子并进行向量检索,提升检索的准确度与灵活度。多文档支持 查看 22.md
兼容多种文档格式,包括.doc
,.docx
,.xls
,.xlsx
,.ppt
,.pptx
等。对话日志 查看 23.md
记录并展示对话日志,用于后续分析和问题回溯。检索性能优化 查看 24.md
提供整库扫描和分区检索等多种方式,进一步提高检索速度和效率。Milvus 查看 25.md
将向量数据库切换至 Milvus,以在大规模向量检索场景中获得更佳的性能与可扩展性。文档解析方案和费用对比 查看 26.md
对比不同文档解析方案在成本、速度、稳定性等方面的差异,为用户提供更加经济高效的选择。爬取网页数据 查看 27.md
支持从网页中抓取所需内容,后续处理流程与本地文档一致:分段、向量化、存储与检索。
片段汇总
功能背景
在 RAG 系统中,向量检索对于单个问题的检索准确度较高,但在处理复杂的汇总类问题时,检索的准确度会有所下降。这是因为大模型将多个片段添加到上下文中,导致信息的冗余和干扰。为了解决这一问题,本项目引入了片段汇总功能,通过对大量片段内容进行总结,提高复杂问题的回答准确率。
实现流程
- 文档拆分:将原始文档拆分为多个片段,每个片段约 1000 个 token。
- 片段总结:
- 使用预定义的总结提示词对每个片段内容进行总结,生成简洁且结构化的摘要。
- 将生成的摘要存储到
max_kb_sentence
表中,类型设置为 2,表示为总结内容。
- 向量化处理:
- 对总结后的片段内容进行向量化,以便后续的语义相似度计算和高效检索。
- 存储与管理:
- 将总结内容及其向量化结果存储到数据库中,便于快速检索和管理。
提示词
ParagraphSummaryPrompt.txt
文件内容
Summarize the following text in a concise, structured manner, ensuring that it covers the core themes and key information, suitable for use in vector similarity calculations:
Summary of
代码解析
MaxKbParagraphSummaryService
类
该类负责对段落内容进行总结,并将总结结果缓存到数据库中,以提高总结效率和避免重复请求。
package com.litongjava.maxkb.service.kb;
import java.util.concurrent.locks.Lock;
import com.google.common.util.concurrent.Striped;
import com.litongjava.db.activerecord.Db;
import com.litongjava.maxkb.model.MaxKbParagraphSummaryCache;
import com.litongjava.openai.chat.OpenAiChatResponseVo;
import com.litongjava.openai.client.OpenAiClient;
import com.litongjava.template.PromptEngine;
import com.litongjava.tio.utils.crypto.Md5Utils;
import com.litongjava.tio.utils.snowflake.SnowflakeIdUtils;
public class MaxKbParagraphSummaryService {
private static final Striped<Lock> locks = Striped.lock(1024);
String prompt = PromptEngine.renderToString("ParagraphSummaryPrompt.txt");
String selectContentSql = "select content from max_kb_paragraph_summary_cache where md5=?";
public String summary(String paragraphContent) {
String md5 = Md5Utils.getMD5(paragraphContent);
String summaryContent = Db.queryStr(selectContentSql, md5);
if (summaryContent != null) {
return summaryContent;
}
prompt += "\r\n\r\n" + paragraphContent;
Lock lock = locks.get(md5);
try {
summaryContent = Db.queryStr(selectContentSql, md5);
if (summaryContent != null) {
return summaryContent;
}
OpenAiChatResponseVo chat = null;
long start = System.currentTimeMillis();
try {
chat = OpenAiClient.chat(prompt);
} catch (Exception e) {
try {
chat = OpenAiClient.chat(prompt);
} catch (Exception e1) {
chat = OpenAiClient.chat(prompt);
}
}
long end = System.currentTimeMillis();
if (chat != null) {
summaryContent = chat.getChoices().get(0).getMessage().getContent();
MaxKbParagraphSummaryCache model = new MaxKbParagraphSummaryCache();
model.setId(SnowflakeIdUtils.id()).setMd5(md5).setSrc(paragraphContent).setContent(summaryContent);
model.setElapsed((end - start)).setSystemFingerprint(chat.getSystem_fingerprint()).setModel(chat.getModel())
//
.setCompletionTokens(chat.getUsage().getCompletion_tokens())
//
.setPromptTokens(chat.getUsage().getPrompt_tokens())
//
.setTotalTokens(chat.getUsage().getTotal_tokens());
model.save();
return summaryContent;
}
} finally {
lock.unlock();
}
return null;
}
}
代码说明:
- 锁机制:使用
Striped<Lock>
确保在多线程环境下对同一段落内容的总结请求不会重复执行,避免资源浪费。 - 摘要缓存:在对段落内容进行总结前,首先检查数据库中是否已存在该段落的摘要(通过 MD5 校验),若存在则直接返回,避免重复调用外部 API。
- 外部 API 调用:若摘要不存在,则调用 OpenAI 的 Chat API 生成摘要,并将结果存储到
max_kb_paragraph_summary_cache
表中,以供后续使用。
MaxKbSentenceService
类
该类负责将段落摘要转换为句子级别的记录,并进行向量化处理,最终存储到数据库中。
package com.litongjava.maxkb.service.kb;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.Future;
import org.postgresql.util.PGobject;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.jfinal.aop.Aop;
import com.litongjava.maxkb.constant.TableNames;
import com.litongjava.maxkb.model.MaxKbSentence;
import com.litongjava.maxkb.utils.ExecutorServiceUtils;
import com.litongjava.tio.utils.crypto.Md5Utils;
import com.litongjava.tio.utils.snowflake.SnowflakeIdUtils;
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.openai.OpenAiTokenizer;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class MaxKbSentenceService {
MaxKbEmbeddingService maxKbEmbeddingService = Aop.get(MaxKbEmbeddingService.class);
CompletionService<MaxKbSentence> completionServiceSentence = new ExecutorCompletionService<>(ExecutorServiceUtils.getExecutorService());
CompletionService<Row> completionServiceRow = new ExecutorCompletionService<>(ExecutorServiceUtils.getExecutorService());
MaxKbParagraphSummaryService maxKbParagraphSummaryService = Aop.get(MaxKbParagraphSummaryService.class);
public boolean summaryToSentenceAndSave(Long dataset_id, String modelName, List<Row> paragraphRecords, Long documentIdFinal) {
boolean transactionSuccess;
// Step 1: Generate summaries asynchronously
List<Future<MaxKbSentence>> summaryFutures = new ArrayList<>(paragraphRecords.size());
for (Row paragraph : paragraphRecords) {
Future<MaxKbSentence> future = completionServiceSentence.submit(() -> {
String paragraphContent = paragraph.getStr("content");
// Generate summary for the paragraph
String sentenceContent = maxKbParagraphSummaryService.summary(paragraphContent);
// Create MaxKbSentence object with the summary
MaxKbSentence maxKbSentence = new MaxKbSentence();
maxKbSentence.setId(SnowflakeIdUtils.id()).setType(2) // Assuming type 2 indicates a summary
.setHitNum(0).setMd5(Md5Utils.getMD5(sentenceContent)).setContent(sentenceContent)
//
.setDatasetId(dataset_id).setDocumentId(documentIdFinal).setParagraphId(paragraph.getLong("id"));
return maxKbSentence;
});
summaryFutures.add(future);
}
// Step 2: Retrieve summaries from futures
List<MaxKbSentence> sentences = new ArrayList<>();
for (Future<MaxKbSentence> future : summaryFutures) {
try {
MaxKbSentence sentence = future.get(); // This will block until the summary is ready
if (sentence != null) {
sentences.add(sentence);
}
} catch (Exception e) {
log.error("Error generating summary: {}", e.getMessage(), e);
}
}
// Step 3: Generate vectors asynchronously for each summary
List<Future<Row>> vectorFutures = new ArrayList<>(sentences.size());
for (MaxKbSentence sentence : sentences) {
Future<Row> future = completionServiceRow.submit(() -> {
// Generate vector for the summary content
PGobject vector = maxKbEmbeddingService.getVector(sentence.getContent(), modelName);
Row record = sentence.toRecord();
record.set("embedding", vector);
return record;
});
vectorFutures.add(future);
}
// Step 4: Retrieve vectors from futures
List<Row> sentenceRows = new ArrayList<>();
for (Future<Row> future : vectorFutures) {
try {
Row record = future.get(); // This will block until the vector is ready
if (record != null) {
sentenceRows.add(record);
}
} catch (Exception e) {
log.error("Error generating vector: {}", e.getMessage(), e);
}
}
// Step 5: Save all sentence records with embeddings to the database within a transaction
transactionSuccess = Db.tx(() -> {
// Delete existing sentences for the document to avoid duplicates
Db.deleteById(TableNames.max_kb_sentence, "document_id", documentIdFinal);
// Batch save the new sentences with embeddings
Db.batchSave(TableNames.max_kb_sentence, sentenceRows, 2000);
return true;
});
return transactionSuccess;
}
}
代码说明:
- 并行处理:利用
CompletionService
实现异步处理,提升系统的处理效率。- 生成摘要:对每个段落内容调用
MaxKbParagraphSummaryService
生成摘要。 - 生成向量:对生成的摘要内容调用
MaxKbEmbeddingService
进行向量化处理。
- 生成摘要:对每个段落内容调用
- 事务管理:在数据库操作中使用事务,确保数据的一致性和完整性。在保存新的句子记录前,删除已有的相关记录,避免数据重复。
- 错误处理:对异步任务中的异常进行捕获和日志记录,确保系统的稳定性。
汇总前后的片段对比
为了直观展示片段汇总功能的效果,以下展示了在应用汇总功能前后,片段内容的对比。
汇总前 max_kb_paragraph_summary_cache
表内容示例
He preferred to stick to the essential facts—birth, death, and emancipation.
## Conclusion
Sometime in 1865, a Union soldier approached Ella in the middle of her chores.
“A soldier rode up to my ma and told her she was free.”
The starkness of Ella’s story stunned me.
Her life consisted of two essential facts—slavery and freedom juxtaposed to mark the beginning and end of the chronicle.
But this was what slavery did: it stripped your history to bare facts and
# Family Stories
I don’t know if it was the bare bones of Ella’s story or the hopefulness and despair that lurked in Poppa’s words as he recounted it, as if he were weighing the promise of freedom against the vast stretches of stolen land before him, that made me eager to know more than what Poppa remembered or wanted to share.
Peter and I listened, silent.
We didn’t know what to say.
Poppa didn’t remember any kin before his grandmother, who smoked a corncob pipe.
He had inherited his love of pipes from her.
It was one of the things I adored about him.
He always smelled sweet like the maple tobacco smoke rising from his pipe.
What he knew about our family ended with his grandmother Ellen.
He remembered no other names.
When he spoke of these things, I saw how the sadness and anger of not knowing his people distorted the soft lines of Poppa’s face.
It surprised me; he had always seemed invincible, strapping, six foot two, and handsome, even at eighty-five.
I had seen this ache in others too.
At a barbecue at my grandmother’s, two of her cousins nearly came to blows disputing a grandfather’s name.
I was still too young then to recognize the same feelings inside me.
But I wondered about my great-great-grandmother’s mother, as well as all the others who had been forgotten.
If Poppa’s mother or grandmother shared any details about their lives in slavery, he didn’t share them with my brother and me. No doubt he was unwilling to disclose what he considered unspeakable. Still, he shared more with my brother and me than he had with my mother. Even now, he liked to call her “little girl.” When I returned home and asked her if Poppa had ever spoken to her about slavery or her great-grandmother Ella, the girl on the road, she replied, “When I was growing up we didn’t talk of such things.” Her great-grandmother had died before she was born, so my mother recalled nothing about her, not even her name.
At twelve I became obsessed with the maternal great-great-grandmother I had never known, endlessly constructing and rearranging the scene: her unease as the soldier advanced toward her, or the soldier on horseback looming over her and the smile inching across her face as she digested his words, or the peal of laughter trailing behind her as she turned upon her heels, or the war between disbelief and wonder that overcame Ella as she bolted toward her mother. Mulling over the details Poppa had shared with me, I tried to fill in the blank spaces of the story, but I never succeeded. Since that afternoon with my great-grandfather, I had been looking for relatives whose only proof of existence was fragments of stories and names that repeated themselves across generations.
Unlike friends who possessed a great trove of family photographs, I had no idea what my great-grandmother looked like or even my grandaunts when they were girls. All these things were gone; some of the photographs were given as tokens to dead relatives and buried with them; others were lost. The images I possessed of them were drawn from memory and imagination. My aunt Mosella, whose name itself was a memorial to my great-grandfather Moses and his mother Ella, once described a photograph taken of her mother and my great-great-grandmother Polly, whom everyone called Big Momma. In the photograph, her mother, Lou, was wearing a ruffled dress with bloomers and seated on Big Momma’s lap. She didn’t remember what my great-great-grandmother wore but only echoed my mother’s description of her: she was a big-boned woman with a round face the color of dark chocolate.
# Big Momma's Stories
Big Momma had never spoken of her life in slavery, nor had Ellen or Ella. Poppa could fill in only the bare outlines of their lives. The gaps and silences of my family were not unusual: slavery made the past a mystery, unknown and unspeakable.
## The Tales of the Past
汇总后
### Summary of "Afrotopia"
#### Core Themes:
1. **Strangerhood and Identity**: The author's feeling of being an outsider in Ghana and the broader implications of being labeled as an "obruni" highlight themes of alienation and cultural disconnection.
2. **Historical Legacy of Slavery**: The narrative emphasizes the long-lasting impact of slavery on identity and belonging, conveying personal stories that interweave familial history with the collective trauma of African descendants.
3. **Disillusionment with Idealization**: The text critiques the romantic notions surrounding a return to Africa, revealing the realities faced by expatriates, and contrasts the expectations of a utopian African experience with the complexities of its history and current socio-economic challenges.
#### Key Information:
- **Experiences in Ghana**: The author navigates her feelings of isolation and alienation, grappling with the term "obruni" as a constant reminder of her foreignness.
- **Family History and Memory**: Personal anecdotes reflect the struggle to connect with ancestors’ experiences of slavery, and a longing to reclaim lost identities and narratives, despite the silence surrounding these histories.
- **Cultural Critique**: The text discusses the dichotomy of freedom and slavery, exploring how the legacies of colonialism and neocolonialism shape modern Ghanaian society and the expatriate experience.
- **Reflections on Independence**: The celebration of Ghana’s independence is juxtaposed with the ongoing struggles faced by its people, revealing the unmet expectations that arose from decolonization.
- **Coup Experience**: The author's encounter with a perceived coup highlights the tension and fears surrounding political instability, serving as a metaphor for the broader struggles of identity and belonging in a post-colonial context.
#### Conclusion:
The narrative underscores the complexities of returning to one's roots, blending personal, historical, and socio-political reflections to articulate the challenges and disillusionments faced by those navigating their identities in the wake of slavery and colonialism.
对比分析:
- 汇总前:原始内容较为冗长,涵盖了多个主题和细节,适合深入阅读和理解。然而,对于向量检索和生成模型来说,过长的文本可能导致上下文信息过载,影响检索准确性。
- 汇总后:通过结构化的总结,提炼出核心主题和关键信息,使内容更加简洁明了。这不仅有助于提高向量检索的效率和准确性,也为生成模型提供了更为聚焦和相关的上下文信息,提升复杂汇总类问题的回答质量。
结论
本项目成功构建了一个功能完备的 RAG 系统,涵盖了从用户认证、模型管理到知识库和文档管理的各个环节。通过引入片段汇总功能,显著提升了系统在处理复杂汇总类问题时的准确性和效率。未来,随着系统的不断优化和功能的扩展,RAG 系统将能够更好地服务于各类用户,满足其在信息检索和问答方面的多样化需求。