Crawling All the Data of a Static Website
This is a complete document describing and organizing the program that crawls all the data of a static website, with all of the original code preserved. The program's main goal is to store a static site's page data (HTML, PDF text, converted Markdown, and so on) in a database, as reference material for answering user questions with a large language model later. The project is built on litongjava/playwright-server and uses Java 21 virtual threads to improve concurrent crawling performance.
Project Background
The core function of this project is to crawl all the data of a static website: extract each page's HTML content, the text of PDF files, and a Markdown conversion of the page, then store all of it in a database. The system uses Playwright for page loading and rendering, combines Java 21 virtual threads with thread-pool management for highly concurrent crawling, and relies on third-party tools such as Jsoup, PDFBox, and Flexmark for data extraction and conversion.
Overall Architecture and Modules
The project is organized into the following modules, which together cover the full pipeline of crawling, converting, and storing a static site's data:
Configuration and initialization
At startup, the PlaywrightConfig class initializes the Playwright resources, including a fixed-size pool of browser contexts sized according to the environment. It also registers a shutdown hook so that resources are released automatically when the service exits.
Browser context pool management
The PlaywrightPool class, together with the internal PooledPage wrapper, pools BrowserContext instances. This avoids the cost of repeatedly creating and destroying browser contexts, and virtual threads keep concurrency efficient.
Crawler service and task scheduling
- Task entry point: CrawlController exposes an HTTP endpoint that starts the crawl.
- Scheduling and crawl logic: CrawlWebPageTask continuously pulls URLs from the task queue and handles HTML pages and PDF files separately. While parsing a page it uses Jsoup to walk the DOM and extract all valid links, which are added back to the queue so that every page under the same domain is eventually crawled.
- Concurrent execution: tasks run on the thread pool in TaskExecutorUtils, which uses Java virtual threads for efficient scheduling.
Data persistence
The WebPageDao class stores the raw HTML, the converted Markdown, and the extracted PDF text in the database. Records receive unique IDs generated with the snowflake algorithm and are written to tables such as web_page_cache.
Utilities and helpers
The project also includes several helper classes:
- MarkdownUtils: converts HTML to Markdown with FlexmarkHtmlConverter.
- PDFUtils: parses PDF documents with PDFBox and extracts their text.
- WebsiteUrlUtils: extracts domains and normalizes/canonicalizes URLs so that the same page is never crawled twice.
- TaskExecutorUtils: manages the crawler's thread pool, with sensible submission and rejection policies.
- Internal task model: CrawlTask records a pending URL and its crawl depth (for future extension).
Detailed Code by Module
The Java classes are presented below in a logical order, with explanations where needed.
1. Database Table Structure (SQL)
The SQL below creates the tables for cached page data, target-site page data, and the pending-URL task queue.
drop table if exists web_page_cache;
CREATE TABLE "public"."web_page_cache" (
"id" BIGINT NOT NULL PRIMARY KEY,
"url" VARCHAR UNIQUE,
"title" VARCHAR,
"type" VARCHAR,
"text" text,
"html" text,
"markdown" text,urlurl
"remark" VARCHAR(256),
"creator" VARCHAR(64) DEFAULT '',
"create_time" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
"updater" VARCHAR(64) DEFAULT '',
"update_time" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
"deleted" SMALLINT DEFAULT 0,
"tenant_id" BIGINT NOT NULL DEFAULT 0
);
CREATE INDEX "web_page_cache_url" ON "web_page_cache" USING btree ("url" varchar_pattern_ops);
CREATE INDEX "web_page_cache_title" ON "web_page_cache" USING btree ("title" varchar_pattern_ops);
drop table if exists hawaii_web_page;
CREATE TABLE "public"."hawaii_web_page" (
"id" BIGINT NOT NULL PRIMARY KEY,
"url" VARCHAR UNIQUE,
"title" VARCHAR,
"type" VARCHAR,
"text" text,
"html" text,
"markdown" text,
"remark" VARCHAR(256),
"creator" VARCHAR(64) DEFAULT '',
"create_time" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
"updater" VARCHAR(64) DEFAULT '',
"update_time" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
"deleted" SMALLINT DEFAULT 0,
"tenant_id" BIGINT NOT NULL DEFAULT 0
);
CREATE INDEX "hawaii_web_page_url" ON "hawaii_web_page" USING btree ("url" varchar_pattern_ops);
CREATE INDEX "hawaii_web_page_title" ON "hawaii_web_page" USING btree ("title" varchar_pattern_ops);
drop table if exists web_page_url;
CREATE TABLE "public"."web_page_url" (
"id" BIGINT NOT NULL PRIMARY KEY,
"url" VARCHAR UNIQUE,
"status" int default 0,
"tried" int default 0,
"remark" VARCHAR(256),
"creator" VARCHAR(64) DEFAULT '',
"create_time" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
"updater" VARCHAR(64) DEFAULT '',
"update_time" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
"deleted" SMALLINT DEFAULT 0,
"tenant_id" BIGINT NOT NULL DEFAULT 0
);
CREATE INDEX "web_page_url_url" ON "web_page_url" USING btree ("url" varchar_pattern_ops);
-- status: 0 = queued, 1 = in progress, 2 = finished; a URL still at status 0 after 3 tries is treated as failed (see CrawlWebPageTask)
drop table if exists hawaii_kapiolani_web_page;
CREATE TABLE "public"."hawaii_kapiolani_web_page" (
"id" BIGINT NOT NULL PRIMARY KEY,
"url" VARCHAR UNIQUE,
"title" VARCHAR,
"type" VARCHAR,
"text" text,
"html" text,
"markdown" text,
"remark" VARCHAR(256),
"creator" VARCHAR(64) DEFAULT '',
"create_time" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
"updater" VARCHAR(64) DEFAULT '',
"update_time" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
"deleted" SMALLINT DEFAULT 0,
"tenant_id" BIGINT NOT NULL DEFAULT 0
);
2. Configuration and Initialization
PlaywrightConfig starts Playwright according to the current environment, initializes the browser pool, and registers a hook that releases resources when the service shuts down.
package com.litongjava.playwright.config;
import com.litongjava.annotation.AConfiguration;
import com.litongjava.annotation.Initialization;
import com.litongjava.hook.HookCan;
import com.litongjava.playwright.pool.PlaywrightPool;
import com.litongjava.tio.utils.environment.EnvUtils;
import lombok.extern.slf4j.Slf4j;
@AConfiguration
@Slf4j
public class PlaywrightConfig {
@Initialization
public void config() {
if (EnvUtils.getBoolean("playwright.enable", true)) {
// start Playwright and the context pool
log.info("start init playwright");
if (EnvUtils.isDev()) {
PlaywrightPool.init(2);
} else {
int cpuCount = Runtime.getRuntime().availableProcessors();
PlaywrightPool.init(cpuCount);
}
log.info("end init playwright");
// on service shutdown, automatically close the browsers and Playwright instances
HookCan.me().addDestroyMethod(() -> {
PlaywrightPool.close();
});
}
}
}
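For reference, the playwright.enable switch read above comes from the environment configuration; a minimal sketch, assuming the app.properties file conventionally loaded by EnvUtils (the file name may differ in your setup):
playwright.enable=true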
3. Browser Pool Management
3.1 PlaywrightPool
This class initializes the Playwright instances, launches the browsers, creates a fixed number of BrowserContexts, and pools them in a BlockingQueue. A scheduled task periodically logs how many contexts are available.
package com.litongjava.playwright.pool;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import com.microsoft.playwright.Browser;
import com.microsoft.playwright.BrowserContext;
import com.microsoft.playwright.BrowserType;
import com.microsoft.playwright.Page;
import com.microsoft.playwright.Playwright;
import lombok.extern.slf4j.Slf4j;
import com.microsoft.playwright.BrowserType.LaunchOptions;
/**
* PlaywrightPool manages BrowserContext objects to avoid the overhead of creating them repeatedly.
* Each acquirePage() call takes a BrowserContext from the pool and opens a Page on it;
* when the Page is closed, the BrowserContext is automatically returned to the pool.
*/
@Slf4j
public class PlaywrightPool {
private static BlockingQueue<BrowserContext> pool = null;
private static BlockingQueue<Playwright> playwrightPool = null;
private static BlockingQueue<Browser> browserPool = null;
private static int poolSize = 0;
public static LaunchOptions launchOptions = new BrowserType.LaunchOptions().setHeadless(true);
private static ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(Thread.ofVirtual().factory());
/**
* Initializes Playwright, the Browser instances, and a fixed number of BrowserContexts.
* Called once at startup.
*
* @param poolSize pool size
*/
public static void init(int poolSize) {
PlaywrightPool.poolSize = poolSize;
PlaywrightPool.pool = new ArrayBlockingQueue<>(poolSize);
PlaywrightPool.playwrightPool = new ArrayBlockingQueue<>(poolSize);
PlaywrightPool.browserPool = new ArrayBlockingQueue<>(poolSize);
for (int i = 0; i < poolSize; i++) {
Playwright playwright = Playwright.create();
Browser browser = playwright.chromium().launch(launchOptions);
BrowserContext context = browser.newContext();
playwrightPool.offer(playwright);
browserPool.offer(browser);
pool.offer(context);
}
scheduler.scheduleAtFixedRate(() -> {
log.info("PlaywrightPool - Available: {}/{}", PlaywrightPool.availableCount(), PlaywrightPool.totalCount());
}, 0, 30, TimeUnit.SECONDS);
}
/**
* Acquires a Page. Internally takes a BrowserContext from the pool and wraps
* the new Page as a PooledPage (which implements Page and AutoCloseable).
*
* @return a PooledPage; calling close() returns the BrowserContext to the pool
* @throws InterruptedException if interrupted while waiting for a free context
*/
public static Page acquirePage() throws InterruptedException {
BrowserContext context = pool.take();
Page page = context.newPage();
return new PooledPage(page, context, pool);
}
/**
* Number of BrowserContexts currently available in the pool.
*
* @return available count
*/
public static int availableCount() {
return pool.size();
}
/**
* Total size of the pool.
*
* @return pool size
*/
public static int totalCount() {
return poolSize;
}
/**
* Closes every BrowserContext in the pool as well as the Browser and Playwright instances.
*/
public static void close() {
for (BrowserContext context : pool) {
context.close();
}
for (Playwright playwright : playwrightPool) {
playwright.close();
}
for (Browser browser : browserPool) {
browser.close();
}
}
}
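Because acquirePage() returns a PooledPage that implements AutoCloseable, try-with-resources both closes the Page and returns its BrowserContext to the pool. A minimal usage sketch (the URL is a placeholder):
try (Page page = PlaywrightPool.acquirePage()) {
page.navigate("https://example.com");
System.out.println(page.title());
} catch (InterruptedException e) {
Thread.currentThread().interrupt(); // interrupted while waiting for a free context
}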
3.2 PooledPage
Wraps a Playwright Page so that closing the Page automatically returns its BrowserContext to the pool, simplifying resource management. The full class is long and omitted here.
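In its place, here is a minimal sketch of the idea (an assumption reconstructed from the description, not the original source); the real PooledPage also implements the full Page interface by delegating every method to the wrapped instance:
package com.litongjava.playwright.pool;
import java.util.concurrent.BlockingQueue;
import com.microsoft.playwright.BrowserContext;
import com.microsoft.playwright.Page;
// Sketch only: the real class additionally implements Page, delegating each call to 'page'.
public class PooledPage implements AutoCloseable {
private final Page page;
private final BrowserContext context;
private final BlockingQueue<BrowserContext> pool;
public PooledPage(Page page, BrowserContext context, BlockingQueue<BrowserContext> pool) {
this.page = page;
this.context = context;
this.pool = pool;
}
@Override
public void close() {
try {
page.close(); // close the Page itself
} finally {
pool.offer(context); // always return the BrowserContext, even if close() throws
}
}
}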
4. Data Persistence and DAO
WebPageDao writes page data (HTML, Markdown, and so on) to the database, using Striped locks to keep concurrent writes of the same URL safe.
package com.litongjava.playwright.dao;
import java.util.concurrent.locks.Lock;
import org.jsoup.Jsoup;
import com.google.common.util.concurrent.Striped;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.playwright.utils.WebsiteUrlUtils;
import com.litongjava.tio.utils.snowflake.SnowflakeIdUtils;
public class WebPageDao {
private final Striped<Lock> stripedLocks = Striped.lock(256);
public boolean exists(String cacheTableName, String field, String value) {
return Db.exists(cacheTableName, field, value);
}
public void saveMarkdown(String tableName, String url, String title, String type, String markdown) {
String canonical = WebsiteUrlUtils.canonicalizeUrl(url);
if (exists(tableName, "url", canonical)) {
return;
}
Lock lock = stripedLocks.get(canonical);
lock.lock();
try {
if (exists(tableName, "url", canonical)) {
return;
}
Row row = new Row().set("id", SnowflakeIdUtils.id()).set("url", canonical).set("title", title).set("type", type).set("markdown", markdown);
Db.save(tableName, row);
} finally {
lock.unlock();
}
}
/**
* Saves page content to the database (Striped locks prevent concurrent writes of the same URL).
*/
public void saveContent(String tableName, String url, String title, String type, String html) {
String canonical = WebsiteUrlUtils.canonicalizeUrl(url);
if (exists(tableName, "url", canonical)) {
return;
}
Lock lock = stripedLocks.get(canonical);
lock.lock();
try {
if (exists(tableName, "url", canonical)) {
return;
}
Row row = new Row().set("id", SnowflakeIdUtils.id()).set("url", canonical).set("title", title).set("type", type).set("html", html).set("text", Jsoup.parse(html).text());
Db.save(tableName, row);
} finally {
lock.unlock();
}
}
public void saveHtmlAndMarkdown(String tableName, String url, String title, String html, String markdown) {
String canonical = WebsiteUrlUtils.canonicalizeUrl(url);
if (exists(tableName, "url", canonical)) {
return;
}
Lock lock = stripedLocks.get(canonical);
lock.lock();
try {
if (exists(tableName, "url", canonical)) {
return;
}
Row row = new Row().set("id", SnowflakeIdUtils.id()).set("url", canonical).set("title", title).set("type", "html")
//
.set("html", html).set("markdown", markdown);
Db.save(tableName, row);
} finally {
lock.unlock();
}
}
}
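The check-lock-recheck pattern above is deliberate: the first exists() avoids locking for URLs that are already stored, and the second exists() inside the lock closes the race between two threads that passed the first check at the same time. A minimal usage sketch (table name and values are placeholders):
WebPageDao webPageDao = new WebPageDao();
// both saves canonicalize the URL first, so the trailing slash makes no difference
webPageDao.saveHtmlAndMarkdown("hawaii_web_page", "https://www.example.edu/about/",
"About", "<html><body>About us</body></html>", "About us");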
5. Crawler Service
5.1 WebPageCrawlService
Fetches page data, handling PDF files and HTML pages separately. If a URL already exists in the cache table, its data is returned straight from the database; otherwise a page from the browser pool navigates to the URL, waits for it to finish loading, extracts the HTML and title, and stores the result.
package com.litongjava.playwright.service;
import java.util.concurrent.locks.Lock;
import com.google.common.util.concurrent.Striped;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.model.web.WebPageContent;
import com.litongjava.playwright.consts.TableNames;
import com.litongjava.playwright.pool.PlaywrightPool;
import com.litongjava.playwright.utils.PDFUtils;
import com.litongjava.tio.utils.hutool.FilenameUtils;
import com.litongjava.tio.utils.snowflake.SnowflakeIdUtils;
import com.microsoft.playwright.Page;
import com.microsoft.playwright.options.LoadState;
public class WebPageCrawlService {
private final Striped<Lock> stripedLocks = Striped.lock(256);
/**
* Returns the text of a PDF, using the web_page_cache table as a cache.
*
* @param url the PDF URL
* @return the extracted text
*/
public String getPdfContent(String url) {
Lock lock = stripedLocks.get(url);
lock.lock();
try {
String sql = "select html from %s where type=? and url=?";
sql = String.format(sql, TableNames.web_page_cache);
String title = FilenameUtils.getBaseName(url);
// 1) try the database cache first
String content = Db.queryStr(sql, "pdf", url);
if (content != null) {
return content;
}
// 2) cache miss: download the PDF over HTTP and extract its text
content = PDFUtils.getContent(url);
content = content.replace("\u0000", "").trim();
Row row = new Row().set("id", SnowflakeIdUtils.id()).set("url", url).set("title", title).set("type", "pdf")
//
.set("html", content);
Db.save(TableNames.web_page_cache, row);
return content;
} finally {
lock.unlock();
}
}
public WebPageContent getHtml(String url) throws InterruptedException {
Lock lock = stripedLocks.get(url);
lock.lock();
try {
String sql = "select title,html from %s where type=? and url=?";
sql = String.format(sql, TableNames.web_page_cache);
// 1) try the database cache first
Row first = Db.findFirst(sql, "html", url);
if (first != null) {
String title = first.getStr("title");
String html = first.getString("html");
WebPageContent content = new WebPageContent(title, url).setContent(html);
return content;
}
// 2) cache miss: render the page with Playwright
try (Page page = PlaywrightPool.acquirePage()) {
// set the page timeouts to 1 minute (60000 ms)
page.setDefaultNavigationTimeout(60000);
page.setDefaultTimeout(60000);
// throttle the crawl rate
try {
Thread.sleep(500);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
// navigate to the target URL
page.navigate(encodeUrl(url));
// wait until the page reaches the network-idle and load states
page.waitForLoadState(LoadState.NETWORKIDLE);
page.waitForLoadState(LoadState.LOAD);
String html = page.content();
String title = page.title();
Row row = new Row().set("id", SnowflakeIdUtils.id()).set("url", url).set("title", title).set("type", "html")
//
.set("html", html);
Db.save(TableNames.web_page_cache, row);
return new WebPageContent(title, url, "", html);
}
} finally {
lock.unlock();
}
}
/**
* Pre-processes a URL, encoding characters in the path that are illegal in URLs.
*
* @param url the raw URL
* @return the processed URL
*/
private String encodeUrl(String url) {
if (url == null || url.isEmpty()) {
return url;
}
return url.replace("[", "%5B").replace("]", "%5D");
}
}
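A minimal usage sketch (the URL is a placeholder; getHtml() declares InterruptedException because it may block while waiting for a pooled context):
WebPageCrawlService service = new WebPageCrawlService();
try {
WebPageContent content = service.getHtml("https://www.example.edu/admissions");
System.out.println(content.getTitle());
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}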
5.2 CrawlWebPageTask
This class holds the core crawl logic. It pulls pending URLs from the database task queue and dispatches each one to the HTML or PDF handler depending on its type.
When processing an HTML page it also uses Jsoup to extract all valid links (keeping only those on the target domain) and adds them to the task queue, so that the entire site is eventually crawled.
package com.litongjava.playwright.service;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.litongjava.db.activerecord.Db;
import com.litongjava.jfinal.aop.Aop;
import com.litongjava.model.web.WebPageContent;
import com.litongjava.playwright.dao.WebPageDao;
import com.litongjava.playwright.model.WebPageUrl;
import com.litongjava.playwright.utils.MarkdownUtils;
import com.litongjava.playwright.utils.TaskExecutorUtils;
import com.litongjava.playwright.utils.WebsiteUrlUtils;
import com.litongjava.tio.utils.hutool.FilenameUtils;
import com.litongjava.tio.utils.hutool.StrUtil;
import com.litongjava.tio.utils.snowflake.SnowflakeIdUtils;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class CrawlWebPageTask {
WebPageCrawlService crawlWebPageService = Aop.get(WebPageCrawlService.class);
WebPageDao webPageDao = Aop.get(WebPageDao.class);
String tableName;
String baseUrl;
String baseDomain;
public CrawlWebPageTask(String url, String tableName) {
this.baseUrl = url;
this.tableName = tableName;
String domain = WebsiteUrlUtils.extractDomain(url);
this.baseDomain = domain;
this.addLink(url);
}
public void run() {
while (true) {
/**
* status: 0 = queued, 1 = in progress, 2 = finished
*/
String sql = "select id,url,status,tried from web_page_url where status=0 and tried < 3 limit 10";
List<WebPageUrl> list = WebPageUrl.dao.find(sql);
if (list.size() > 0) {
for (WebPageUrl webPageUrl : list) {
this.updateUrlStatusToRunning(webPageUrl);
TaskExecutorUtils.executor.submit(() -> {
try {
this.processUrl(webPageUrl);
} catch (Exception e) {
log.error(e.getMessage(), e);
}
});
}
}
try {
Thread.sleep(1000L);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
return;
}
}
}
public void processUrl(WebPageUrl webPageUrl) {
String url = "https://" + webPageUrl.getUrl();
// handle PDF and HTML in separate branches
if (url.endsWith(".pdf")) {
try {
// process a PDF file
String content = Aop.get(WebPageCrawlService.class).getPdfContent(url);
String filename = FilenameUtils.getBaseName(url);
if (content != null) {
content = content.replace("\u0000", "").trim();
try {
webPageDao.saveMarkdown(tableName, url, filename, "pdf", content);
updateUrlStatusToFinished(webPageUrl);
} catch (Exception e) {
log.error("Failed to save:{},{}", url, filename, e);
}
}
} catch (Exception e) {
log.error("PDF processing failed for URL {}: {}", url, e.getMessage());
updateFailureCount(webPageUrl);
}
} else {
try {
String canonical = WebsiteUrlUtils.canonicalizeUrl(url);
if (webPageDao.exists(tableName, "url", canonical)) {
return;
}
log.info("Processing URL: {} (Attempt {})", url, webPageUrl.getTried());
WebPageContent webPage = Aop.get(WebPageCrawlService.class).getHtml(url);
String title = webPage.getTitle();
String html = webPage.getContent();
// parse the HTML with the base URL so relative links resolve to absolute URLs
Document document = Jsoup.parse(html, baseUrl);
Element body = document.body();
String bodyHtml = body.html();
String markdown = MarkdownUtils.toMd(bodyHtml);
try {
webPageDao.saveHtmlAndMarkdown(tableName, url, title, html, markdown);
updateUrlStatusToFinished(webPageUrl);
} catch (Exception e) {
log.error("Failed to save:{},{}", url, title, e);
}
Set<String> links = extractValidLinks(document);
for (String link : links) {
addLink(link);
}
} catch (Exception e) {
log.error(e.getMessage(), e);
updateFailureCount(webPageUrl);
}
}
}
public void addLink(String link) {
String canonical = WebsiteUrlUtils.canonicalizeUrl(link);
boolean urlExists = webPageDao.exists("web_page_url", "url", canonical);
if (!urlExists) {
new WebPageUrl().setId(SnowflakeIdUtils.id()).setUrl(canonical).save();
}
}
/**
* Uses Jsoup to extract all valid links from the DOM (drops fragments, keeps only same-domain links).
*/
public Set<String> extractValidLinks(Document doc) {
Set<String> links = new HashSet<>();
Elements elements = doc.select("a[href]");
for (Element el : elements) {
String absUrl = el.absUrl("href");
String normalized = WebsiteUrlUtils.normalizeUrl(absUrl);
if (StrUtil.isBlank(normalized)) {
continue;
}
if (isSameDomain(normalized)) {
links.add(normalized);
}
}
return links;
}
/**
* Checks whether a URL belongs to the base domain.
*/
private boolean isSameDomain(String url) {
String domain = WebsiteUrlUtils.extractDomain(url);
return domain.equalsIgnoreCase(baseDomain);
}
private void updateUrlStatusToRunning(WebPageUrl webPageUrl) {
Db.updateBySql("update web_page_url set tried=1,status=1 where id=?", webPageUrl.getId());
}
private void updateUrlStatusToFinished(WebPageUrl webPageUrl) {
Db.updateBySql("update web_page_url set tried=1,status=2 where id=?", webPageUrl.getId());
}
private void updateFailureCount(WebPageUrl webPageUrl) {
Integer tried = webPageUrl.getTried();
tried++;
Db.updateBySql("update web_page_url set status =0,tried=? where id=?", tried, webPageUrl.getId());
}
}
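Normally the task is started through the controller shown in the next section, but a minimal standalone sketch looks like this (URL and table name are placeholders); note that run() blocks the calling thread and polls the web_page_url queue once per second:
CrawlWebPageTask task = new CrawlWebPageTask("https://www.example.edu/", "hawaii_web_page");
task.run(); // blocks: dispatches batches of up to 10 queued URLs to the executor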
6. Controller Entry Point
CrawlController exposes an HTTP GET endpoint; a request to it launches the crawl of the target site.
package com.litongjava.playwright.controller;
import com.litongjava.annotation.Get;
import com.litongjava.annotation.RequestPath;
import com.litongjava.model.body.RespBodyVo;
import com.litongjava.playwright.consts.TableNames;
import com.litongjava.playwright.service.CrawlWebPageTask;
import com.litongjava.tio.utils.thread.TioThreadUtils;
import lombok.extern.slf4j.Slf4j;
@RequestPath("/crawl")
@Slf4j
public class CrawlController {
@Get("/hawaii_kapiolani_web_page")
public RespBodyVo index() {
TioThreadUtils.execute(() -> {
String url = "https://www.kapiolani.hawaii.edu/";
// constructing CrawlWebPageTask seeds the URL queue; run() starts the crawl loop
CrawlWebPageTask crawlWebPageTask = new CrawlWebPageTask(url, TableNames.hawaii_kapiolani_web_page);
try {
crawlWebPageTask.run();
} catch (Exception e) {
log.error(e.getMessage(), e);
}
});
return RespBodyVo.ok();
}
}
7. Utilities and Helpers
7.1 MarkdownUtils
Converts HTML content to Markdown with Flexmark.
package com.litongjava.playwright.utils;
import com.vladsch.flexmark.html2md.converter.FlexmarkHtmlConverter;
public class MarkdownUtils {
public static FlexmarkHtmlConverter converter = FlexmarkHtmlConverter.builder().build();
public static String toMd(String html) {
return converter.convert(html);
}
}
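A minimal usage sketch (the exact output depends on Flexmark's default converter options):
String md = MarkdownUtils.toMd("<h1>Admissions</h1><p>Apply <a href=\"/apply\">here</a>.</p>");
// roughly: "# Admissions" followed by "Apply [here](/apply)."
System.out.println(md);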
7.2 PDFUtils
Parses PDF files with PDFBox and extracts their text content.
package com.litongjava.playwright.utils;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import com.litongjava.tio.utils.http.HttpDownloadUtils;
import com.litongjava.tio.utils.url.UrlUtils;
public class PDFUtils {
public static String getContent(String rawUrl) {
try {
String url = UrlUtils.encodeUrl(rawUrl);
ByteArrayOutputStream download = HttpDownloadUtils.download(url, null);
try (ByteArrayInputStream inputStream = new ByteArrayInputStream(download.toByteArray()); PDDocument document = PDDocument.load(inputStream)) {
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
return text;
}
} catch (IOException e) {
throw new RuntimeException(e);
}
}
}
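One compatibility note: PDDocument.load(InputStream) is the PDFBox 2.x API (PDFBox 3.x replaced it with Loader.loadPDF), so this class assumes a 2.x dependency. A minimal usage sketch (the URL is a placeholder):
String text = PDFUtils.getContent("https://www.example.edu/catalog/2024.pdf");
System.out.println(text.substring(0, Math.min(200, text.length()))); // preview the first 200 characters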
7.3 TaskExecutorUtils
Manages the crawler's thread pool, executing tasks on Java virtual threads with a bounded queue and a rejection policy that applies back-pressure.
package com.litongjava.playwright.utils;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
public class TaskExecutorUtils {
public static int cpuCount = Runtime.getRuntime().availableProcessors();
private static AtomicLong threadCounter = new AtomicLong(0);
private static int queueCapacity = 100;
public static ExecutorService executor;
static {
executor = new ThreadPoolExecutor(cpuCount, // corePoolSize
cpuCount, // maximumPoolSize
0L, // keepAliveTime
TimeUnit.MILLISECONDS, // time unit
new ArrayBlockingQueue<>(queueCapacity), //
runnable -> {
Thread t = Thread.ofVirtual().factory().newThread(runnable);
t.setName("crawl-thread-" + threadCounter.getAndIncrement());
return t;
},
new RejectedExecutionHandler() {
@Override
public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
try {
executor.getQueue().put(r);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new RejectedExecutionException("Task submission interrupted", e);
}
}
});
}
}
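The rejection handler is the interesting part: instead of throwing when the 100-slot queue is full, it blocks the submitting thread with queue.put(), turning a full queue into back-pressure on the polling loop in CrawlWebPageTask. A minimal usage sketch:
TaskExecutorUtils.executor.submit(() -> {
// crawl one URL here; if the queue is full, the submitting thread blocks until space frees up
System.out.println("processing on " + Thread.currentThread().getName());
});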
7.4 WebsiteUrlUtils
Provides domain extraction plus URL normalization and canonicalization, ensuring the same page is not crawled twice.
package com.litongjava.playwright.utils;
import java.net.URL;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class WebsiteUrlUtils {
/**
* Extracts the domain from a URL (without the www. prefix)
*/
public static String extractDomain(String url) {
try {
URL netUrl = new URL(url);
String host = netUrl.getHost();
if (host != null) {
return host.startsWith("www.") ? host.substring(4) : host;
}
} catch (Exception e) {
log.error("Error extracting domain from url: {}", url, e);
}
return "";
}
/**
* Produces the canonical form of a URL by stripping the protocol (http/https)
*/
public static String canonicalizeUrl(String url) {
String normalized = normalizeUrl(url);
return normalized.replaceFirst("(?i)^(https?://)", "");
}
/**
* Normalizes a URL: removes the fragment, strips the trailing slash, and trims whitespace
*/
public static String normalizeUrl(String url) {
if (url == null || url.isEmpty())
return "";
int index = url.indexOf("#");
if (index != -1) {
url = url.substring(0, index);
}
url = url.trim();
if (url.endsWith("/") && url.length() > 1) {
url = url.substring(0, url.length() - 1);
}
return url;
}
}
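A few worked examples of the rules above (a sketch; the outputs follow directly from the code):
WebsiteUrlUtils.normalizeUrl("https://example.edu/catalog/#top"); // -> "https://example.edu/catalog"
WebsiteUrlUtils.canonicalizeUrl("HTTPS://example.edu/catalog/"); // -> "example.edu/catalog"
WebsiteUrlUtils.extractDomain("https://www.example.edu/catalog"); // -> "example.edu"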
8. Internal Task Model
CrawlTask records a pending URL and its crawl depth, leaving room to extend the scheduling logic later.
package com.litongjava.playwright.vo;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
/**
* Internal task model: a URL and its crawl depth
*/
@Data
@NoArgsConstructor
@AllArgsConstructor
public class CrawlTask {
private String url;
private int depth;
}
Summary
This document has shown how to build a static-website crawler with Playwright and Java 21 virtual threads. It covered the database schema, system configuration, browser-context pool management, task scheduling, page fetching, data persistence, and the supporting utility classes, with complete code and an explanation of each module's role. The system is designed for highly concurrent crawling while also handling deduplication, error recovery, and data conversion, providing a solid data foundation for LLM applications.