
Storing Douyin and Baidu Hot Search Data with MySQL, MyBatis Generator, and Java Crawlers

This tutorial explains how to design a MySQL table for hot‑search records, generate Java entity and mapper classes using MyBatis Generator, create unique IDs for each entry, and implement scheduled Java crawlers for Douyin and Baidu hot‑search data that persist the results via Spring Boot services.

Rare Earth Juejin Tech Community

The article begins with a brief recap of previous steps (setting up an Alibaba Cloud server, JDK, Redis, MySQL) and introduces the goal of persisting crawled hot‑search data into a database for front‑end consumption.

Table Design

A MySQL table t_sbmy_hot_search is defined with fields such as hot_search_id, hot_search_title, hot_search_url, and hot_search_heat. The full CREATE TABLE statement:

CREATE TABLE `t_sbmy_hot_search` (
  `id` bigint(20) unsigned zerofill NOT NULL AUTO_INCREMENT COMMENT 'physical primary key',
  `hot_search_id` varchar(255) DEFAULT NULL COMMENT 'hot search ID',
  `hot_search_excerpt` text COMMENT 'hot search excerpt',
  `hot_search_heat` varchar(255) DEFAULT NULL COMMENT 'hot search heat',
  `hot_search_title` varchar(2048) DEFAULT NULL COMMENT 'hot search title',
  `hot_search_url` text COMMENT 'hot search URL',
  `hot_search_cover` text COMMENT 'hot search cover image',
  `hot_search_author` varchar(255) DEFAULT NULL COMMENT 'hot search author',
  `hot_search_author_avatar` text COMMENT 'hot search author avatar',
  `hot_search_resource` varchar(255) DEFAULT NULL COMMENT 'hot search source',
  `hot_search_order` int DEFAULT NULL COMMENT 'hot search rank',
  `gmt_create` datetime DEFAULT NULL COMMENT 'creation time',
  `gmt_modified` datetime DEFAULT NULL COMMENT 'modification time',
  `creator_id` bigint DEFAULT NULL COMMENT 'creator ID',
  `modifier_id` bigint DEFAULT NULL COMMENT 'modifier ID',
  PRIMARY KEY (`id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;

Generating Java Objects with MyBatis Generator

The author recommends using the MyBatis Generator plugin in IntelliJ IDEA. The steps include creating a generator folder, adding config.properties (JDBC URL, user, password, table name, entity name, mapper name), and a generatorConfiguration.xml that specifies the target packages, the classpath entry for the MySQL driver, and the plugin configuration.
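The article does not reproduce the generatorConfiguration.xml itself, but a typical MyBatis Generator configuration for this table might look roughly like the sketch below. The targetProject paths, package names, and property keys are assumptions, not taken from the original.

```xml
<!DOCTYPE generatorConfiguration PUBLIC
  "-//mybatis.org//DTD MyBatis Generator Configuration 1.0//EN"
  "http://mybatis.org/dtd/mybatis-generator-config_1_0.dtd">
<generatorConfiguration>
  <!-- JDBC URL, credentials, and driver path are kept in config.properties -->
  <properties resource="config.properties"/>
  <classPathEntry location="${jdbc.driverLocation}"/>
  <context id="default" targetRuntime="MyBatis3">
    <jdbcConnection driverClass="com.mysql.cj.jdbc.Driver"
                    connectionURL="${jdbc.url}"
                    userId="${jdbc.user}"
                    password="${jdbc.password}"/>
    <!-- entity classes -->
    <javaModelGenerator targetPackage="com.summo.sbmy.dao.entity"
                        targetProject="src/main/java"/>
    <!-- XML mapper files -->
    <sqlMapGenerator targetPackage="mapper"
                     targetProject="src/main/resources"/>
    <!-- mapper interfaces -->
    <javaClientGenerator type="XMLMAPPER"
                         targetPackage="com.summo.sbmy.dao.mapper"
                         targetProject="src/main/java"/>
    <table tableName="t_sbmy_hot_search" domainObjectName="SbmyHotSearchDO"/>
  </context>
</generatorConfiguration>
```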

After running the generator, the following core classes are produced (or manually added):

package com.summo.sbmy.dao.entity;

import java.util.Date;
import javax.persistence.*;
import com.baomidou.mybatisplus.annotation.*;
import com.summo.sbmy.dao.AbstractBaseDO;
import lombok.*;

@Getter
@Setter
@TableName("t_sbmy_hot_search")
@NoArgsConstructor
@AllArgsConstructor
@Builder
@ToString
public class SbmyHotSearchDO extends AbstractBaseDO {
    @TableId(type = IdType.AUTO)
    private Long id;
    @Column(name = "hot_search_title")
    private String hotSearchTitle;
    @Column(name = "hot_search_author")
    private String hotSearchAuthor;
    @Column(name = "hot_search_resource")
    private String hotSearchResource;
    @Column(name = "hot_search_order")
    private Integer hotSearchOrder;
    @Column(name = "hot_search_id")
    private String hotSearchId;
    @Column(name = "hot_search_heat")
    private String hotSearchHeat;
    @Column(name = "hot_search_url")
    private String hotSearchUrl;
    @Column(name = "hot_search_cover")
    private String hotSearchCover;
    @Column(name = "hot_search_author_avatar")
    private String hotSearchAuthorAvatar;
    @Column(name = "hot_search_excerpt")
    private String hotSearchExcerpt;
}

Additional supporting classes such as SbmyHotSearchMapper, SbmyHotSearchRepository, AbstractBaseDO, and MetaObjectHandlerConfig are shown, handling CRUD operations, automatic timestamp filling, and MyBatis mapping.
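The article does not reproduce AbstractBaseDO in this excerpt, but from the gmt_create/gmt_modified columns its shape can be inferred. The sketch below is a plain-Java approximation; in the real project the fields would carry MyBatis-Plus fill annotations (shown here only as comments) so that MetaObjectHandlerConfig can populate them automatically.

```java
import java.io.Serializable;
import java.util.Date;

// Base class holding the audit columns shared by all DO classes.
// In the real project these fields would be annotated with something like
// @TableField(fill = FieldFill.INSERT) / FieldFill.INSERT_UPDATE so the
// MetaObjectHandlerConfig bean fills them on insert and update.
class AbstractBaseDO implements Serializable {

    // maps to gmt_create: set once when the row is inserted
    private Date gmtCreate;
    // maps to gmt_modified: refreshed on every update
    private Date gmtModified;

    public Date getGmtCreate() { return gmtCreate; }
    public void setGmtCreate(Date gmtCreate) { this.gmtCreate = gmtCreate; }
    public Date getGmtModified() { return gmtModified; }
    public void setGmtModified(Date gmtModified) { this.gmtModified = gmtModified; }
}
```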

Unique ID Generation

Because Douyin does not expose a stable ID for each entry, a deterministic UUID is derived from the title's hash:

public static String getHashId(String title) {
    // Seed a PRNG with the title's hashCode so the two longs,
    // and therefore the UUID, are fully determined by the title.
    long seed = title.hashCode();
    Random rnd = new Random(seed);
    return new UUID(rnd.nextLong(), rnd.nextLong()).toString();
}

The same title always yields the same UUID, ensuring idempotent inserts.
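The determinism is easy to check; this stdlib-only demo repeats the method above (the sample titles are arbitrary):

```java
import java.util.Random;
import java.util.UUID;

class HashIdDemo {
    // Same logic as getHashId above: seed a PRNG with the title's hashCode,
    // then build a UUID from two deterministically generated longs.
    public static String getHashId(String title) {
        long seed = title.hashCode();
        Random rnd = new Random(seed);
        return new UUID(rnd.nextLong(), rnd.nextLong()).toString();
    }

    public static void main(String[] args) {
        String a = getHashId("抖音-某个热搜标题");
        String b = getHashId("抖音-某个热搜标题");
        // The same title always maps to the same ID, so a re-crawled
        // entry produces a duplicate hot_search_id and is filtered out.
        System.out.println(a.equals(b)); // true
        System.out.println(getHashId("另一个标题")); // a different, but stable, UUID
    }
}
```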

Data Storage Flow

The service method saveCache2DB checks for existing IDs, filters duplicates, logs the number of new records, and performs a batch insert.

@Override
public Boolean saveCache2DB(List<SbmyHotSearchDO> sbmyHotSearchDOS) {
    if (CollectionUtils.isEmpty(sbmyHotSearchDOS)) {
        return Boolean.TRUE;
    }
    List<String> searchIdList = sbmyHotSearchDOS.stream()
        .map(SbmyHotSearchDO::getHotSearchId)
        .collect(Collectors.toList());
    List<SbmyHotSearchDO> existing = sbmyHotSearchRepository.list(
        new QueryWrapper<SbmyHotSearchDO>().lambda().in(SbmyHotSearchDO::getHotSearchId, searchIdList));
    if (CollectionUtils.isNotEmpty(existing)) {
        List<String> existingIds = existing.stream()
            .map(SbmyHotSearchDO::getHotSearchId)
            .collect(Collectors.toList());
        sbmyHotSearchDOS = sbmyHotSearchDOS.stream()
            .filter(d -> !existingIds.contains(d.getHotSearchId()))
            .collect(Collectors.toList());
    }
    if (CollectionUtils.isEmpty(sbmyHotSearchDOS)) {
        return Boolean.TRUE;
    }
    log.info("Inserting [{}] new records this run", sbmyHotSearchDOS.size());
    return sbmyHotSearchRepository.saveBatch(sbmyHotSearchDOS);
}
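Stripped of the MyBatis-Plus calls, the duplicate-filtering step reduces to a stream filter against the set of already-stored IDs. The stdlib-only illustration below uses plain strings in place of DO objects; note it swaps the List for a HashSet, which avoids the linear `contains` scan per element in the original:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class DedupDemo {
    // Mirrors saveCache2DB: keep only incoming IDs that are not yet stored.
    public static List<String> filterNew(List<String> incomingIds, List<String> existingIds) {
        Set<String> existing = new HashSet<>(existingIds); // O(1) lookups instead of List.contains
        return incomingIds.stream()
            .filter(id -> !existing.contains(id))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> incoming = Arrays.asList("a", "b", "c");
        List<String> stored = Arrays.asList("b");
        System.out.println(filterNew(incoming, stored)); // [a, c]
    }
}
```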

Douyin Hot‑Search Crawler

A scheduled Spring component fetches JSON from Douyin, extracts fields, builds SbmyHotSearchDO objects, generates IDs, and persists them.

@Component
@Slf4j
public class DouyinHotSearchJob {
    @Autowired
    private SbmyHotSearchService sbmyHotSearchService;

    @Scheduled(fixedRate = 1000 * 60 * 60)
    public void hotSearch() {
        try {
            OkHttpClient client = new OkHttpClient().newBuilder().build();
            Request request = new Request.Builder()
                .url("https://www.iesdouyin.com/web/api/v2/hotsearch/billboard/word/")
                .method("GET", null).build();
            Response response = client.newCall(request).execute();
            JSONObject jsonObject = JSONObject.parseObject(response.body().string());
            JSONArray array = jsonObject.getJSONArray("word_list");
            List<SbmyHotSearchDO> list = Lists.newArrayList();
            for (int i = 0; i < array.size(); i++) {
                JSONObject obj = (JSONObject) array.get(i);
                // "do" is a reserved word in Java, so the entity variable needs another name
                SbmyHotSearchDO hotSearchDO = SbmyHotSearchDO.builder()
                    .hotSearchResource(DOUYIN.getCode())
                    .build();
                hotSearchDO.setHotSearchTitle(obj.getString("word"));
                hotSearchDO.setHotSearchId(getHashId(DOUYIN.getCode() + hotSearchDO.getHotSearchTitle()));
                hotSearchDO.setHotSearchUrl("https://www.douyin.com/search/" + hotSearchDO.getHotSearchTitle() + "?type=general");
                hotSearchDO.setHotSearchHeat(obj.getString("hot_value"));
                hotSearchDO.setHotSearchOrder(i + 1);
                list.add(hotSearchDO);
            }
            sbmyHotSearchService.saveCache2DB(list);
        } catch (IOException e) {
            log.error("Failed to fetch Douyin hot search data", e);
        }
    }

    public static String getHashId(String title) { /* same as above */ }
}
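Both jobs reference DOUYIN and BAIDU constants with getCode() and getDesc() accessors, but the enum itself is not shown in the excerpt. A plausible stand-in might look like this; the class name, codes, and descriptions are guesses, not taken from the original source.

```java
// Hypothetical source enum; the real class name and constant values
// are not reproduced in the article excerpt.
enum HotSearchEnum {
    DOUYIN("douyin", "抖音"),
    BAIDU("baidu", "百度");

    private final String code; // stored in hot_search_resource and mixed into the hash ID
    private final String desc; // human-readable platform name

    HotSearchEnum(String code, String desc) {
        this.code = code;
        this.desc = desc;
    }

    public String getCode() { return code; }
    public String getDesc() { return desc; }
}
```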

Baidu Hot‑Search Crawler

Another scheduled job uses Jsoup to parse the Baidu hot‑search HTML page, extracts title, image, excerpt, URL, and heat index, creates entities, and stores them.

@Component
@Slf4j
public class BaiduHotSearchJob {
    @Autowired
    private SbmyHotSearchService sbmyHotSearchService;

    @Scheduled(fixedRate = 1000 * 60 * 60)
    public void hotSearch() {
        try {
            String url = "https://top.baidu.com/board?tab=realtime&sa=fyb_realtime_31065";
            Document doc = Jsoup.connect(url).get();
            Elements titles = doc.select(".c-single-text-ellipsis");
            Elements imgs = doc.select(".category-wrap_iQLoo .index_1Ew5p").next("img");
            Elements contents = doc.select(".hot-desc_1m_jR.large_nSuFU");
            Elements urls = doc.select(".category-wrap_iQLoo a.img-wrapper_29V76");
            Elements levels = doc.select(".hot-index_1Bl1a");
            List<SbmyHotSearchDO> list = new ArrayList<>();
            for (int i = 0; i < levels.size(); i++) {
                SbmyHotSearchDO hotSearchDO = SbmyHotSearchDO.builder()
                    .hotSearchResource(BAIDU.getCode())
                    .build();
                hotSearchDO.setHotSearchTitle(titles.get(i).text().trim());
                hotSearchDO.setHotSearchId(getHashId(BAIDU.getDesc() + hotSearchDO.getHotSearchTitle()));
                hotSearchDO.setHotSearchCover(imgs.get(i).attr("src"));
                // strip the "see more" suffix (查看更多>) that Baidu appends to each excerpt
                hotSearchDO.setHotSearchExcerpt(contents.get(i).text().replaceAll("查看更多>", ""));
                hotSearchDO.setHotSearchUrl(urls.get(i).attr("href"));
                hotSearchDO.setHotSearchHeat(levels.get(i).text().trim());
                hotSearchDO.setHotSearchOrder(i + 1);
                list.add(hotSearchDO);
            }
            sbmyHotSearchService.saveCache2DB(list);
        } catch (IOException e) {
            log.error("Failed to fetch Baidu hot search data", e);
        }
    }

    public static String getHashId(String title) { /* same as above */ }
}
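Both crawlers lean on third-party libraries: OkHttp and fastjson for the Douyin JSON API, Jsoup for the Baidu HTML page, and Guava's Lists helper. If you are reproducing the setup, the Maven coordinates would look roughly like the sketch below; the versions are recent published releases chosen as placeholders, since the article does not state which versions it uses (the Spring Boot, MyBatis-Plus, and Lombok dependencies are also required but omitted here).

```xml
<dependency>
  <groupId>com.squareup.okhttp3</groupId>
  <artifactId>okhttp</artifactId>
  <version>4.12.0</version>
</dependency>
<dependency>
  <groupId>com.alibaba</groupId>
  <artifactId>fastjson</artifactId>
  <version>1.2.83</version>
</dependency>
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.17.2</version>
</dependency>
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>32.1.3-jre</version>
</dependency>
```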

Conclusion

The tutorial demonstrates end‑to‑end backend development: designing a relational schema, auto‑generating Java data‑access layers with MyBatis, creating deterministic IDs, and implementing scheduled crawlers for two major Chinese platforms, all integrated into a Spring Boot service that safely persists new hot‑search records.

Tags: backend, Java, Spring Boot, MySQL, MyBatis, database design, web crawling
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
