
Document Rendering and Structured Extraction Techniques in Baidu Wenku

Baidu Wenku converts every document type to PDF and parses the PDF into a proprietary format. PC rendering uses absolute-position layout. For mobile devices, the layout is transformed into flow-type structural data by re-flowing layout elements, extracting OOXML structures, and detecting charts, which enables adaptive layouts, accurate formula rendering, and interactive chart extraction.

Baidu Geek Talk

Baidu Wenku stores billions of documents of various formats (Word, PPT, Excel, PDF, etc.). The core service is document transcoding and rendering.

To unify the processing of dozens of document types, Baidu Wenku converts every document to PDF, parses the openly documented PDF format, and then generates its own proprietary document format, which drives layout and rendering on both PC and mobile.

PC rendering uses the xreader layout data derived from PDF, where each element (text, image, vector) carries coordinate, width, height, and other descriptive information. This absolute‑position layout reproduces the original document with high fidelity on desktop screens.

Mobile devices have much smaller screens; scaling the layout data proportionally makes text and formulas unreadably small, as shown in Figure 1. Therefore, a more suitable approach is to transform layout data into flow‑type data that adapts to different screen sizes.
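To make the scaling problem concrete, here is a rough back-of-the-envelope sketch. The numbers are assumptions chosen for illustration, not figures from the article: an A4 page 595 pt wide, 10.5 pt body text, and a phone viewport 360 units wide, treating one point as one screen unit for simplicity.

```python
def scaled_font_size(font_pt: float, page_width_pt: float, screen_width: float) -> float:
    """Font size after shrinking the whole page uniformly to fit the screen width."""
    scale = screen_width / page_width_pt
    return font_pt * scale


# Assumed values: A4 page width, typical body text, narrow phone viewport.
size = scaled_font_size(font_pt=10.5, page_width_pt=595.0, screen_width=360.0)
print(f"{size:.1f}")  # prints 6.4
```

Under these assumptions body text shrinks to roughly 60% of its original size, and subscripts or small formula symbols shrink further still, which is why proportional scaling alone is not enough.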

Flow data discards absolute coordinates and retains structural information such as sections, paragraphs, tables, formulas, and charts. This structural representation enables adaptive re‑layout on various devices.

2.1 Retype flow data (based on xreader layout) – The early “layout‑to‑flow” solution iterates over each element in the xreader layout and reads its x, y, width (w), and height (h). Elements with similar y values are treated as being on the same line; adjacent elements are merged into lines, and lines are merged into paragraphs based on spacing, line width, punctuation, and indentation cues. For complex pages (multi‑column papers, tables, footnotes), a pre‑processing step first splits the page into multiple ranges, and the same algorithm is then applied within each range.
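The line and paragraph merging described above can be sketched roughly as follows. This is a minimal illustration, not Baidu Wenku's implementation: the `Element` class, the function names, and the thresholds are all hypothetical, only the spacing and indentation cues are used (line width and punctuation are left out), and y is assumed to grow downward from the top of the page.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for one xreader layout element."""
    text: str
    x: float
    y: float
    w: float
    h: float


def group_into_lines(elements, y_tolerance=2.0):
    """Group elements whose y is within y_tolerance of the line's first element."""
    lines = []
    for el in sorted(elements, key=lambda e: (e.y, e.x)):
        if lines and abs(lines[-1][0].y - el.y) <= y_tolerance:
            lines[-1].append(el)
        else:
            lines.append([el])
    for line in lines:
        line.sort(key=lambda e: e.x)  # reading order within a line
    return lines


def merge_into_paragraphs(lines, max_gap=6.0, indent=10.0):
    """Start a new paragraph on an indented line or after a large vertical gap."""
    if not lines:
        return []
    margin = min(line[0].x for line in lines)  # left margin of the text block
    paragraphs = []
    prev_bottom = None
    for line in lines:
        text = " ".join(e.text for e in line)
        top = line[0].y
        indented = line[0].x - margin >= indent
        far_below = prev_bottom is not None and top - prev_bottom > max_gap
        if not paragraphs or indented or far_below:
            paragraphs.append(text)
        else:
            paragraphs[-1] += " " + text
        prev_bottom = max(e.y + e.h for e in line)
    return paragraphs


elements = [
    Element("well.", 55, 22, 30, 10),
    Element("Flow", 20, 10, 30, 10),
    Element("paragraph.", 55, 34, 60, 10),
    Element("layout", 55, 10.5, 40, 10),
    Element("adapts", 10, 22, 40, 10),
    Element("Next", 20, 34, 30, 10),
]
paragraphs = merge_into_paragraphs(group_into_lines(elements))
print(paragraphs)  # ['Flow layout adapts well.', 'Next paragraph.']
```

Fixed thresholds like these are fragile in practice; a more robust variant would derive them from the font size or the median line spacing of the page.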

This method recovers paragraph and line structure, but not perfectly. Observed problems include forced line breaks, misplaced inline images, and weak extraction of formulas, charts, and tables.

2.2 BDJson flow data (based on OOXML) – Microsoft Office documents in the binary DOC format are first converted to the OOXML‑based DOCX format. A DOCX file is a ZIP archive of XML files; parsing the main part, document.xml, yields sections, paragraphs, tables, and other structural metadata directly. Headers, footers, footnotes, and endnotes are referenced from document.xml and assembled from their respective XML parts. Lists, numbering, and merged table cells are mapped to the corresponding HTML list elements and table structures. Formulas from Word (field formulas, MathType, OMML) are converted to LaTeX for uniform rendering.
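As a minimal sketch of the first step, the snippet below opens a DOCX as a ZIP archive and reads paragraph text from the main document part using only the Python standard library. It assumes the main part lives at the conventional path `word/document.xml`; the function names are hypothetical, and the BDJson output format itself is not reproduced here. The helper `build_minimal_docx` exists only to make the example self-contained and does not produce a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by w:p (paragraph) and w:t (text run).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_bytes):
    """Return the text of every w:p element, including those inside tables."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


def build_minimal_docx(paragraph_texts):
    """Build an in-memory archive holding only word/document.xml (demo only)."""
    body = "".join(
        f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>" for text in paragraph_texts
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


sample = build_minimal_docx(["First paragraph", "Second paragraph"])
print(extract_paragraphs(sample))  # ['First paragraph', 'Second paragraph']
```

Because the paragraph boundaries come straight from the XML, there is no need to guess them from coordinates, which is the main advantage over the layout-based approach.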

2.3 Chart extraction from PDF/images – The pipeline consists of two modules: range detection and metadata extraction.

Range detection: All page elements (text spans, images) are merged by proximity into fragments, then into lines. Lines are evaluated for validity; blank areas become candidate ranges. Overlapping ranges are split, merged, and filtered based on size, position, and OCR text density, producing final effective ranges (see Figures 4‑6).
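The idea that blank areas between valid text lines become candidate ranges can be illustrated with a deliberately simplified, one-dimensional sketch. It only looks at vertical gaps between line boxes and ignores the splitting, merging, and OCR-density filtering steps; the function name, the `(top, bottom)` representation, and the `min_height` threshold are assumptions made for this example.

```python
def candidate_ranges(page_height, line_boxes, min_height=40.0):
    """Return vertical bands (top, bottom) not covered by any text line.

    line_boxes is a list of (top, bottom) pairs with y growing downward.
    Bands shorter than min_height are ignored as ordinary line spacing.
    """
    ranges = []
    cursor = 0.0  # bottom edge of the content seen so far
    for top, bottom in sorted(line_boxes):
        if top - cursor >= min_height:
            ranges.append((cursor, top))
        cursor = max(cursor, bottom)
    if page_height - cursor >= min_height:
        ranges.append((cursor, page_height))
    return ranges


# Two blocks of text lines with large gaps where figures might sit.
lines = [(30, 42), (46, 58), (300, 312), (316, 328), (770, 782)]
print(candidate_ranges(800, lines))  # [(58, 300), (328, 770)]
```

A real page needs the same reasoning in two dimensions, followed by the size and position filters the article mentions, since a large blank band may simply be a margin.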

Metadata extraction: For each range, the corresponding image is captured and analyzed to determine whether it is a chart. If it is, axis detection (pixel analysis, edge operators) identifies x‑ and y‑axes, scales, and tick marks. Sub‑ranges are OCR‑processed to obtain axis labels and data point values, which are then assembled into structured chart data (see Figures 7‑8).
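One simple form of pixel analysis for axis detection is to look for the longest unbroken dark run in each row and column of a binarized image. The sketch below is an assumption about how such a check could work, not the article's actual algorithm: the image is a plain list of rows with 1 for dark pixels, and `detect_axes`, `longest_run`, and `min_fraction` are hypothetical names. Scales, tick marks, and OCR are not covered.

```python
def longest_run(values):
    """Length of the longest consecutive run of truthy values."""
    best = current = 0
    for v in values:
        current = current + 1 if v else 0
        best = max(best, current)
    return best


def detect_axes(binary, min_fraction=0.6):
    """Return (x_axis_row, y_axis_col), or None if no axis-like lines exist.

    An axis must span at least min_fraction of the image width or height.
    """
    if not binary or not binary[0]:
        return None
    height, width = len(binary), len(binary[0])
    row_runs = [longest_run(row) for row in binary]
    col_runs = [
        longest_run([binary[y][x] for y in range(height)]) for x in range(width)
    ]
    x_axis_row = max(range(height), key=lambda y: row_runs[y])
    y_axis_col = max(range(width), key=lambda x: col_runs[x])
    if row_runs[x_axis_row] < min_fraction * width:
        return None
    if col_runs[y_axis_col] < min_fraction * height:
        return None
    return x_axis_row, y_axis_col


def make_chart_image(size=10):
    """Synthetic test image: a y-axis in column 1 and an x-axis in row size-2."""
    img = [[0] * size for _ in range(size)]
    for y in range(size - 1):
        img[y][1] = 1
    for x in range(1, size):
        img[size - 2][x] = 1
    return img


print(detect_axes(make_chart_image()))  # (8, 1)
blank = [[0] * 10 for _ in range(10)]
print(detect_axes(blank))  # None
```

Returning `None` when no long lines are found doubles as a crude "is this a chart" test, although charts without visible axes, such as pie charts, would need a different check.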

Taken together, this workflow improves document rendering on mobile devices by enabling adaptive flow layout, accurate formula display via LaTeX, and interactive chart data extraction.

Future work focuses on finer‑grained element extraction and richer user interaction capabilities, building on the current document‑wide presentation foundation.

