BabelDOC: A Comprehensive PDF Translation and Bilingual Comparison Library
BabelDOC, accessible at https://funstory-ai.github.io/BabelDOC/, is an open-source library focused on PDF scientific paper translation and bilingual comparison.
Core Functions
- Online and Self-deployment Options: A beta online service, Immersive Translate - BabelDOC, provides 1000 free translation pages per month. For those who prefer self-deployment, PDFMathTranslate 1.9.3+ offers experimental support for BabelDOC, along with a WebUI and more translation services.
- User-Friendly Interfaces: BabelDOC provides a simple command-line interface and a Python API. It can be used directly for basic translation tasks or embedded into other programs.
Installation Methods
- Install from PyPI: Using the uv tool is recommended. After installing uv and setting the PATH environment variable, install BabelDOC with the command:
  uv tool install --python 3.12 BabelDOC
- Install from Source: Clone the GitHub repository, enter the project directory, and use uv to manage the virtual environment and run commands such as:
  uv run babeldoc --help
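The PyPI installation step above can be sketched in Python as a small pre-flight check. This is only an illustration: the helper names build_install_command and missing_tools are hypothetical, and it assumes the installed executable is named babeldoc (as in the uv run babeldoc --help example). Nothing is installed or executed; the script only builds the command and looks for the tools on PATH.

```python
import shutil


def build_install_command(python_version: str = "3.12") -> list[str]:
    """Return the argv for the PyPI install command described above."""
    return ["uv", "tool", "install", "--python", python_version, "BabelDOC"]


def missing_tools(tools=("uv", "babeldoc")) -> list[str]:
    """Return the names from `tools` that cannot be found on PATH.

    Assumption: the installed executable is called `babeldoc`.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Show the command rather than running it, so the sketch has no side effects.
    print("install with:", " ".join(build_install_command()))
    print("not on PATH:", missing_tools())
```

If uv is reported as missing, the PATH environment variable mentioned above is the first thing to check.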
Command-Line Options
There are numerous command-line options for different purposes:
- Language Settings: Options like --lang-in and --lang-out specify the source and target languages. The default source language is English and the default target language is Chinese. Currently, the project mainly focuses on English-to-Chinese translation.
- PDF Processing: Options such as --files, --pages, --split-short-lines, and --skip-clean are available for PDF processing. They select input PDF files, specify pages to translate, handle short lines, and control the PDF cleaning step, respectively.
- Translation Service: Users can choose a translation service with options like --openai. For OpenAI-based translation, additional options like --openai-model, --openai-base-url, and --openai-api-key are required. Other options control aspects like the QPS limit, cache usage, and output types.
- Output Control: Options like --output, --debug, and --report-interval manage the output directory, enable debug logging, and set the progress report interval.
- Offline Assets Management: BabelDOC allows generating and restoring offline assets packages, which are useful for offline environments or for speeding up installations on multiple machines.
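To show how the options listed above fit together, here is a minimal Python sketch that assembles a babeldoc command line from them. The function build_babeldoc_command and its parameters are hypothetical helpers, not part of BabelDOC. The language codes "en" and "zh", the page-range string, the model name, and passing --files once per input file are assumptions made for illustration; check babeldoc --help for the accepted values.

```python
import shlex
from typing import Optional, Sequence


def build_babeldoc_command(
    files: Sequence[str],
    lang_in: str = "en",   # assumed code for the default source language (English)
    lang_out: str = "zh",  # assumed code for the default target language (Chinese)
    pages: Optional[str] = None,
    output: Optional[str] = None,
    debug: bool = False,
    use_openai: bool = False,
    openai_model: Optional[str] = None,
    openai_base_url: Optional[str] = None,
    openai_api_key: Optional[str] = None,
) -> list[str]:
    """Assemble an argv list using only the flags named in the list above."""
    if not files:
        raise ValueError("at least one input PDF is required")

    cmd = ["babeldoc", "--lang-in", lang_in, "--lang-out", lang_out]
    for path in files:
        # Assumption: --files is given once per input file.
        cmd += ["--files", path]
    if pages is not None:
        cmd += ["--pages", pages]
    if output is not None:
        cmd += ["--output", output]
    if debug:
        cmd.append("--debug")
    if use_openai:
        cmd.append("--openai")
        for flag, value in (
            ("--openai-model", openai_model),
            ("--openai-base-url", openai_base_url),
            ("--openai-api-key", openai_api_key),
        ):
            if value is not None:
                cmd += [flag, value]
    return cmd


if __name__ == "__main__":
    cmd = build_babeldoc_command(
        ["paper.pdf"],                    # placeholder file name
        pages="1-3",                      # assumed page-range syntax
        output="out",
        use_openai=True,
        openai_model="gpt-4o-mini",       # placeholder model name
        openai_api_key="sk-placeholder",  # placeholder, not a real key
    )
    # Print the command; pass `cmd` to subprocess.run to actually execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.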
Python API
BabelDOC's Python API does not guarantee compatibility between versions. Until pdf2zh 2.0 is released, it can be used as a temporary alternative. After pdf2zh 2.0 is released, users are recommended to use pdf2zh's Python API instead. The BabelDOC Python API also provides functions for offline assets management.
Background and Goals
BabelDOC aims to promote a standard pipeline and interface for document translation. It offers an intermediate representation of parsed PDF results that can be rendered into new formats. The goal for the 1.0 release is to translate the PDF Reference, Version 1.7 into Simplified Chinese, Traditional Chinese, Japanese, and Spanish, with layout errors below 1% and content loss below 1%.
Known Issues and Future Plans
- Known Issues: Currently, it has issues such as parsing errors in the author and reference sections, lack of line support, inability to handle drop caps, and skipping large pages.
- Roadmap: Future plans include adding line support, table support, cross-page/cross-column paragraph support, more advanced typesetting features, and outline support.
Contribution
Contributions to BabelDOC are encouraged. Active contributors can receive monthly Pro membership redemption codes from Immersive Translation. Everyone involved in the project is expected to follow the YADT Code of Conduct.