Convert EPUB to Plain Text
"EPUB to plain text" is a question that means different things to different people. A researcher dumping books into an LLM wants clean prose with chapter breaks. A grep-fluent developer wants the shortest path from binary blob to searchable string. An accessibility engineer wants the structure preserved so screen readers can traverse it. Same conversion, three different answers. Here's each.
Why people actually extract text from EPUBs
- LLM/AI ingestion. Feed a book into Claude, GPT, or a local model for summarization, analysis, or retrieval. Plain text is the universal input format.
- Search and indexing. grep through your library. Build a full-text index. Tag and cross-reference across a collection.
- Text-to-speech. Some TTS engines accept EPUB, many don't. Plain text always works.
- NLP and corpus research. Word frequency, sentiment, stylometry — all want plain text, and usually want it cleaned.
- Archival. Plain text is the most durable format. A .txt file from 1985 still opens. An .epub from 2026 might not, in 40 years.
Method 1: Pandoc (recommended for most uses)
pandoc book.epub -t plain -o book.txt --wrap=none
Clean, structured, chapter-aware plain text. Headings become text with blank lines around them. Paragraphs stay as paragraphs. Italics and bold disappear (it's plain text, there's no way to represent them). Images are dropped with an alt-text note.
Useful flags:
- --wrap=none: keep each paragraph on one line (easier for most downstream tools)
- --wrap=auto --columns=80: wrap lines at 80 characters (for terminal reading)
- -t plain: use Pandoc's plain writer. Pass it explicitly; given only a .txt output name, Pandoc may infer Markdown from the extension instead
- -t markdown: output Markdown instead of plain text if you want to keep emphasis and structure
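For a whole folder of books, the same flags can be driven from a short script. This is a minimal sketch that assumes pandoc is installed and on PATH; the function names pandoc_plain_command and convert_folder are illustrative, not part of Pandoc.

```python
import subprocess
from pathlib import Path


def pandoc_plain_command(src, dest, columns=None):
    """Build the argument list for a plain-text Pandoc conversion.

    columns=None keeps each paragraph on one line (--wrap=none);
    an integer hard-wraps at that width (--wrap=auto --columns=N).
    """
    cmd = ["pandoc", str(src), "-t", "plain", "-o", str(dest)]
    if columns is None:
        cmd.append("--wrap=none")
    else:
        cmd += ["--wrap=auto", f"--columns={columns}"]
    return cmd


def convert_folder(folder):
    """Convert every .epub in `folder` to a sibling .txt file."""
    for src in sorted(Path(folder).glob("*.epub")):
        # check=True raises CalledProcessError if pandoc exits with an error.
        subprocess.run(pandoc_plain_command(src, src.with_suffix(".txt")), check=True)
```

Keeping the command builder separate from the loop makes it easy to log or dry-run the exact command before touching a large library.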
Method 2: Unzip and strip (fastest, crudest)
EPUBs are ZIP archives of XHTML files. If you just need raw words out of the bytes:
unzip book.epub -d book-unpacked
cat book-unpacked/OEBPS/*.xhtml | sed 's/<[^>]*>//g' > book.txt
This gets you a wall of text. No chapter structure, no order guarantee (file names don't match reading order unless the book's author was tidy), HTML entities left in (&amp;, &quot;), and tons of whitespace.
Clean it up:
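# Strip tags, decode the common entities (&amp; last, so "&amp;lt;" is not decoded twice), squeeze whitespace
cat book-unpacked/OEBPS/*.xhtml \
| sed 's/<[^>]*>//g' \
| sed 's/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g; s/&#39;/'\''/g; s/&amp;/\&/g' \
| tr -s ' \n' \
> book.txt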
Fine for a one-off extraction. Not fine if you need correct reading order — for that, parse content.opf's <spine> element to get files in the right sequence.
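Reading the spine takes only the standard library. The sketch below assumes a conventional EPUB layout: META-INF/container.xml points at the package document, whose manifest maps ids to hrefs and whose spine lists those ids in reading order. The function name spine_documents is illustrative.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_documents(epub):
    """Return the archive paths of the content documents in spine order.

    `epub` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as zf:
        # META-INF/container.xml names the package document (the .opf file).
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).get("full-path")
        opf = ET.fromstring(zf.read(opf_path))

    # Manifest hrefs are relative to the folder that holds the .opf file.
    base = posixpath.dirname(opf_path)
    hrefs = {
        item.get("id"): item.get("href")
        for item in opf.findall("opf:manifest/opf:item", OPF_NS)
    }

    # The spine lists manifest ids in reading order; skip dangling references.
    ordered = []
    for ref in opf.findall("opf:spine/opf:itemref", OPF_NS):
        href = hrefs.get(ref.get("idref"))
        if href is not None:
            ordered.append(posixpath.normpath(posixpath.join(base, unquote(href))))
    return ordered
```

The returned paths can be read from the archive one by one and passed through whatever tag stripper you prefer, so the output follows the book's reading order rather than alphabetical file order.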
Method 3: Calibre (GUI, respects structure)
- Add your EPUB to Calibre
- Right-click → Convert books → output format TXT
- Under TXT Output: set line-ending style and whether to preserve paragraph spacing
- Click OK — output lands in Calibre's library folder
Calibre's TXT output is respectable: chapter breaks preserved, reading order correct (follows the spine), HTML entities resolved. Slower than Pandoc but more forgiving of malformed EPUBs.
Method 4: Python (scriptable, for batches)
pip install ebooklib beautifulsoup4
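from ebooklib import epub, ITEM_DOCUMENT
from bs4 import BeautifulSoup

book = epub.read_epub('book.epub')
text = []
# Each ITEM_DOCUMENT is one XHTML content file; reduce it to its visible text
for item in book.get_items_of_type(ITEM_DOCUMENT):
    soup = BeautifulSoup(item.get_content(), 'html.parser')
    text.append(soup.get_text(separator='\n'))

# Write UTF-8 explicitly so the result does not depend on the platform default
with open('book.txt', 'w', encoding='utf-8') as f:
    f.write('\n\n'.join(text))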
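Use this when you're processing many books in a pipeline, or when you want to filter (skip front matter, strip footnotes, chunk by chapter for RAG, etc.). One caveat on ordering: get_items_of_type walks the book's items in manifest order, which usually matches reading order but is not guaranteed to. When order matters, iterate book.spine and look up each entry with book.get_item_with_id.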
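As one example of such filtering, chunking by chapter for RAG might look like the sketch below. It assumes you already have one text string per chapter and that paragraphs are separated by blank lines, which depends on how the text was extracted. The name chunk_chapters and the 500-word default are illustrative choices.

```python
def chunk_chapters(chapters, max_words=500):
    """Split chapter texts into chunks of at most `max_words` words.

    Chunks never span two chapters, and paragraphs are kept whole; a single
    paragraph longer than the limit becomes its own oversized chunk.
    """
    chunks = []
    for number, chapter in enumerate(chapters, start=1):
        current, count = [], 0
        for para in chapter.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            words = len(para.split())
            # Close the running chunk before it would exceed the limit.
            if current and count + words > max_words:
                chunks.append({"chapter": number, "text": "\n\n".join(current)})
                current, count = [], 0
            current.append(para)
            count += words
        if current:
            chunks.append({"chapter": number, "text": "\n\n".join(current)})
    return chunks
```

Carrying the chapter number with each chunk lets a retrieval system cite where in the book a passage came from.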
What gets lost, and whether it matters
- Bold, italic, underline. Gone. Plain text doesn't have them. Use Markdown output if you need them preserved.
- Tables. Flatten to cell-per-line or tab-separated, depending on converter. Complex tables lose their meaning; simple ones survive.
- Footnotes. Usually inlined at the anchor point, which disrupts reading flow. Strip the markers with a regex if they're in the way.
- Images and figures. Dropped. Alt text may or may not be kept. Check your converter.
- Page numbers, headers, footers. Not usually a problem — EPUB doesn't have real page numbers to extract. See our note on page numbers in EPUB for why.
- Code blocks. Pandoc and Calibre preserve them as-is; unzip+strip flattens indentation.
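For the footnote case, a regex pass could look like the sketch below. The marker shapes are assumptions: bracketed numbers such as [12], a bracketed asterisk, or a run of Unicode superscript digits. Check a sample of your converter's output first, because the same pattern also removes bracketed citations and legitimate superscripts such as the one in "m²".

```python
import re

# Assumed marker shapes: [12], [*], or a run of Unicode superscript digits.
# Leading spaces and tabs are consumed too, so "word [3]." becomes "word.".
FOOTNOTE_MARKER = re.compile(
    r"[ \t]*(?:\[(?:\d{1,3}|\*)\]|[\u00b9\u00b2\u00b3\u2070\u2074-\u2079]+)"
)


def strip_footnote_markers(text):
    """Remove inline footnote markers, leaving the surrounding prose intact."""
    return FOOTNOTE_MARKER.sub("", text)
```

This removes only the markers. The footnote bodies, if the converter emitted them, still need to be dropped separately.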
DRM-encrypted EPUBs
If the file is DRM-protected (Adobe ADEPT, Apple FairPlay, B&N), text extraction fails even though the archive itself unzips: the XHTML inside is encrypted, so you get ciphertext instead of markup. This is a legal/licensing boundary, not a technical puzzle. The working paths: buy a DRM-free copy, use a library service that provides one, or contact the publisher directly about accessibility formats.
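In a batch pipeline it helps to detect such files up front and skip them rather than emit garbage. A minimal sketch, assuming the file declares its encrypted resources in META-INF/encryption.xml as the EPUB container format provides; the function names are illustrative. The file-extension test is a heuristic, because the same manifest is also used for harmless font obfuscation.

```python
import zipfile
import xml.etree.ElementTree as ET

ENC_NS = {"enc": "http://www.w3.org/2001/04/xmlenc#"}


def encrypted_resources(epub):
    """Return the archive paths that META-INF/encryption.xml marks as encrypted.

    An empty list means the file declares no encryption at all.
    """
    with zipfile.ZipFile(epub) as zf:
        if "META-INF/encryption.xml" not in zf.namelist():
            return []
        root = ET.fromstring(zf.read("META-INF/encryption.xml"))
    return [ref.get("URI", "") for ref in root.iterfind(".//enc:CipherReference", ENC_NS)]


def looks_drm_protected(epub):
    """Heuristic: encrypted (X)HTML content suggests DRM, not font obfuscation."""
    return any(
        uri.lower().endswith((".xhtml", ".html", ".htm"))
        for uri in encrypted_resources(epub)
    )
```

A file flagged this way can be logged and set aside, so the rest of the batch continues without it.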
Before extracting: validate the source
A malformed EPUB yields malformed text. If you're extracting at scale and getting garbage results, run the source through the EPUB validator first. Broken spine references, missing manifest entries, or encoding errors all produce weird extraction artifacts that look like extractor bugs but are really source-file bugs.
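A cheap pre-flight check can catch two of the problems named above, missing manifest entries and broken spine references, before a long batch run. This is a rough sketch and not a substitute for a real validator: it assumes all manifest hrefs are local archive paths (remote resources would be reported as missing), and the name quick_structure_check is illustrative.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def quick_structure_check(epub):
    """Return a list of human-readable problems; empty means none were found."""
    problems = []
    with zipfile.ZipFile(epub) as zf:
        names = set(zf.namelist())
        if "META-INF/container.xml" not in names:
            return ["missing META-INF/container.xml"]
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find(".//c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path") if rootfile is not None else None
        if opf_path not in names:
            return [f"package document not found: {opf_path}"]
        opf = ET.fromstring(zf.read(opf_path))

    base = posixpath.dirname(opf_path)
    manifest_ids = set()
    # Every manifest item should exist as a file inside the archive.
    for item in opf.findall("opf:manifest/opf:item", OPF_NS):
        manifest_ids.add(item.get("id"))
        path = posixpath.normpath(posixpath.join(base, unquote(item.get("href", ""))))
        if path not in names:
            problems.append(f"manifest entry missing from archive: {path}")
    # Every spine itemref should point at a manifest id.
    for ref in opf.findall("opf:spine/opf:itemref", OPF_NS):
        if ref.get("idref") not in manifest_ids:
            problems.append(f"spine references unknown id: {ref.get('idref')}")
    return problems
```

Files that return a non-empty list are good candidates for a full validation pass before you blame the extractor.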
Related
- Validate EPUB — check source before extraction
- What is EPUB? — the structure you're reaching into
- EPUB to Word — extraction with formatting preserved
- Calibre vs alternatives — comparing the conversion tools