Why extract plain text from an EPUB?

Common reasons: feeding content into an LLM for analysis, building a searchable index, screen-reader output, text-to-speech, bulk NLP processing, or plain archival. Stripping markup, images, fonts, and stylesheets also shrinks the file considerably when the text is all you need to store.

Will I lose chapter structure?

Depends on the method. Calibre and Pandoc preserve chapter breaks (as extra blank lines or headings). A naive unzip-and-strip loses structure entirely — you get one big wall of text. Pick the method that matches your use case.

Can I extract text from an encrypted EPUB?

No. EPUBs with DRM (Adobe ADEPT, Apple FairPlay) resist plain-text extraction by design. Use the DRM-free source if you have one; otherwise this isn't a technical problem you can work around.

What about footnotes and tables?

Footnotes usually flatten into the body text at their anchor position. Tables lose their grid — Pandoc outputs a text table; most extractors just dump cells separated by tabs or spaces. Neither is great for complex tabular data.

Convert EPUB to Plain Text


"EPUB to plain text" is a question that means different things to different people. A researcher dumping books into an LLM wants clean prose with chapter breaks. A grep-fluent developer wants the shortest path from binary blob to searchable string. An accessibility engineer wants the structure preserved so screen readers can traverse it. Same conversion, three different answers. Here's each.

Why people actually extract text from EPUBs

LLM/AI ingestion. Feed a book into Claude, GPT, or a local model for summarization, analysis, or retrieval. Plain text is the universal input format.
Search and indexing. grep through your library. Build a full-text index. Tag and cross-reference across a collection.
Text-to-speech. Some TTS engines accept EPUB, many don't. Plain text always works.
NLP and corpus research. Word frequency, sentiment, stylometry — all want plain text, and usually want it cleaned.
Archival. Plain text is the most durable format. A .txt file from 1985 still opens. An .epub from 2026 might not, in 40 years.

Method 1: Pandoc (recommended for most uses)

pandoc book.epub -t plain -o book.txt --wrap=none

Clean, structured, chapter-aware plain text. Headings become text with blank lines around them. Paragraphs stay as paragraphs. Italics and bold disappear (it's plain text, there's no way to represent them). Images are dropped with an alt-text note.

Useful flags:

--wrap=none: keep paragraphs on one line (easier for most downstream tools)
--wrap=auto --columns=80: 80-char line wrap (for terminal reading)
-t plain: use Pandoc's plain writer. Pass it explicitly, because Pandoc infers Markdown, not plain text, from a .txt output name
-t markdown: output Markdown instead of plain text if you want to keep emphasis and structure
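For more than a handful of books, the same invocation can be driven from a script. The following is a minimal sketch, assuming pandoc is installed and on PATH; the helper names pandoc_command and convert_folder are hypothetical, not part of Pandoc. It passes -t plain explicitly so the output format does not depend on the file extension.

```python
import subprocess
from pathlib import Path


def pandoc_command(src: Path, dest: Path, wrap: str = "none") -> list[str]:
    """Build the pandoc argument list for one EPUB-to-plain-text conversion."""
    return ["pandoc", str(src), "-t", "plain", f"--wrap={wrap}", "-o", str(dest)]


def convert_folder(folder: Path) -> list[Path]:
    """Convert every .epub in `folder`, writing a .txt file next to each one."""
    outputs = []
    for src in sorted(folder.glob("*.epub")):
        dest = src.with_suffix(".txt")
        # check=True raises CalledProcessError if pandoc exits with an error.
        subprocess.run(pandoc_command(src, dest), check=True)
        outputs.append(dest)
    return outputs
```

Calling convert_folder(Path("library")) would convert each EPUB in a folder named library and return the list of text files written.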

Method 2: Unzip and strip (fastest, crudest)

EPUBs are ZIP archives of XHTML files. If you just need raw words out of the bytes:

unzip book.epub -d book-unpacked
cat book-unpacked/OEBPS/*.xhtml | sed 's/<[^>]*>//g' > book.txt

This gets you a wall of text. No chapter structure, no order guarantee (file names don't match reading order unless the book's author was tidy), HTML entities left undecoded (&amp;, &quot;), and tons of whitespace.

Clean it up:

# Decode &amp; last so that text such as &amp;lt; is not decoded twice.
cat book-unpacked/OEBPS/*.xhtml \
  | sed 's/<[^>]*>//g' \
  | sed 's/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g; s/&#39;/'\''/g; s/&amp;/\&/g' \
  | tr -s ' \n' \
  > book.txt

Fine for a one-off extraction. Not fine if you need correct reading order — for that, parse content.opf's <spine> element to get files in the right sequence.
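Reading the spine takes only the standard library. The sketch below assumes a well-formed EPUB: it follows META-INF/container.xml to the package document, maps manifest ids to file paths, and returns those paths in spine order. The function name spine_documents is hypothetical.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub_file) -> list[str]:
    """Return archive paths of the content documents in spine (reading) order."""
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml names the package document (usually content.opf).
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).attrib["full-path"]
        package = ET.fromstring(zf.read(opf_path))

    # Manifest hrefs are relative to the package document and may be URL-encoded.
    hrefs = {
        item.attrib["id"]: unquote(item.attrib["href"])
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    base = posixpath.dirname(opf_path)
    return [
        posixpath.normpath(posixpath.join(base, hrefs[ref.attrib["idref"]]))
        for ref in package.findall("opf:spine/opf:itemref", NS)
        if ref.attrib.get("idref") in hrefs  # skip dangling spine entries
    ]
```

The returned paths can be read from the archive one by one and passed through the same tag-stripping step, which restores reading order without depending on file names.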

Method 3: Calibre (GUI, respects structure)

Add your EPUB to Calibre
Right-click → Convert books → output format TXT
Under TXT Output: set line-ending style and whether to preserve paragraph spacing
Click OK — output lands in Calibre's library folder

Calibre's TXT output is respectable: chapter breaks preserved, reading order correct (follows the spine), HTML entities resolved. Slower than Pandoc but more forgiving of malformed EPUBs.

Method 4: Python (scriptable, for batches)

pip install ebooklib beautifulsoup4

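import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

book = epub.read_epub('book.epub')
text = []
# Walk the spine so chapters come out in reading order, not manifest order.
for idref, _linear in book.spine:
    item = book.get_item_with_id(idref)
    if item is None or item.get_type() != ebooklib.ITEM_DOCUMENT:
        continue
    soup = BeautifulSoup(item.get_content(), 'html.parser')
    text.append(soup.get_text(separator='\n'))

# Write UTF-8 explicitly; the platform default encoding may not cover every character.
with open('book.txt', 'w', encoding='utf-8') as f:
    f.write('\n\n'.join(text))

Use this when you're processing many books in a pipeline, or when you want to filter (skip front matter, strip footnotes, chunk by chapter for RAG, etc.). The loop walks book.spine rather than the manifest, so chapters come out in reading order.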
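As an illustration of the chunk-by-chapter case, here is a minimal sketch that splits a list of chapter strings into retrieval-sized pieces. It assumes paragraphs are separated by blank lines, which may need adjusting for your extractor's output; the function name chunk_chapters and the default size are hypothetical.

```python
def chunk_chapters(chapters: list[str], max_chars: int = 2000) -> list[str]:
    """Split each chapter into chunks of roughly max_chars, breaking on blank lines."""
    chunks = []
    for chapter in chapters:
        current = ""
        for para in (p.strip() for p in chapter.split("\n\n")):
            if not para:
                continue
            # Start a new chunk when adding this paragraph would exceed the budget.
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = ""
            current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)  # flush at chapter end; chunks never span chapters
    return chunks
```

A single paragraph longer than max_chars is kept whole as its own oversized chunk rather than cut mid-sentence, which is a design choice you may want to change for strict token limits.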

What gets lost, and whether it matters

Bold, italic, underline. Gone. Plain text doesn't have them. Use Markdown output if you need them preserved.
Tables. Flatten to cell-per-line or tab-separated, depending on converter. Complex tables lose their meaning; simple ones survive.
Footnotes. Usually inline at the anchor point, which disrupts reading flow. Strip with a regex if they're in the way.
Images and figures. Dropped. Alt text may or may not be kept. Check your converter.
Page numbers, headers, footers. Not usually a problem — EPUB doesn't have real page numbers to extract. See our note on page numbers in EPUB for why.
Code blocks. Pandoc and Calibre preserve them as-is; unzip+strip flattens indentation.
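For the footnote case, a regex pass over the extracted text is often enough. The sketch below assumes the markers survive as bracketed numbers attached to the preceding word, such as "word[12]"; that marker style is an assumption, so inspect your own output and adjust the pattern before relying on it.

```python
import re

# Assumed marker style: a 1-3 digit number in square brackets, directly after
# a non-space character. Bracketed numbers that follow a space are left alone.
FOOTNOTE_MARKER = re.compile(r"(?<=\S)\[\d{1,3}\]")


def strip_footnote_markers(text: str) -> str:
    """Remove bracketed numeric footnote markers such as 'word[12]'."""
    return FOOTNOTE_MARKER.sub("", text)
```

This removes only the markers. Footnote bodies that were flattened into the text need a book-specific rule, or filtering at the HTML stage before the text is extracted.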

DRM-encrypted EPUBs

If the file is DRM-protected (Adobe ADEPT, Apple FairPlay, B&N), the archive itself still unzips, but the XHTML inside is encrypted, so every method above produces garbage or fails outright. This is a legal/licensing boundary, not a technical puzzle. The working paths: buy a DRM-free copy, use a library service that provides one, or contact the publisher directly about accessibility formats.
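Whether a file is affected can usually be checked before any extraction attempt, because the EPUB container format lists encrypted resources in META-INF/encryption.xml. The sketch below only detects; it does not decrypt anything. It ignores the two font-obfuscation algorithms, which use the same file but leave the text readable. The function name encrypted_resources is hypothetical, and some DRM schemes may signal protection in other ways.

```python
import zipfile
import xml.etree.ElementTree as ET

ENC = "{http://www.w3.org/2001/04/xmlenc#}"
# Font obfuscation is declared in encryption.xml too, but does not touch the text.
FONT_OBFUSCATION = {
    "http://www.idpf.org/2008/embedding",
    "http://ns.adobe.com/pdf/enc#RC",
}


def encrypted_resources(epub_file) -> list[str]:
    """List archive paths that encryption.xml marks as encrypted (fonts excluded)."""
    with zipfile.ZipFile(epub_file) as zf:
        if "META-INF/encryption.xml" not in zf.namelist():
            return []
        root = ET.fromstring(zf.read("META-INF/encryption.xml"))

    paths = []
    for data in root.iter(f"{ENC}EncryptedData"):
        method = data.find(f"{ENC}EncryptionMethod")
        ref = data.find(f"{ENC}CipherData/{ENC}CipherReference")
        algorithm = method.get("Algorithm") if method is not None else None
        if ref is not None and algorithm not in FONT_OBFUSCATION:
            paths.append(ref.get("URI"))
    return paths
```

A non-empty result that includes the book's XHTML files is a strong sign that text extraction will not work on that copy.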

Before extracting: validate the source

A malformed EPUB yields malformed text. If you're extracting at scale and getting garbage results, run the source through the EPUB validator first. Broken spine references, missing manifest entries, or encoding errors all produce weird extraction artifacts that look like extractor bugs but are really source-file bugs.
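Two of those failure modes, broken spine references and manifest entries with no file behind them, are cheap to screen for in a batch pipeline. The following is a rough pre-flight sketch, not a substitute for a full validator; the function name preflight and its message wording are hypothetical.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

OPF = "{http://www.idpf.org/2007/opf}"
CONTAINER = "{urn:oasis:names:tc:opendocument:xmlns:container}"


def preflight(epub_file) -> list[str]:
    """Return human-readable problems; an empty list means these checks passed."""
    problems = []
    with zipfile.ZipFile(epub_file) as zf:
        names = set(zf.namelist())
        if "META-INF/container.xml" not in names:
            return ["missing META-INF/container.xml"]
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find(f".//{CONTAINER}rootfile")
        opf_path = rootfile.get("full-path") if rootfile is not None else None
        if opf_path not in names:
            return [f"package document not found: {opf_path}"]
        package = ET.fromstring(zf.read(opf_path))

    base = posixpath.dirname(opf_path)
    manifest = {}
    for item in package.iter(f"{OPF}item"):
        href = item.get("href", "")
        manifest[item.get("id")] = href
        # Manifest hrefs are relative to the package document.
        path = posixpath.normpath(posixpath.join(base, unquote(href)))
        if path not in names:
            problems.append(f"manifest entry missing from archive: {path}")
    for ref in package.iter(f"{OPF}itemref"):
        if ref.get("idref") not in manifest:
            problems.append(f"spine references unknown id: {ref.get('idref')}")
    return problems
```

Running this over a folder first makes it easier to set aside the files whose odd output comes from the source rather than the extractor. Note that it would also flag manifest items that point at remote resources, which some EPUBs legitimately contain.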

