
MarkItDown: read this before you install it

MarkItDown is useful when you need documents turned into text that agents can read, but I would not treat the converted output as ground truth. Before putting it into a RAG pipeline, I test one simple file, one ugly PDF, and the downstream chunking step I actually expect to run.

Project source: microsoft/markitdown
Author / organization: Microsoft
This page is a private experience note, not official documentation.


Do not trust conversion blindly

I would not start with MarkItDown by pointing it at a folder of important documents. I start with three files: one simple PDF, one messy PDF, and one Office document. If the ugly file fails, I want to know before the pipeline has hundreds of converted files.

The install choice matters because optional dependencies decide what formats work. `markitdown[all]` is convenient for testing, but in production I prefer only the extras I need.

I also decide what “good enough” means. For RAG, perfect Markdown is not required, but missing headings, broken tables, or merged columns can ruin retrieval.
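To make "good enough" concrete, I find it helps to count the things I care about. The sketch below is my own heuristic, not part of MarkItDown: the function name `markdown_quality_report` and the rule for a "ragged" table row are assumptions, and it ignores escaped pipes inside cells.

```python
import re


def markdown_quality_report(markdown_text: str) -> dict:
    """Count headings and table rows whose column count differs from the table's first row."""
    heading_count = 0
    ragged_table_rows = 0
    expected_cols = None  # column count of the table currently being read

    for line in markdown_text.splitlines():
        stripped = line.strip()
        # ATX headings: one to six '#' characters followed by text.
        if re.match(r"#{1,6}\s+\S", stripped):
            heading_count += 1
        # Pipe-table rows: start and end with '|'.
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cols = len(stripped[1:-1].split("|"))
            if expected_cols is None:
                expected_cols = cols
            elif cols != expected_cols:
                ragged_table_rows += 1
        else:
            expected_cols = None  # any non-table line ends the current table

    return {"headings": heading_count, "ragged_table_rows": ragged_table_rows}
```

A file with zero headings or several ragged rows is not automatically bad, but it is the one I open by hand first.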

When Markdown conversion is enough

MarkItDown fits when the goal is to turn common files into Markdown-like text for indexing, review, or LLM input. It is a helper in a pipeline, not the whole document-understanding system.

I would not use it as a legal-grade or layout-perfect extractor. Complex PDFs, scans, tables, and diagrams need separate evaluation or OCR/layout tools.

My fit check is whether downstream search improves after conversion. If converted text is easy to chunk and cite, the tool earns its place.
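"Easy to chunk and cite" can be tested with a very small chunker. This is a minimal sketch, assuming heading-based chunks are what the downstream indexer wants. `chunk_by_heading` is a hypothetical helper of mine, and it would also treat a `#` line inside a code block as a heading.

```python
import re


def chunk_by_heading(markdown_text: str, source: str) -> list:
    """Split Markdown at ATX headings and tag every chunk with its source file."""
    sections = [("", [])]  # text before the first heading gets an empty heading
    for line in markdown_text.splitlines():
        match = re.match(r"#{1,6}\s+(.*)", line)
        if match:
            sections.append((match.group(1).strip(), []))
        else:
            sections[-1][1].append(line)

    chunks = []
    for heading, lines in sections:
        text = "\n".join(lines).strip()
        if text:  # skip headings that have no body text
            chunks.append({"source": source, "heading": heading, "text": text})
    return chunks
```

If this produces one giant chunk for a long document, the conversion lost the headings and retrieval will likely suffer.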

Input files, converters, optional deps, and output quality

I picture MarkItDown as four stages: file detection, converter backend, Markdown output, and downstream consumer. The downstream consumer matters because a human reading Markdown and a vector indexer need different quality checks.

Optional dependencies are part of the architecture. If one environment has PDF support and another does not, the same command can behave differently.
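One way I would catch environment drift is to check, at startup, whether the modules I rely on are importable. The sketch below uses only the standard library. The format-to-module mapping in `ASSUMED_MODULES` is my assumption about which packages back each format, so verify it against the extras actually installed in your environment.

```python
from importlib.util import find_spec

# Assumption: top-level module names I expect behind each format.
# Adjust this to match the extras you install.
ASSUMED_MODULES = {
    "pdf": ["pdfminer"],
    "docx": ["mammoth"],
    "xlsx": ["openpyxl"],
}


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in module_names if find_spec(name) is None]


def missing_by_format(formats, mapping=ASSUMED_MODULES):
    """Map each requested format to the modules it lacks in this environment."""
    report = {}
    for fmt in formats:
        missing = missing_modules(mapping.get(fmt, []))
        if missing:
            report[fmt] = missing
    return report
```

Running `missing_by_format(["pdf", "docx"])` in both the development and the production environment shows quickly whether the two can be expected to behave the same.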

I also watch filenames and metadata. Once documents become Markdown, I still need source references for citations and debugging.

Test the ugly document early

My setup path is a clean venv, an install of `markitdown[all]` for exploration, one file converted with the CLI, and a manual inspection of the output before automating anything.

For a real pipeline, I narrow dependencies. If I only need PDF and DOCX, I install just those extras (for example `markitdown[pdf,docx]`) rather than carrying everything blindly.

I create a `samples/` folder with files that represent the worst documents I expect. That folder becomes the regression test whenever I upgrade the converter.
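The regression idea can be as simple as comparing new output against the Markdown I reviewed last time. A minimal sketch, assuming the reviewed and current outputs are already loaded into dicts keyed by filename. The 0.95 threshold is an arbitrary starting point of mine, not a recommended value.

```python
import difflib


def similarity(baseline_text: str, new_text: str) -> float:
    """Similarity ratio between two texts, from 0.0 (unrelated) to 1.0 (identical)."""
    return difflib.SequenceMatcher(None, baseline_text, new_text).ratio()


def regressions(baselines: dict, current: dict, threshold: float = 0.95) -> list:
    """Return the names whose current output is missing or drifted below the threshold."""
    flagged = []
    for name, baseline_text in sorted(baselines.items()):
        new_text = current.get(name)
        if new_text is None or similarity(baseline_text, new_text) < threshold:
            flagged.append(name)
    return flagged
```

A flagged file is not necessarily worse after an upgrade, only different enough that I want to read it again.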

My MarkItDown command path

Use the prep panel (the "# MarkItDown prep" commands below) before bulk conversion. It checks Python, the installed extras, and the sample files. If the sample set is not representative, your first success will be misleading.

Use the verify panel (the "# MarkItDown verify" commands) after converting each file type. Open the Markdown and check headings, tables, page breaks, and source references before indexing it.

Use the debug panel (the "# MarkItDown debug" list) when RAG answers are bad. Go back to the converted Markdown before tuning embeddings or prompts. Bad extraction creates bad retrieval.

When the Markdown looks clean but loses meaning

If conversion fails, I check whether the needed optional dependency is installed. A base install may not support the format I assumed.

If output looks clean but answers are bad, I compare the Markdown against the original document. Tables and multi-column layouts are common trouble spots.

If a PDF is scanned, I stop expecting text extraction to work like magic. A scan is usually page images with no text layer, so it needs OCR or a different document pipeline first.
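A cheap way to spot this case is to flag outputs that contain almost no real text. This is a rough heuristic of my own, and the `min_chars` default of 200 is an assumption that should be tuned against the sample files.

```python
def looks_like_empty_extraction(markdown_text: str, min_chars: int = 200) -> bool:
    """Return True when the converted text has fewer letters and digits than min_chars."""
    meaningful = sum(1 for ch in markdown_text if ch.isalnum())
    return meaningful < min_chars
```

Files flagged here go to the OCR path instead of the index.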

The first conversion pipeline I would keep

The first pipeline I would keep converts five representative files into Markdown, stores the original filename in front matter, and indexes only after manual review.
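Storing the filename in front matter takes only a few lines. A minimal sketch: the key name `source_file` is my own choice, and the value is written with `json.dumps` so that quotes or odd characters in a filename stay valid as a YAML double-quoted string.

```python
import json
from pathlib import Path


def with_front_matter(markdown_text: str, source_path: str) -> str:
    """Prepend a YAML front matter block that records the original filename."""
    name = Path(source_path).name
    header = f"---\nsource_file: {json.dumps(name)}\n---\n\n"
    return header + markdown_text
```

For example, `with_front_matter("# Hi\n", "samples/ugly.pdf")` yields a document that starts with `source_file: "ugly.pdf"` between two `---` lines.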

Then I ask one retrieval question per file and verify whether the answer points back to the right source.
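The check itself does not need the real retriever to be useful as a smoke test. The sketch below stands in a naive word-overlap ranking for whatever search the pipeline actually uses. It is only meant to show the shape of the assertion, and the chunk texts in the usage example are invented.

```python
import re


def tokens(text: str) -> set:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_source(question: str, chunks: list) -> str:
    """Return the source name of the chunk that shares the most words with the question.

    chunks is a list of (source_name, chunk_text) pairs.
    """
    question_words = tokens(question)
    source, _text = max(chunks, key=lambda chunk: len(question_words & tokens(chunk[1])))
    return source


# Hypothetical usage: one question per file, with the source I expect back.
example_chunks = [
    ("simple.pdf", "Quarterly revenue grew by ten percent"),
    ("ugly.pdf", "The warranty period is two years from purchase"),
]
assert best_source("How long is the warranty period?", example_chunks) == "ugly.pdf"
```

With a real index, I would keep the same structure and swap `best_source` for the actual retrieval call.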

Only after that do I automate batch conversion. MarkItDown is valuable when it makes documents inspectable, not when it hides extraction mistakes.
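When I do automate, I want failures recorded per file rather than hidden. A minimal sketch under two assumptions. First, the Python API call `MarkItDown().convert(path).text_content` follows the project README as I remember it, so check it against your installed version. Second, `convert_folder` and the `.md` naming scheme are my own.

```python
from pathlib import Path


def markitdown_convert(path) -> str:
    """Convert one file with the MarkItDown Python API (needs the package installed)."""
    from markitdown import MarkItDown  # imported lazily so the helper below works without it

    return MarkItDown().convert(str(path)).text_content


def convert_folder(sample_dir, out_dir, convert=markitdown_convert) -> dict:
    """Convert every file in sample_dir and return a per-file status report."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for src in sorted(Path(sample_dir).iterdir()):
        if not src.is_file():
            continue
        try:
            text = convert(src)
        except Exception as exc:  # record the failure and keep going
            results[src.name] = f"FAILED: {exc}"
            continue
        # Keep the full original name so sample.pdf and sample.docx do not collide.
        (out / (src.name + ".md")).write_text(text, encoding="utf-8")
        results[src.name] = "ok"
    return results
```

Passing the converter in as an argument also makes it possible to test the loop with a fake converter before any real documents are involved.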

How I would use the command panel

Use the MarkItDown commands by file type

sample set — Before batch conversion, prepare tiny samples for the exact file types you need: PDF, DOCX, PPTX, XLSX, HTML, image, or audio.

one format each — Convert one file type at a time and inspect headings, tables, links, and extracted text before trusting a whole folder.

dependency by format — When conversion fails, check the specific optional dependency for that format, then reduce to one file and compare expected text to output.

Field commands I would keep beside this note

# MarkItDown prep

python --version
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install "markitdown[all]"
markitdown --help

# MarkItDown verify

mkdir -p samples converted
markitdown samples/simple.pdf -o converted/simple.md
markitdown samples/ugly.pdf -o converted/ugly.md
markitdown samples/sample.docx -o converted/sample.md

# inspect output before indexing
sed -n "1,80p" converted/ugly.md

# MarkItDown debug

format fails -> check optional extras
bad table -> compare original vs markdown
scanned PDF -> use OCR/layout tool first
RAG answer bad -> inspect converted markdown before prompt tuning
missing citation -> preserve source filename metadata