Hi everyone! 👋

I’ve been testing different ways to extract tables from PDFs for a finance project. Here is a quick breakdown of the current landscape for 2026:

1. Traditional Libraries (Python)

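  • PyPDF2 (now maintained as pypdf) / PDFMiner: Good for simple text, but they only read the PDF’s embedded text layer, so they get nothing useful from scanned documents and lose the row and column structure of complex tables.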
  • Tabula: Still decent for simple tables in text-based PDFs (it does not handle scans), but it hasn’t been updated much.
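To show why text-only extraction struggles with tables, here is a minimal sketch. It assumes the extractor hands back whitespace-aligned lines; the sample rows and the `split_columns` helper are made up for illustration and are not part of any of the libraries above.

```python
import re


def split_columns(line: str) -> list[str]:
    """Split one line of extracted text into cells on runs of 2+ whitespace characters."""
    return re.split(r"\s{2,}", line.strip())


# Hypothetical lines, roughly as a text-layer extractor might return them.
header = "Date        Description      Debit     Credit"
full_row = "2026-01-05  Office rent      1200.00   0.00"
sparse_row = "2026-01-06  Client payment             850.00"  # Debit cell is empty

print(split_columns(header))      # 4 cells
print(split_columns(full_row))    # 4 cells
print(split_columns(sparse_row))  # only 3 cells: 850.00 lands under "Debit"
```

The empty cell simply disappears, so every value after it shifts one column to the left. Recovering the right column needs character positions or ruling lines, which is what table-aware tools try to use.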

2. OCR Engines

  • Tesseract: Open source and free, but it needs a lot of image pre-processing (deskewing, denoising, binarization) to get good results, and it is hard to set up.
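As an example of what that image cleaning involves, here is a rough sketch of one common step, binarization with Otsu's threshold. It is written in plain Python on a flat list of grayscale values (0 to 255) so it runs without extra packages; in a real pipeline you would more likely do this with an imaging library. The function names `otsu_threshold` and `binarize` are my own.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of their values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break  # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels: list[int], t: int) -> list[int]:
    """Map each pixel to pure black (0) or pure white (255)."""
    return [0 if p <= t else 255 for p in pixels]


# Toy "scan": dark ink values and bright paper values.
scan = [30] * 5 + [40] * 5 + [200] * 5 + [220] * 5
t = otsu_threshold(scan)
clean = binarize(scan, t)
```

On a real scan this is only one of several steps; skewed or noisy pages usually need more work before the OCR output is usable.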

3. LLM & AI Tools (The new standard)

  • ParserData: Specialized in invoices and bank statements. In my tests it reconstructed tables accurately, even from scans.
  • LlamaParse: Good for RAG pipelines.

My conclusion: If you are building a production pipeline, stop writing regex parsers by hand. Using an AI API saves hours of debugging.
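To make the regex point concrete, here is a small sketch of the kind of parser I mean. The statement layout, the `ROW` pattern and the `parse_row` helper are all hypothetical; the idea is just that every new amount format means another patch.

```python
import re
from decimal import Decimal

# Hypothetical layout: DD/MM/YYYY, a description, then a plain amount.
ROW = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?\d+\.\d{2})$")


def parse_row(line: str):
    """Return (date, description, amount), or None if the line does not match."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    date, description, amount = m.groups()
    return date, description, Decimal(amount)


# Works for the format the pattern was written for.
ok = parse_row("03/01/2026  Card payment ACME  -42.50")

# Silently skipped: thousands separator.
missed_comma = parse_row("04/01/2026  Salary  1,250.00")

# Silently skipped: accounting-style negative in parentheses.
missed_parens = parse_row("05/01/2026  Refund  (15.00)")
```

Both misses return `None` without raising an error, so rows just vanish from the output unless you add checks for them.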

What tools are you using in your workflows right now?