fix pdf to excel for borderless table

listyantidewi1
2026-05-01 17:52:36 +07:00
parent 705c6b4fd0
commit df12bc5f38
4 changed files with 113 additions and 10 deletions
+10
@@ -2,6 +2,15 @@
All notable changes to **Your Everyday Tools** are documented here. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project loosely follows [Semantic Versioning](https://semver.org/).
## [0.6.2] — 2026-04-29
### Fixed
- **PDF to Excel: now finds borderless tables.** Users were reporting that the same PDF returned "no tables found" in PDF→Excel but PDF→Word (Layout mode) successfully extracted tables. Root cause: PyMuPDF's `find_tables()` defaults to `strategy="lines"` which only detects tables with visible borders, while `pdf2docx` (used by PDF→Word) detects both ruled and borderless tables. PDF→Excel now exposes a **table detection strategy** option:
- **Auto** *(default)* — tries lines first, falls back to text-alignment if no ruled tables are found on a page. Text detection never runs on pages that have ruled tables, so the false-positive risk on multi-column body text is limited to pages without them.
- **Lines only** — original behavior, conservative.
- **Text alignment only** — for borderless tables (financial reports, invoices, schedules).
- The "no tables found" error message now suggests the alternate strategy or directs users to PDF→Word in Layout mode if even text-strategy detection fails.
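The Auto fallback described in this entry can be sketched as a small function. This is a minimal illustration under stated assumptions, not the project's code: `detect_tables`, `VALID_STRATEGIES`, and the `find` callable are hypothetical names, with `find` standing in for a thin wrapper around PyMuPDF's `page.find_tables(strategy=...)`.

```python
from typing import Callable, List, Sequence

VALID_STRATEGIES = ("auto", "lines", "text")


def detect_tables(find: Callable[[str], Sequence], strategy: str = "auto") -> List:
    """Return the tables detected on one page.

    `find` is a hypothetical callable that takes "lines" or "text" and
    returns the tables found with that strategy.
    """
    if strategy not in VALID_STRATEGIES:
        strategy = "auto"  # unknown values fall back to the default
    if strategy in ("lines", "text"):
        return list(find(strategy))
    # auto: prefer ruled tables; use text alignment only when none are found
    tables = list(find("lines"))
    return tables if tables else list(find("text"))


# Stand-in for a page that contains only a borderless table.
def fake_find(strategy: str) -> list:
    return ["borderless table"] if strategy == "text" else []
```

With the stand-in page, `detect_tables(fake_find, "lines")` finds nothing, while `"auto"` reaches the text fallback and returns the borderless table.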
## [0.6.1] — 2026-04-29
### Added
@@ -9,6 +18,7 @@ All notable changes to **Your Everyday Tools** are documented here. The format i
### Improved
- **Fill PDF Form: human radio/checkbox labels.** PDF radio buttons store opaque on-state values (often `0`/`1`/`Yes`/arbitrary identifiers) but the human label like "Male" / "Female" is painted on the page as static text *next to* the widget — not part of the field. Form Filler now sniffs that nearby text and shows the human label in the UI, while keeping the PDF on-state value as the actual submitted value (and as a tooltip for power users). Same for checkbox labels. The sniffer correctly handles vertical lists, horizontal rows ("○ Male ○ Female"), and multi-word labels ("I agree to the terms and conditions"), stopping at gaps > 25pt to avoid grabbing the next widget's label.
- **Fill PDF Form: editable comboboxes.** PDF combobox fields can be either strict (only the listed choices are accepted) or editable (user can type a custom value not in the list — bit 19 of the field flags). Form Filler now detects this flag and renders editable comboboxes as a free-text input with the listed choices offered as suggestions via `<datalist>`, while strict comboboxes remain `<select>` dropdowns. Both render with a small hint explaining the constraint. Custom values typed into editable fields are written into the PDF correctly.
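The gap rule for label sniffing can be illustrated with a short sketch. It assumes the words on the widget's text line are already available as `(x0, x1, text)` tuples in points; `sniff_label`, `Word`, and `MAX_GAP_PT` are hypothetical names, and the real sniffer may measure gaps differently.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]  # (x0, x1, text) on the widget's text line

MAX_GAP_PT = 25.0  # gap threshold quoted in the changelog entry


def sniff_label(widget_x1: float, words: List[Word],
                max_gap: float = MAX_GAP_PT) -> str:
    """Join the words right of a widget until a gap exceeds max_gap."""
    label = []
    cursor = widget_x1  # right edge of the last accepted item
    for x0, x1, text in sorted(words):
        if x1 <= widget_x1:
            continue  # word lies entirely left of the widget
        if x0 - cursor > max_gap:
            break  # large gap: the next widget's label starts here
        label.append(text)
        cursor = x1
    return " ".join(label)
```

For a horizontal row such as "○ Male ○ Female", the wide gap between "Male" and "Female" ends the first label, while closely spaced words in a multi-word label are joined.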
## [0.6.0] — 2026-04-29
+68 -9
@@ -243,9 +243,23 @@ def pdf_to_excel_page():
title="PDF to Excel",
description="Extract tables from a PDF into an .xlsx workbook",
notes=(
-"<p><strong>Tip:</strong> works best on PDFs with clearly ruled tables. "
-"For scanned PDFs (images of tables), run them through "
-"<a href=\"/convert/ocr-pdf\">OCR PDF</a> first so the tool has text to work with.</p>"
+"<p><strong>How table detection works:</strong> we try both detection strategies in "
+"order of accuracy:</p>"
+"<ul style='margin:.4rem 0 .6rem 1.2rem'>"
+"<li><strong>Auto (recommended)</strong> — tries ruled-line detection first; if a "
+"page has no visible table borders, falls back to text-alignment detection (catches "
+"borderless tables in financial reports, invoices, schedules).</li>"
+"<li><strong>Lines only</strong> — only tables with visible borders. Most accurate "
+"but misses borderless tables.</li>"
+"<li><strong>Text alignment only</strong> — finds tables by detecting columns of "
+"aligned text. Catches borderless tables but can occasionally false-positive on "
+"multi-column body text.</li>"
+"</ul>"
+"<p style='font-size:.9em;color:var(--muted)'><strong>Still get \"no tables found\"?</strong> "
+"Try our <a href='/convert/pdf-to-word'>PDF to Word</a> tool in <em>Layout</em> mode "
+"instead — it uses <code>pdf2docx</code> which is more aggressive about table "
+"detection. If your PDF is scanned, run it through "
+"<a href='/convert/ocr-pdf'>OCR PDF</a> first.</p>"
),
endpoint="/convert/pdf-to-excel",
accept=".pdf",
@@ -253,6 +267,12 @@ def pdf_to_excel_page():
options=[
{"type": "text", "name": "pages", "label": "Pages (leave empty for all)",
"placeholder": "e.g. 1-3, 5"},
{"type": "select", "name": "strategy", "label": "Table detection strategy", "default": "auto",
"choices": [
{"value": "auto", "label": "Auto — lines first, fall back to text alignment"},
{"value": "lines", "label": "Lines only (ruled tables)"},
{"value": "text", "label": "Text alignment only (borderless tables)"},
]},
{"type": "select", "name": "mode", "label": "Extraction mode", "default": "tables",
"choices": [
{"value": "tables", "label": "Tables only (recommended)"},
@@ -952,6 +972,9 @@ def pdf_to_excel():
mode = request.form.get("mode", "tables")
organize = request.form.get("organize", "per_table")
strategy = request.form.get("strategy", "auto")
if strategy not in ("auto", "lines", "text"):
strategy = "auto"
pages_spec = request.form.get("pages", "").strip()
try:
@@ -1006,6 +1029,38 @@ def pdf_to_excel():
rows.append(parts if parts else [line])
return rows
def _find_tables_robust(page) -> list:
    """Detect tables on a page according to the user's chosen strategy.

    PyMuPDF's default `find_tables()` only catches ruled (visible-border)
    tables. Many real-world PDFs use borderless tables where columns are
    aligned by whitespace — those need `strategy="text"`. The "auto" mode
    tries lines first and only falls back to text-based when nothing is
    found, which avoids the false-positive risk of text-detection picking
    up multi-column body text as a "table".
    """
    try:
        if strategy == "lines":
            return list(page.find_tables(strategy="lines"))
        if strategy == "text":
            return list(page.find_tables(
                strategy="text",
                vertical_strategy="text",
                horizontal_strategy="text",
            ))
        # auto: lines, then text fallback
        tables = list(page.find_tables(strategy="lines"))
        if tables:
            return tables
        return list(page.find_tables(
            strategy="text",
            vertical_strategy="text",
            horizontal_strategy="text",
        ))
    except Exception as e:
        log_error(e, f"find_tables strategy={strategy}")
        return []
# ── "combined" — stream everything into a single sheet ────────────
if organize == "combined":
ws = wb.create_sheet(_safe_name("Extracted"))
@@ -1015,7 +1070,7 @@ def pdf_to_excel():
page_had_content = False
if mode in ("tables", "tables_text"):
-tables = list(page.find_tables())
+tables = _find_tables_robust(page)
for t in tables:
rows = t.extract()
if not rows:
@@ -1043,7 +1098,7 @@ def pdf_to_excel():
tables_rows = [] # list of (label, rows)
if mode in ("tables", "tables_text"):
-for tidx, t in enumerate(page.find_tables(), start=1):
+for tidx, t in enumerate(_find_tables_robust(page), start=1):
rows = t.extract()
if rows:
tables_rows.append((f"Table {tidx}", rows))
@@ -1076,10 +1131,14 @@ def pdf_to_excel():
doc.close()
if not wb.sheetnames:
-return jsonify(error=(
-"No tables or text found on the selected pages. "
-"If this is a scanned PDF, run it through OCR PDF first."
-)), 400
+msg = "No tables found on the selected pages."
+if strategy == "lines":
+msg += " Try the 'Text alignment' or 'Auto' strategy — your PDF may use borderless tables."
+elif mode == "tables":
+msg += " Try the 'Tables, fall back to text rows' mode, or use PDF to Word in Layout mode."
+else:
+msg += " If this is a scanned PDF, run it through OCR PDF first; otherwise try PDF to Word in Layout mode."
+return jsonify(error=msg), 400
# Auto-size columns on every sheet (cap at 60 chars to avoid absurd widths)
for ws in wb.worksheets:
+4
@@ -946,6 +946,9 @@ def _serialize_widgets(doc) -> list[dict]:
required = bool(flags & 2) # bit 2 = required
readonly = bool(flags & 1) # bit 1 = read-only
multiline = bool(flags & (1 << 12)) # bit 13 = multiline (text only)
# PDF spec bit 19 (Ff 1<<18) = combobox is editable (user can type
# values outside the choice list). Set only on combo fields.
editable_combo = (ftype == "combobox") and bool(flags & (1 << 18))
# Choice fields expose `choice_values`; treat None as empty list
choices = list(w.choice_values or []) if hasattr(w, "choice_values") else []
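The flag tests in this hunk follow the PDF convention of numbering bits from 1, so "bit 19" corresponds to the mask `1 << 18`. A minimal standalone sketch of the same decoding, with hypothetical names (`decode_field_flags` and the `FF_*` constants are not part of the project or of PyMuPDF):

```python
FF_READONLY = 1 << 0    # bit 1: read-only
FF_REQUIRED = 1 << 1    # bit 2: required
FF_MULTILINE = 1 << 12  # bit 13: multiline (text fields)
FF_EDIT = 1 << 18       # bit 19: editable combobox


def decode_field_flags(flags: int, ftype: str) -> dict:
    """Translate a raw field-flags integer into named booleans."""
    return {
        "readonly": bool(flags & FF_READONLY),
        "required": bool(flags & FF_REQUIRED),
        "multiline": bool(flags & FF_MULTILINE),
        # The Edit bit is only meaningful on combobox fields.
        "editable": ftype == "combobox" and bool(flags & FF_EDIT),
    }
```

Gating `editable` on the field type mirrors the diff: the same bit position can be set on other field types, where it should not be read as "editable".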
@@ -981,6 +984,7 @@ def _serialize_widgets(doc) -> list[dict]:
"rect": [round(c, 2) for c in (w.rect or fitz.Rect())],
"option_label": option_label,
"option_value": option_value,
"editable": editable_combo,
"required": required,
"readonly": readonly,
"multiline": multiline,
+31 -1
@@ -203,7 +203,32 @@ function buildField(f, isRadioGroup) {
wrap.appendChild(lbl);
}
return wrap;
} else if (f.type === "combobox" && f.editable) {
  // Editable combobox: free-text input with the choice list as suggestions.
  // The user can pick from the list OR type something not in it.
  input = document.createElement("input");
  input.type = "text";
  input.dataset.fieldName = f.name;
  input.dataset.fieldType = "combobox";
  input.value = f.value || "";
  input.style.width = "100%";
  const listId = `dl-${f.name.replace(/[^A-Za-z0-9_-]/g, "_")}`;
  input.setAttribute("list", listId);
  const dl = document.createElement("datalist");
  dl.id = listId;
  for (const c of f.choices || []) {
    const opt = document.createElement("option");
    if (Array.isArray(c)) opt.value = c[0];
    else opt.value = c;
    dl.appendChild(opt);
  }
  wrap.appendChild(dl);
  const hint = document.createElement("small");
  hint.style.cssText = "color:var(--muted);display:block;margin-top:.2rem";
  hint.textContent = "Pick from suggestions or type any value";
  wrap.appendChild(hint);
} else if (f.type === "listbox" || f.type === "combobox") {
// Strict choice list (combobox without Edit flag, or any listbox).
input = document.createElement("select");
input.dataset.fieldName = f.name;
input.dataset.fieldType = f.type;
@@ -213,7 +238,6 @@ function buildField(f, isRadioGroup) {
input.appendChild(empty);
for (const c of f.choices || []) {
const opt = document.createElement("option");
// choice_values entries can be string or [value, label]
if (Array.isArray(c)) {
opt.value = c[0]; opt.textContent = c[1] || c[0];
} else {
@@ -222,6 +246,12 @@ function buildField(f, isRadioGroup) {
if (f.value === opt.value) opt.selected = true;
input.appendChild(opt);
}
const hint = document.createElement("small");
hint.style.cssText = "color:var(--muted);display:block;margin-top:.2rem";
hint.textContent = "Choices are defined inside the PDF; only these values are accepted";
wrap.appendChild(input);
wrap.appendChild(hint);
return wrap;
} else if (f.multiline) {
input = document.createElement("textarea");
input.rows = 3;