Fix PDF to Excel for borderless tables

2026-07-01 23:17:37 +08:00 · 2026-05-01 17:52:36 +07:00
parent 705c6b4fd0
commit df12bc5f38
4 changed files with 113 additions and 10 deletions
@@ -2,6 +2,15 @@

 All notable changes to **Your Everyday Tools** are documented here. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project loosely follows [Semantic Versioning](https://semver.org/).

+## [0.6.2] — 2026-04-29
+
+### Fixed
+- **PDF to Excel: now finds borderless tables.** Users reported that the same PDF returned "no tables found" in PDF→Excel while PDF→Word (Layout mode) extracted its tables successfully. Root cause: PyMuPDF's `find_tables()` defaults to `strategy="lines"`, which only detects tables with visible borders, whereas `pdf2docx` (used by PDF→Word) detects both ruled and borderless tables. PDF→Excel now exposes a **table detection strategy** option:
+  - **Auto** *(default)* — tries lines first, falls back to text-alignment if no ruled tables are found on a page. Because the text fallback runs only when line detection finds nothing, the false-positive risk on multi-column body text is limited to pages without ruled tables.
+  - **Lines only** — original behavior, conservative.
+  - **Text alignment only** — for borderless tables (financial reports, invoices, schedules).
+- The "no tables found" error message now suggests the alternate strategy or directs users to PDF→Word in Layout mode if even text-strategy detection fails.
+
 ## [0.6.1] — 2026-04-29

 ### Added
@@ -9,6 +18,7 @@ All notable changes to **Your Everyday Tools** are documented here. The format i

 ### Improved
 - **Fill PDF Form: human radio/checkbox labels.** PDF radio buttons store opaque on-state values (often `0`/`1`/`Yes`/arbitrary identifiers) but the human label like "Male" / "Female" is painted on the page as static text *next to* the widget — not part of the field. Form Filler now sniffs that nearby text and shows the human label in the UI, while keeping the PDF on-state value as the actual submitted value (and as a tooltip for power users). Same for checkbox labels. The sniffer correctly handles vertical lists, horizontal rows ("○ Male  ○ Female"), and multi-word labels ("I agree to the terms and conditions"), stopping at gaps > 25pt to avoid grabbing the next widget's label.
+- **Fill PDF Form: editable comboboxes.** PDF combobox fields can be either strict (only the listed choices are accepted) or editable (user can type a custom value not in the list — bit 19 of the field flags). Form Filler now detects this flag and renders editable comboboxes as a free-text input with the listed choices offered as suggestions via `<datalist>`, while strict comboboxes remain `<select>` dropdowns. Both render with a small hint explaining the constraint. Custom values typed into editable fields are written into the PDF correctly.

 ## [0.6.0] — 2026-04-29

@@ -243,9 +243,23 @@ def pdf_to_excel_page():
        title="PDF to Excel",
        description="Extract tables from a PDF into an .xlsx workbook",
        notes=(
-            "<p><strong>Tip:</strong> works best on PDFs with clearly ruled tables. "
-            "For scanned PDFs (images of tables), run them through "
-            "<a href=\"/convert/ocr-pdf\">OCR PDF</a> first so the tool has text to work with.</p>"
+            "<p><strong>How table detection works:</strong> pick one of three detection "
+            "strategies:</p>"
+            "<ul style='margin:.4rem 0 .6rem 1.2rem'>"
+            "<li><strong>Auto (recommended)</strong> — tries ruled-line detection first; if a "
+            "page has no visible table borders, falls back to text-alignment detection (catches "
+            "borderless tables in financial reports, invoices, schedules).</li>"
+            "<li><strong>Lines only</strong> — only tables with visible borders. Most accurate "
+            "but misses borderless tables.</li>"
+            "<li><strong>Text alignment only</strong> — finds tables by detecting columns of "
+            "aligned text. Catches borderless tables but can occasionally produce false positives "
+            "on multi-column body text.</li>"
+            "</ul>"
+            "<p style='font-size:.9em;color:var(--muted)'><strong>Still get \"no tables found\"?</strong> "
+            "Try our <a href='/convert/pdf-to-word'>PDF to Word</a> tool in <em>Layout</em> mode "
+            "instead — it uses <code>pdf2docx</code> which is more aggressive about table "
+            "detection. If your PDF is scanned, run it through "
+            "<a href='/convert/ocr-pdf'>OCR PDF</a> first.</p>"
        ),
        endpoint="/convert/pdf-to-excel",
        accept=".pdf",
@@ -253,6 +267,12 @@ def pdf_to_excel_page():
        options=[
            {"type": "text", "name": "pages", "label": "Pages (leave empty for all)",
             "placeholder": "e.g. 1-3, 5"},
+            {"type": "select", "name": "strategy", "label": "Table detection strategy", "default": "auto",
+             "choices": [
+                 {"value": "auto",  "label": "Auto — lines first, fall back to text alignment"},
+                 {"value": "lines", "label": "Lines only (ruled tables)"},
+                 {"value": "text",  "label": "Text alignment only (borderless tables)"},
+             ]},
            {"type": "select", "name": "mode", "label": "Extraction mode", "default": "tables",
             "choices": [
                 {"value": "tables", "label": "Tables only (recommended)"},
@@ -952,6 +972,9 @@ def pdf_to_excel():

    mode = request.form.get("mode", "tables")
    organize = request.form.get("organize", "per_table")
+    strategy = request.form.get("strategy", "auto")
+    if strategy not in ("auto", "lines", "text"):
+        strategy = "auto"
    pages_spec = request.form.get("pages", "").strip()

    try:
@@ -1006,6 +1029,38 @@ def pdf_to_excel():
            rows.append(parts if parts else [line])
        return rows

+    def _find_tables_robust(page) -> list:
+        """Detect tables on a page according to the user's chosen strategy.
+
+        PyMuPDF's default `find_tables()` only catches ruled (visible-border)
+        tables. Many real-world PDFs use borderless tables where columns are
+        aligned by whitespace — those need `strategy="text"`. The "auto" mode
+        tries lines first and falls back to text detection only when no ruled
+        table is found on the page, which limits the risk of text detection
+        mistaking multi-column body text for a "table".
+        """
+        try:
+            if strategy == "lines":
+                return list(page.find_tables(strategy="lines"))
+            if strategy == "text":
+                return list(page.find_tables(
+                    strategy="text",
+                    vertical_strategy="text",
+                    horizontal_strategy="text",
+                ))
+            # auto: lines, then text fallback
+            tables = list(page.find_tables(strategy="lines"))
+            if tables:
+                return tables
+            return list(page.find_tables(
+                strategy="text",
+                vertical_strategy="text",
+                horizontal_strategy="text",
+            ))
+        except Exception as e:
+            log_error(e, f"find_tables strategy={strategy}")
+            return []
+
    # ── "combined" — stream everything into a single sheet ────────────
    if organize == "combined":
        ws = wb.create_sheet(_safe_name("Extracted"))
@@ -1015,7 +1070,7 @@ def pdf_to_excel():
            page_had_content = False

            if mode in ("tables", "tables_text"):
-                tables = list(page.find_tables())
+                tables = _find_tables_robust(page)
                for t in tables:
                    rows = t.extract()
                    if not rows:
@@ -1043,7 +1098,7 @@ def pdf_to_excel():
            tables_rows = []  # list of (label, rows)

            if mode in ("tables", "tables_text"):
-                for tidx, t in enumerate(page.find_tables(), start=1):
+                for tidx, t in enumerate(_find_tables_robust(page), start=1):
                    rows = t.extract()
                    if rows:
                        tables_rows.append((f"Table {tidx}", rows))
@@ -1076,10 +1131,14 @@ def pdf_to_excel():
    doc.close()

    if not wb.sheetnames:
-        return jsonify(error=(
-            "No tables or text found on the selected pages. "
-            "If this is a scanned PDF, run it through OCR PDF first."
-        )), 400
+        msg = "No tables found on the selected pages."
+        if strategy == "lines":
+            msg += " Try the 'Text alignment' or 'Auto' strategy — your PDF may use borderless tables."
+        elif mode == "tables":
+            msg += " Try the 'Tables, fall back to text rows' mode, or use PDF to Word in Layout mode."
+        else:
+            msg += " If this is a scanned PDF, run it through OCR PDF first; otherwise try PDF to Word in Layout mode."
+        return jsonify(error=msg), 400

    # Auto-size columns on every sheet (cap at 60 chars to avoid absurd widths)
    for ws in wb.worksheets:
@@ -946,6 +946,9 @@ def _serialize_widgets(doc) -> list[dict]:
            required = bool(flags & 2)        # bit 2 = required
            readonly = bool(flags & 1)        # bit 1 = read-only
            multiline = bool(flags & (1 << 12))  # bit 13 = multiline (text only)
+            # PDF spec bit 19 (Ff 1<<18) = combobox is editable (user can type
+            # values outside the choice list). Set only on combo fields.
+            editable_combo = (ftype == "combobox") and bool(flags & (1 << 18))

            # Choice fields expose `choice_values`; treat None as empty list
            choices = list(w.choice_values or []) if hasattr(w, "choice_values") else []
@@ -981,6 +984,7 @@ def _serialize_widgets(doc) -> list[dict]:
                "rect": [round(c, 2) for c in (w.rect or fitz.Rect())],
                "option_label": option_label,
                "option_value": option_value,
+                "editable": editable_combo,
                "required": required,
                "readonly": readonly,
                "multiline": multiline,
@@ -203,7 +203,32 @@ function buildField(f, isRadioGroup) {
            wrap.appendChild(lbl);
        }
        return wrap;
+    } else if (f.type === "combobox" && f.editable) {
+        // Editable combobox: free-text input with the choice list as suggestions.
+        // The user can pick from the list OR type something not in it.
+        input = document.createElement("input");
+        input.type = "text";
+        input.dataset.fieldName = f.name;
+        input.dataset.fieldType = "combobox";
+        input.value = f.value || "";
+        input.style.width = "100%";
+        const listId = `dl-${f.name.replace(/[^A-Za-z0-9_-]/g, "_")}`;
+        input.setAttribute("list", listId);
+        const dl = document.createElement("datalist");
+        dl.id = listId;
+        for (const c of f.choices || []) {
+            const opt = document.createElement("option");
+            if (Array.isArray(c)) opt.value = c[0];
+            else opt.value = c;
+            dl.appendChild(opt);
+        }
+        wrap.appendChild(dl);
+        const hint = document.createElement("small");
+        hint.style.cssText = "color:var(--muted);display:block;margin-top:.2rem";
+        hint.textContent = "Pick from suggestions or type any value";
+        wrap.appendChild(hint);
    } else if (f.type === "listbox" || f.type === "combobox") {
+        // Strict choice list (combobox without Edit flag, or any listbox).
        input = document.createElement("select");
        input.dataset.fieldName = f.name;
        input.dataset.fieldType = f.type;
@@ -213,7 +238,6 @@ function buildField(f, isRadioGroup) {
        input.appendChild(empty);
        for (const c of f.choices || []) {
            const opt = document.createElement("option");
-            // choice_values entries can be string or [value, label]
            if (Array.isArray(c)) {
                opt.value = c[0]; opt.textContent = c[1] || c[0];
            } else {
@@ -222,6 +246,12 @@ function buildField(f, isRadioGroup) {
            if (f.value === opt.value) opt.selected = true;
            input.appendChild(opt);
        }
+        const hint = document.createElement("small");
+        hint.style.cssText = "color:var(--muted);display:block;margin-top:.2rem";
+        hint.textContent = "Choices are defined inside the PDF; only these values are accepted";
+        wrap.appendChild(input);
+        wrap.appendChild(hint);
+        return wrap;
    } else if (f.multiline) {
        input = document.createElement("textarea");
        input.rows = 3;