mirror of
https://codeberg.org/listyantidewi/your-everyday-tools.git
synced 2026-07-01 23:17:37 +08:00
fix pdf to excel for borderless table
@@ -2,6 +2,15 @@
 
 All notable changes to **Your Everyday Tools** are documented here. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project loosely follows [Semantic Versioning](https://semver.org/).
 
+## [0.6.2] — 2026-04-29
+
+### Fixed
+- **PDF to Excel: now finds borderless tables.** Users were reporting that the same PDF returned "no tables found" in PDF→Excel but PDF→Word (Layout mode) successfully extracted tables. Root cause: PyMuPDF's `find_tables()` defaults to `strategy="lines"` which only detects tables with visible borders, while `pdf2docx` (used by PDF→Word) detects both ruled and borderless tables. PDF→Excel now exposes a **table detection strategy** option:
+  - **Auto** *(default)* — tries lines first, falls back to text-alignment if no ruled tables are found. Because text-alignment runs only as a fallback, the risk of false positives on multi-column body text is lower than with text-alignment alone, though not zero.
+  - **Lines only** — original behavior, conservative.
+  - **Text alignment only** — for borderless tables (financial reports, invoices, schedules).
+- The "no tables found" error message now suggests the alternate strategy or directs users to PDF→Word in Layout mode if even text-strategy detection fails.
+
 ## [0.6.1] — 2026-04-29
 
 ### Added
@@ -9,6 +18,7 @@ All notable changes to **Your Everyday Tools** are documented here. The format i
 
 ### Improved
 - **Fill PDF Form: human radio/checkbox labels.** PDF radio buttons store opaque on-state values (often `0`/`1`/`Yes`/arbitrary identifiers) but the human label like "Male" / "Female" is painted on the page as static text *next to* the widget — not part of the field. Form Filler now sniffs that nearby text and shows the human label in the UI, while keeping the PDF on-state value as the actual submitted value (and as a tooltip for power users). Same for checkbox labels. The sniffer correctly handles vertical lists, horizontal rows ("○ Male ○ Female"), and multi-word labels ("I agree to the terms and conditions"), stopping at gaps > 25pt to avoid grabbing the next widget's label.
+- **Fill PDF Form: editable comboboxes.** PDF combobox fields can be either strict (only the listed choices are accepted) or editable (user can type a custom value not in the list — bit 19 of the field flags). Form Filler now detects this flag and renders editable comboboxes as a free-text input with the listed choices offered as suggestions via `<datalist>`, while strict comboboxes remain `<select>` dropdowns. Both render with a small hint explaining the constraint. Custom values typed into editable fields are written into the PDF correctly.
 
 ## [0.6.0] — 2026-04-29
 
+68
-9
@@ -243,9 +243,23 @@ def pdf_to_excel_page():
         title="PDF to Excel",
         description="Extract tables from a PDF into an .xlsx workbook",
         notes=(
-            "<p><strong>Tip:</strong> works best on PDFs with clearly ruled tables. "
-            "For scanned PDFs (images of tables), run them through "
-            "<a href=\"/convert/ocr-pdf\">OCR PDF</a> first so the tool has text to work with.</p>"
+            "<p><strong>How table detection works:</strong> we try both detection strategies in "
+            "order of accuracy:</p>"
+            "<ul style='margin:.4rem 0 .6rem 1.2rem'>"
+            "<li><strong>Auto (recommended)</strong> — tries ruled-line detection first; if a "
+            "page has no visible table borders, falls back to text-alignment detection (catches "
+            "borderless tables in financial reports, invoices, schedules).</li>"
+            "<li><strong>Lines only</strong> — only tables with visible borders. Most accurate "
+            "but misses borderless tables.</li>"
+            "<li><strong>Text alignment only</strong> — finds tables by detecting columns of "
+            "aligned text. Catches borderless tables but can occasionally false-positive on "
+            "multi-column body text.</li>"
+            "</ul>"
+            "<p style='font-size:.9em;color:var(--muted)'><strong>Still get \"no tables found\"?</strong> "
+            "Try our <a href='/convert/pdf-to-word'>PDF to Word</a> tool in <em>Layout</em> mode "
+            "instead — it uses <code>pdf2docx</code> which is more aggressive about table "
+            "detection. If your PDF is scanned, run it through "
+            "<a href='/convert/ocr-pdf'>OCR PDF</a> first.</p>"
         ),
         endpoint="/convert/pdf-to-excel",
         accept=".pdf",
@@ -253,6 +267,12 @@ def pdf_to_excel_page():
         options=[
             {"type": "text", "name": "pages", "label": "Pages (leave empty for all)",
              "placeholder": "e.g. 1-3, 5"},
+            {"type": "select", "name": "strategy", "label": "Table detection strategy", "default": "auto",
+             "choices": [
+                 {"value": "auto", "label": "Auto — lines first, fall back to text alignment"},
+                 {"value": "lines", "label": "Lines only (ruled tables)"},
+                 {"value": "text", "label": "Text alignment only (borderless tables)"},
+             ]},
             {"type": "select", "name": "mode", "label": "Extraction mode", "default": "tables",
              "choices": [
                  {"value": "tables", "label": "Tables only (recommended)"},
@@ -952,6 +972,9 @@ def pdf_to_excel():
 
     mode = request.form.get("mode", "tables")
     organize = request.form.get("organize", "per_table")
+    strategy = request.form.get("strategy", "auto")
+    if strategy not in ("auto", "lines", "text"):
+        strategy = "auto"
     pages_spec = request.form.get("pages", "").strip()
 
     try:
@@ -1006,6 +1029,38 @@ def pdf_to_excel():
             rows.append(parts if parts else [line])
         return rows
 
+    def _find_tables_robust(page) -> list:
+        """Detect tables on a page according to the user's chosen strategy.
+
+        PyMuPDF's default `find_tables()` only catches ruled (visible-border)
+        tables. Many real-world PDFs use borderless tables where columns are
+        aligned by whitespace — those need `strategy="text"`. The "auto" mode
+        tries lines first and only falls back to text-based when nothing is
+        found, which avoids the false-positive risk of text-detection picking
+        up multi-column body text as a "table".
+        """
+        try:
+            if strategy == "lines":
+                return list(page.find_tables(strategy="lines"))
+            if strategy == "text":
+                return list(page.find_tables(
+                    strategy="text",
+                    vertical_strategy="text",
+                    horizontal_strategy="text",
+                ))
+            # auto: lines, then text fallback
+            tables = list(page.find_tables(strategy="lines"))
+            if tables:
+                return tables
+            return list(page.find_tables(
+                strategy="text",
+                vertical_strategy="text",
+                horizontal_strategy="text",
+            ))
+        except Exception as e:
+            log_error(e, f"find_tables strategy={strategy}")
+            return []
+
     # ── "combined" — stream everything into a single sheet ────────────
     if organize == "combined":
         ws = wb.create_sheet(_safe_name("Extracted"))
@@ -1015,7 +1070,7 @@ def pdf_to_excel():
             page_had_content = False
 
             if mode in ("tables", "tables_text"):
-                tables = list(page.find_tables())
+                tables = _find_tables_robust(page)
                 for t in tables:
                     rows = t.extract()
                     if not rows:
@@ -1043,7 +1098,7 @@ def pdf_to_excel():
             tables_rows = []  # list of (label, rows)
 
             if mode in ("tables", "tables_text"):
-                for tidx, t in enumerate(page.find_tables(), start=1):
+                for tidx, t in enumerate(_find_tables_robust(page), start=1):
                     rows = t.extract()
                     if rows:
                         tables_rows.append((f"Table {tidx}", rows))
@@ -1076,10 +1131,14 @@ def pdf_to_excel():
     doc.close()
 
     if not wb.sheetnames:
-        return jsonify(error=(
-            "No tables or text found on the selected pages. "
-            "If this is a scanned PDF, run it through OCR PDF first."
-        )), 400
+        msg = "No tables found on the selected pages."
+        if strategy == "lines":
+            msg += " Try the 'Text alignment' or 'Auto' strategy — your PDF may use borderless tables."
+        elif mode == "tables":
+            msg += " Try the 'Tables, fall back to text rows' mode, or use PDF to Word in Layout mode."
+        else:
+            msg += " If this is a scanned PDF, run it through OCR PDF first; otherwise try PDF to Word in Layout mode."
+        return jsonify(error=msg), 400
 
     # Auto-size columns on every sheet (cap at 60 chars to avoid absurd widths)
     for ws in wb.worksheets:
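The lines-then-text fallback added in `_find_tables_robust` can be tried outside the Flask view. What follows is a minimal sketch under stated assumptions, not the project's code: `find_tables_with_strategy` and `VALID_STRATEGIES` are hypothetical names, and `page` is assumed to be any object with a PyMuPDF-style `find_tables(strategy=...)` method whose result can be iterated, as the commit itself assumes.

```python
from typing import Any, List

# Hypothetical constant mirroring the values validated in pdf_to_excel().
VALID_STRATEGIES = ("auto", "lines", "text")


def find_tables_with_strategy(page: Any, strategy: str = "auto") -> List[Any]:
    """Return the tables found on `page` for the requested strategy.

    Unknown strategy names are treated as "auto", matching the view's
    validation of the form field.
    """
    if strategy not in VALID_STRATEGIES:
        strategy = "auto"

    if strategy in ("lines", "text"):
        # Explicit choice: run exactly one detector.
        return list(page.find_tables(strategy=strategy))

    # auto: prefer ruled-line detection; use text alignment only when the
    # page yields no ruled tables at all.
    tables = list(page.find_tables(strategy="lines"))
    if tables:
        return tables
    return list(page.find_tables(strategy="text"))


# Possible usage with PyMuPDF installed ("report.pdf" is a placeholder path):
#   import fitz
#   doc = fitz.open("report.pdf")
#   tables = find_tables_with_strategy(doc[0], "auto")
```

The sketch passes only `strategy`; the commit additionally passes `vertical_strategy="text"` and `horizontal_strategy="text"` on the text path, which is kept out here for brevity.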
@@ -946,6 +946,9 @@ def _serialize_widgets(doc) -> list[dict]:
             required = bool(flags & 2)  # bit 2 = required
             readonly = bool(flags & 1)  # bit 1 = read-only
             multiline = bool(flags & (1 << 12))  # bit 13 = multiline (text only)
+            # PDF spec bit 19 (Ff 1<<18) = combobox is editable (user can type
+            # values outside the choice list). Set only on combo fields.
+            editable_combo = (ftype == "combobox") and bool(flags & (1 << 18))
 
             # Choice fields expose `choice_values`; treat None as empty list
             choices = list(w.choice_values or []) if hasattr(w, "choice_values") else []
@@ -981,6 +984,7 @@ def _serialize_widgets(doc) -> list[dict]:
                 "rect": [round(c, 2) for c in (w.rect or fitz.Rect())],
                 "option_label": option_label,
                 "option_value": option_value,
+                "editable": editable_combo,
                 "required": required,
                 "readonly": readonly,
                 "multiline": multiline,
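The flag tests in `_serialize_widgets` rely on the PDF specification numbering bits from 1, so "bit N" corresponds to the mask `1 << (N - 1)`. The sketch below restates that mapping in isolation; `decode_field_flags` and the `FF_*` constants are hypothetical names introduced for illustration, and the function takes a plain integer rather than a PyMuPDF widget.

```python
from typing import Dict

# The PDF specification counts flag bits from 1, so bit N is 1 << (N - 1).
FF_READ_ONLY = 1 << 0    # bit 1
FF_REQUIRED = 1 << 1     # bit 2
FF_MULTILINE = 1 << 12   # bit 13 (meaningful for text fields)
FF_EDIT = 1 << 18        # bit 19 (meaningful for combo boxes)


def decode_field_flags(flags: int, field_type: str) -> Dict[str, bool]:
    """Decode the subset of field flags that the form filler uses."""
    return {
        "readonly": bool(flags & FF_READ_ONLY),
        "required": bool(flags & FF_REQUIRED),
        "multiline": bool(flags & FF_MULTILINE),
        # The Edit bit only counts when the field is a combo box.
        "editable": field_type == "combobox" and bool(flags & FF_EDIT),
    }


# A required, editable combo box:
info = decode_field_flags(FF_REQUIRED | FF_EDIT, "combobox")
```

Gating `editable` on the field type mirrors the commit: the same bit position on a non-combo field is ignored rather than reported as editable.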
@@ -203,7 +203,32 @@ function buildField(f, isRadioGroup) {
       wrap.appendChild(lbl);
     }
     return wrap;
+  } else if (f.type === "combobox" && f.editable) {
+    // Editable combobox: free-text input with the choice list as suggestions.
+    // The user can pick from the list OR type something not in it.
+    input = document.createElement("input");
+    input.type = "text";
+    input.dataset.fieldName = f.name;
+    input.dataset.fieldType = "combobox";
+    input.value = f.value || "";
+    input.style.width = "100%";
+    const listId = `dl-${f.name.replace(/[^A-Za-z0-9_-]/g, "_")}`;
+    input.setAttribute("list", listId);
+    const dl = document.createElement("datalist");
+    dl.id = listId;
+    for (const c of f.choices || []) {
+      const opt = document.createElement("option");
+      if (Array.isArray(c)) opt.value = c[0];
+      else opt.value = c;
+      dl.appendChild(opt);
+    }
+    wrap.appendChild(dl);
+    const hint = document.createElement("small");
+    hint.style.cssText = "color:var(--muted);display:block;margin-top:.2rem";
+    hint.textContent = "Pick from suggestions or type any value";
+    wrap.appendChild(hint);
   } else if (f.type === "listbox" || f.type === "combobox") {
+    // Strict choice list (combobox without Edit flag, or any listbox).
     input = document.createElement("select");
     input.dataset.fieldName = f.name;
     input.dataset.fieldType = f.type;
@@ -213,7 +238,6 @@ function buildField(f, isRadioGroup) {
     input.appendChild(empty);
     for (const c of f.choices || []) {
       const opt = document.createElement("option");
-      // choice_values entries can be string or [value, label]
       if (Array.isArray(c)) {
         opt.value = c[0]; opt.textContent = c[1] || c[0];
       } else {
@@ -222,6 +246,12 @@ function buildField(f, isRadioGroup) {
       if (f.value === opt.value) opt.selected = true;
       input.appendChild(opt);
     }
+    const hint = document.createElement("small");
+    hint.style.cssText = "color:var(--muted);display:block;margin-top:.2rem";
+    hint.textContent = "Choices are defined inside the PDF; only these values are accepted";
+    wrap.appendChild(input);
+    wrap.appendChild(hint);
+    return wrap;
   } else if (f.multiline) {
     input = document.createElement("textarea");
     input.rows = 3;