How to Extract Tables from PDF to Excel (Free)
A step-by-step guide to extracting tables from PDF to Excel. Compares manual copy-paste, online converters, and AI.
You have a PDF with a table — an invoice with line items, a financial report with quarterly figures, a shipping manifest with package details. You need that data in Excel so you can sort it, filter it, run formulas, or import it into your accounting software. This should be simple, but anyone who has tried knows it is not.
PDFs were designed for viewing and printing, not for data extraction. When you copy a table from a PDF and paste it into Excel, the columns collapse into a single cell, the rows merge together, and you spend the next thirty minutes manually reformatting everything. Sometimes the data does not even paste in the right order.
This guide covers three approaches to extracting tables from PDFs into Excel, starting with the most effective.
Why PDF Tables Are Hard to Extract
To understand why this is difficult, it helps to know how PDFs store data. Unlike a spreadsheet where data lives in defined cells with rows and columns, a PDF stores text as positioned characters on a page. There is no concept of a "table" in the PDF format — what looks like a table to your eyes is just text placed at specific coordinates with lines drawn between them.
When you select and copy text from a PDF, your PDF viewer tries to reconstruct the reading order from those coordinates. For simple paragraphs, this works well. For tables, it often fails because the viewer cannot reliably determine which text belongs in which column, especially when columns have varying widths or when cells contain multi-line text.
This is why copy-paste from PDF to Excel produces garbled results. The structure that your eyes can see instantly is invisible to the software.
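To make the coordinate problem concrete, here is a minimal sketch in Python. The word list and the function name are hypothetical, and real PDF parsers are considerably more involved; the sketch only shows how a tool has to guess rows from y positions and cell order from x positions.

```python
from collections import defaultdict

# Hypothetical positioned words as (x, y, text) tuples. A PDF stores text in a
# comparable way: coordinates plus characters, with no table structure.
words = [
    (50, 700, "Item"), (200, 700, "Qty"), (300, 700, "Price"),
    (50, 680, "Widget"), (200, 680, "2"), (300, 680, "9.50"),
    (50, 660, "Gadget"), (200, 660, "10"), (300, 660, "1.25"),
]

def rows_from_positions(words, y_tolerance=3):
    """Group words into rows by similar y, then order each row by x."""
    buckets = defaultdict(list)
    for x, y, text in words:
        # Words whose y values fall in the same bucket are treated as one row.
        buckets[round(y / y_tolerance)].append((x, text))
    # PDF y coordinates grow upward, so the largest y is the top row.
    ordered_keys = sorted(buckets, reverse=True)
    return [[text for _, text in sorted(buckets[k])] for k in ordered_keys]

table = rows_from_positions(words)
for row in table:
    print(row)
```

This works on a clean grid. If one cell were empty, its row would simply come back shorter, with no indication of which column is missing, which is the kind of misalignment copy-paste produces.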
Scanned PDFs make this worse. A scanned PDF is just an image: the "table" you see is not stored as text at all. It is pixel data, indistinguishable to the computer from any other image. Standard copy-paste produces nothing, and any tool that relies on the PDF text layer will fail as well.
Method 1: AI-Powered Table Extraction (Recommended)
AI extraction tools solve the fundamental problem by using vision models that can see the table the same way you do. Instead of trying to reconstruct table structure from character positions, the AI looks at the entire page as an image and identifies rows, columns, headers, and cell values.
This approach works with both digital PDFs (where text is selectable) and scanned PDFs (where the table is just an image). The AI handles the additional challenges of scanned documents — skewed pages, uneven lighting, and low resolution — that make traditional extraction methods fail completely.
How to extract tables from PDF using AI extraction:
1. Upload your PDF to an AI extraction tool like DocPrivy
2. The AI analyzes the document and identifies all tables
3. Each table is extracted with correct column headers, row structure, and cell values
4. Review the extracted data in a preview table
5. Export directly to XLSX (Excel) or CSV
The key advantage is accuracy. AI extraction preserves the table structure that copy-paste destroys. Numbers stay in the right columns, dates remain formatted correctly, and multi-line cell content is handled properly.
You can also customize the output before exporting: reorder columns by dragging, hide columns you do not need, or merge multiple columns into one. This means the Excel file matches exactly what your downstream workflow expects.
Method 2: Online PDF to Excel Converters
Online converters like Smallpdf, ILovePDF, or Adobe Acrobat Online offer PDF-to-Excel conversion. You upload a PDF, and they return an XLSX file with the content arranged in cells.
These tools work by analyzing the text positions in the PDF and attempting to reconstruct the table grid. For simple tables with clear borders and consistent formatting, the results can be decent. For complex tables — merged cells, nested headers, tables that span multiple pages, or tables without visible borders — the results are often unusable.
Limitations of online converters:
They convert the entire page, not just the table. Your Excel file will contain headers, footers, page numbers, and other non-table content mixed in with your data.
They struggle with scanned PDFs. If your PDF is a scan or photo, most converters either fail entirely or produce very poor results because they rely on the PDF text layer rather than visual analysis.
Many have strict file size or page limits on their free tiers, pushing you toward paid subscriptions for regular use.
Privacy is a concern. Your PDF is uploaded to their servers and processed there. For financial documents or contracts containing sensitive data, this may not be acceptable.
Method 3: Manual Copy-Paste with Cleanup
The manual approach involves copying the table from your PDF viewer and pasting it into Excel, then fixing the formatting by hand. This is the most time-consuming method but requires no tools beyond what you already have.
Steps for manual extraction:
1. Open the PDF in Adobe Acrobat Reader, your browser, or another PDF viewer
2. Select the table text. Try to select just the table area, not surrounding text
3. Copy the selection (Ctrl+C or Cmd+C)
4. Open Excel and paste into cell A1
5. Use Text to Columns (Data tab → Text to Columns) to split the pasted text into separate columns
6. Manually fix any rows that did not split correctly
7. Delete empty rows and clean up formatting
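If you are comfortable with a little scripting, the splitting and empty-row cleanup steps can be approximated outside Excel. The following Python sketch assumes the pasted text separates columns with tabs or with runs of two or more spaces. That assumption often fails on real pastes, and the function name and sample text are illustrative only.

```python
import re

def split_pasted_rows(pasted, min_gap=2):
    """Split each pasted line into cells on tabs or runs of min_gap+ spaces."""
    pattern = re.compile(r"\t+| {%d,}" % min_gap)
    rows = []
    for line in pasted.splitlines():
        line = line.strip()
        if not line:
            continue  # drop empty rows, like the final cleanup step
        rows.append([cell.strip() for cell in pattern.split(line)])
    return rows

# Hypothetical text as it might arrive from a PDF viewer's clipboard.
pasted = """Item          Qty   Unit Price
Blue widget   2     9.50

Large gadget  10    1.25"""

rows = split_pasted_rows(pasted)
for row in rows:
    print(row)
```

Splitting on gaps of two or more spaces keeps values such as "Blue widget" and "Unit Price" in one cell, which a plain single-space delimiter would break apart.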
This method works for simple tables with a few rows. For tables with more than about 20 rows, the cleanup time makes this approach impractical. For tables from scanned PDFs, it does not work at all since there is no selectable text.
One improvement: try pasting into Google Sheets first instead of Excel. Google Sheets sometimes handles the pasted formatting slightly better, and you can then download the result as XLSX.
Comparison: Which Method Should You Use?
For a quick comparison:
AI extraction: Best accuracy, works with scanned PDFs, preserves table structure, customizable output. Free with DocPrivy. Best for invoices, financial reports, and any document where accuracy matters.
Online converters: Moderate accuracy for simple tables, fails with scanned PDFs, limited customization. Free tiers have restrictions. Best for quick one-off conversions of simple digital PDFs.
Manual copy-paste: Low accuracy, does not work with scanned PDFs, very time-consuming. Always free. Best only when you have a single simple table and no other tools available.
If you extract tables from PDFs regularly — even a few times a month — AI extraction saves enough time to be worth using from the first document. The accuracy difference is not marginal; it is the difference between usable data and data that needs extensive manual correction.
Common Table Extraction Challenges
Regardless of which method you use, certain table formats are more difficult than others:
Merged cells: When a header spans multiple columns, extraction tools may duplicate the header or assign it to the wrong column. AI extraction handles this better than other methods because it can see the visual layout.
Multi-page tables: Tables that continue across page breaks are challenging because the header row may not repeat on subsequent pages, and page footers may interrupt the data. AI extraction often handles this by tracking table structure across pages.
Tables without borders: Some documents use spacing instead of lines to delineate columns. This makes it harder for any tool to determine column boundaries. AI extraction is more reliable here because it uses visual cues beyond just borders.
Mixed content: Tables that contain both text and numbers, or tables where some cells contain multi-line text while others contain single values, can cause alignment issues.
Nested tables: Some documents contain tables within tables — a summary table that contains a detail table in one of its cells. These are challenging for any automated extraction tool.
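For most of these challenges, AI extraction tends to produce the best results because the vision model works from the visual layout rather than relying on text positions or border detection alone. Nested tables remain difficult for every automated approach, so review those results by hand.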
After Extraction: Cleaning Up Your Excel Data
Even with the best extraction method, you may want to clean up the data in Excel:
Remove extra whitespace: Use TRIM() to remove leading and trailing spaces from cells. Extracted text sometimes contains invisible whitespace characters that affect sorting and matching.
Fix number formatting: Extracted numbers may be formatted as text if the original PDF stored them as text strings. Select the column, go to Data → Text to Columns → Finish, and Excel will convert text to numbers.
Standardize dates: Different documents use different date formats. Use Find & Replace or a formula to standardize to your preferred format. YYYY-MM-DD (the ISO 8601 format) is unambiguous and also sorts correctly as plain text.
Split or merge columns: If a "Name" column contains both first and last names, use Text to Columns with a space delimiter to split them. If you need to combine columns (like separate "City" and "Country" columns into a single address field), use CONCATENATE() or the & operator.
Verify financial math: After extraction, verify that columns of numbers add up correctly. Use SUM() formulas to check that line items sum to stated subtotals, and that subtotal plus tax equals the stated total.
With AI extraction tools like DocPrivy, you can do much of this customization before exporting — reordering, hiding, and merging columns in the preview — so the Excel file arrives closer to its final form.
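If you prefer to script the cleanup instead of doing it in Excel, the trimming, number conversion, and date standardization steps above can be sketched in Python. This is a minimal sketch, not a complete cleaner: the function names are illustrative, the number parser assumes US-style separators, and the list of accepted date formats is an assumption you would adjust to your documents. An ambiguous date such as 03/04/2024 is read month-first here.

```python
import re
from datetime import datetime

def clean_text(value):
    """Like TRIM(): collapse whitespace runs to one space and strip the ends.
    Python's \\s also matches non-breaking spaces, which often hide in extracted text."""
    return re.sub(r"\s+", " ", value).strip()

def to_number(value):
    """Convert a text cell such as ' 1,250.00 ' to a float (US-style separators assumed)."""
    return float(clean_text(value).replace(",", ""))

def to_iso_date(value, input_formats=("%m/%d/%Y", "%d.%m.%Y", "%Y-%m-%d")):
    """Try each assumed input format in order and return the date as YYYY-MM-DD."""
    text = clean_text(value)
    for fmt in input_formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value!r}")

print(clean_text("  Acme\u00a0 Corp "))
print(to_number(" 1,250.00 "))
print(to_iso_date("03/15/2024"))
print(to_iso_date("15.03.2024"))
```

Raising an error on an unrecognized date, rather than guessing, keeps bad values visible instead of silently writing a wrong date into the sheet.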
Special Case: Financial Tables
Financial tables in invoices, statements, and reports deserve special attention because errors have direct monetary consequences.
For invoice line item tables, verify that each row's amount equals quantity times unit price. Some AI extraction tools flag rows where this does not hold; make sure every flagged row is actually reviewed before the data is used.
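The line item check can also be scripted. The sketch below assumes each extracted row is a (quantity, unit price, amount) tuple of strings, which is a hypothetical layout, and it uses Decimal rather than float so that currency values compare exactly.

```python
from decimal import Decimal

def mismatched_lines(lines, tolerance=Decimal("0.01")):
    """Return indexes of rows where quantity * unit price is off by more than tolerance."""
    bad = []
    for i, (qty, unit_price, amount) in enumerate(lines):
        expected = Decimal(qty) * Decimal(unit_price)
        if abs(expected - Decimal(amount)) > tolerance:
            bad.append(i)
    return bad

# Hypothetical extracted rows: (quantity, unit price, amount).
lines = [
    ("2", "9.50", "19.00"),
    ("10", "1.25", "12.50"),
    ("3", "4.00", "21.00"),  # deliberately wrong: 3 * 4.00 is 12.00
]

print(mismatched_lines(lines))
```

The one-cent tolerance is an assumption that allows for rounding on the invoice; set it to zero if your documents should match exactly.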
For bank statement transaction tables, verify that the opening balance plus all credits minus all debits equals the closing balance. This is a quick sanity check that catches most extraction errors in a single calculation.
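As a rough illustration of that balance check, the following sketch assumes the credits and debits have already been extracted as lists of amount strings. The function name and sample figures are made up for the example.

```python
from decimal import Decimal

def statement_balances(opening, credits, debits, closing):
    """True when opening + sum(credits) - sum(debits) equals closing exactly."""
    total_credits = sum((Decimal(c) for c in credits), Decimal("0"))
    total_debits = sum((Decimal(d) for d in debits), Decimal("0"))
    return Decimal(opening) + total_credits - total_debits == Decimal(closing)

# Hypothetical statement: 1000.00 + 290.50 - 90.25 = 1200.25
ok = statement_balances("1000.00", ["250.00", "40.50"], ["90.25"], "1200.25")

# The same statement with one credit misread during extraction (40.50 read as 46.50).
broken = statement_balances("1000.00", ["250.00", "46.50"], ["90.25"], "1200.25")

print(ok, broken)
```

A failed check tells you that something is wrong but not which row, so it works best as a first pass before a row-by-row review.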
For financial reports with summary and detail levels, verify that detail rows sum to the stated summary totals. Misaligned rows — where a detail row was assigned to the wrong summary group — are hard to detect without this check.
Currency and decimal handling deserves attention for international financial documents. Many European locales, including Germany, Italy, and Spain, use a period as the thousands separator and a comma as the decimal separator ("1.250,00" for one thousand two hundred fifty). If this is interpreted as a US-format number, the value becomes 1.25, a 1000x error. AI extraction tools can handle locale-aware number formatting, but verify that the currency and decimal interpretation matches the document's country of origin.
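A small sketch shows how the same string yields different values depending on which separators are assumed. The function name and its defaults are illustrative; the point is that the separators should be stated explicitly rather than guessed.

```python
from decimal import Decimal

def parse_amount(text, decimal_sep=",", thousands_sep="."):
    """Parse a formatted amount using explicitly stated separators."""
    cleaned = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)

# Correct reading of a European-format amount.
european = parse_amount("1.250,00")

# Correct reading of a US-format amount.
us = parse_amount("1,250.00", decimal_sep=".", thousands_sep=",")

# The European string misread with US separators: the 1000x error.
misread = parse_amount("1.250,00", decimal_sep=".", thousands_sep=",")

print(european, us, misread)
```

Comparing `european` and `misread` reproduces the factor of 1000 described above, which is why a total check against the document's stated sum is worth running after any locale-sensitive extraction.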
Extract Tables from PDF for Free
DocPrivy extracts tables from any PDF — digital or scanned — and exports directly to Excel or CSV. The AI identifies table structure automatically: headers, rows, columns, and cell values are preserved exactly as they appear in the document.
Upload your PDF at docprivy.com, review the extracted table, customize columns if needed, and download your Excel file. No account required, no file stored, no subscription.