
How to Convert Scanned Documents into Structured Data

A practical guide to turning scanned paper documents into organized data that is ready to export to Excel using AI.


You have a stack of paper documents — invoices, receipts, contracts — and you need the data inside them in a spreadsheet or database. Scanning them to PDF is only half the battle. The scanned images contain text that looks readable to humans but is completely opaque to computers. Here is how to bridge that gap and turn scans into structured, usable data.

Understanding the Challenge

When you scan a paper document, the result is a digital photograph — a grid of pixels that represents the page visually. Unlike a digital-native document (a Word file, a PDF generated by accounting software) where text exists as actual character data, a scanned document contains no text at all from the computer's perspective. It is entirely visual information.

This distinction explains why common approaches fail. Opening a scanned PDF and pressing Ctrl+A to select all text yields nothing, because there is no text layer. Copy-pasting fails. Ctrl+F search returns no results. Standard PDF-to-Excel converters produce empty spreadsheets.

Converting scanned documents to structured data therefore requires two separate steps: first, using OCR to extract readable text from the image; second, using AI to understand what that text means and organize it into labeled fields, tables, and key-value pairs. Modern AI extraction tools handle both steps automatically — you upload the scan and get structured data back, with the intermediate OCR step happening invisibly.
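To make the two steps concrete, here is a minimal sketch of what each one produces. The OCR output is a hard-coded sample, and the `structure` function is a toy regular-expression stand-in for the AI step, not how any real extraction tool works. It only illustrates the shape of the input (unlabeled text) and the output (labeled fields). The field names are assumptions chosen for illustration.

```python
import re

# Step 1 output (assumed sample): plain text as an OCR engine might return it.
ocr_text = """ACME LTD
Invoice No: INV-0042
Date: 2026-04-03
Total: 1500.00"""


def structure(text: str) -> dict:
    """Toy stand-in for the AI step: pull labeled values out of raw OCR text."""
    patterns = {
        "invoice_number": r"Invoice No:\s*(\S+)",
        "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
        "total": r"Total:\s*([\d.]+)",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        # A missing label yields None instead of raising an error.
        fields[name] = match.group(1) if match else None
    return fields


print(structure(ocr_text))
```

A rule-based approach like this breaks as soon as a vendor uses a different label or layout, which is the gap AI extraction is meant to close.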

Step 1: Get a Good Scan

The quality of your scan directly affects the quality of data extraction. Follow these guidelines for best results.

Resolution: Scan at 300 DPI or higher. Lower resolutions make character recognition less reliable, especially for small text. Many entry-level scanners default to 200 DPI, which is adequate for ordinary body text but falls short for dense text, small fonts, or documents with fine print (like contracts or tax forms). Set your scanner to 300 DPI as the minimum, and use 600 DPI for documents with very small text.

Contrast: Ensure good contrast between text and background. If the original document is faded — old receipts, carbon copies, lightly printed forms — increase contrast in your scanner settings. Most scanner software includes a contrast adjustment. If the text appears grey rather than black in the scan preview, increase contrast before capturing.

Alignment: Keep documents straight on the scanner bed. Skewed text is harder for OCR engines to read, though modern AI tools handle moderate skew (up to 10-15 degrees) reasonably well. Major misalignment reduces accuracy, particularly for table columns where alignment matters for data structure.

File format: Save as PDF or PNG. Avoid heavy JPEG compression, which introduces artifacts around text edges that confuse character recognition. JPEG is a lossy format that blurs fine detail — exactly the kind of detail that distinguishes characters from each other. If you must use JPEG, set quality above 85%.

Clean originals: Remove staples, unfold creases, and smooth wrinkles before scanning. Physical imperfections translate to digital noise that degrades OCR accuracy. A creased receipt with text along the fold line is significantly harder to process than a flat one.
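To put the resolution guideline in concrete terms, the sketch below estimates how many pixels tall a line of text is at each scan resolution. It relies on the typographic definition of 1 point = 1/72 inch and treats the font's point size as an upper bound on glyph height, since actual letters are somewhat shorter. The function names and the 6-point fine-print example are illustrative assumptions.

```python
def text_height_px(font_pt: float, dpi: int) -> float:
    """Approximate pixel height of text of a given point size at a scan resolution."""
    return font_pt / 72 * dpi  # 1 pt = 1/72 inch


def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


# Compare 6 pt fine print on a US Letter page (8.5 x 11 in) at three resolutions.
for dpi in (200, 300, 600):
    print(dpi, round(text_height_px(6, dpi), 1), page_pixels(8.5, 11, dpi))
```

Doubling the DPI doubles the pixel height of every character but quadruples the number of pixels on the page, which is why 600 DPI is worth reserving for documents with very small text.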

Step 2: Choose Your Extraction Approach

For simple text extraction (search indexing, archival), standard OCR is sufficient. Most scanning apps include built-in OCR that produces searchable PDFs. This is appropriate when you need to find specific text within archived documents but do not need the data organized into fields.

For structured data extraction (getting specific fields into spreadsheet columns), you need AI-powered extraction. This is the right choice when you need the vendor name in one column, the invoice total in another, and each line item as a separate row.

The distinction matters for workflow planning. OCR-only tools are faster, cheaper, and produce output suitable for archival and search. AI extraction tools take more processing time and may have per-page costs, but deliver output you can use directly in accounting, analysis, or database workflows without manual reformatting.

For occasional documents, a free online AI extraction tool handles both purposes. For high-volume operations (hundreds of documents per month), evaluating dedicated tools with batch processing and accounting system integrations makes sense.

Step 3: Extract and Validate

Upload your scanned document to an AI extraction tool. The tool should identify the document type automatically (invoice, receipt, contract, etc.) and extract all relevant fields.

Review the extraction results carefully, paying attention to:

Numbers: Verify that amounts, quantities, and totals are correct. The characters "0" and "O", "1" and "l", and "5" and "S" are common confusion points in OCR. A unit price of $1,500.00 that comes out as $1,S00.00 will be treated as text rather than a number. This kind of error slips past a quick visual check but causes problems downstream.

Dates: Check that date formats were interpreted correctly. Is "03/04/2026" March 4th or April 3rd? Good extraction tools normalize dates to unambiguous formats like YYYY-MM-DD. If the output date looks unusual, verify against the original document.

Special characters: Accented characters, currency symbols, and non-Latin scripts may need verification. Vietnamese diacritics, Chinese characters, and Arabic script all benefit from checking in the extracted output, especially if the original scan is less than ideal quality.

Table structure: Confirm that line items were extracted as separate rows with correct column alignment. A line item table with five products should produce five rows in the output, each with description, quantity, unit price, and amount in separate fields.
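Some of the checks above can be partly automated before the manual review. The following is a minimal sketch, assuming the extraction output gives amounts as strings, dates as numeric strings separated by slashes, and line items as dictionaries with the hypothetical keys `quantity`, `unit_price`, and `amount`. It does not reflect any particular tool's output format.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_amount(raw: str) -> Optional[Decimal]:
    """Return the amount as a Decimal, or None if it is not a clean number."""
    cleaned = raw.strip().lstrip("$").replace(",", "")
    if not re.fullmatch(r"\d+(\.\d+)?", cleaned):
        return None  # e.g. "1,S00.00": a letter crept into the digits
    return Decimal(cleaned)


def is_ambiguous_date(raw: str) -> bool:
    """True if an a/b/yyyy date is valid both as day/month and as month/day."""
    first, second, _year = (int(part) for part in raw.split("/"))
    return first <= 12 and second <= 12 and first != second


def line_item_errors(items: list[dict]) -> list[int]:
    """Indexes of rows that fail to parse or where quantity * unit price != amount."""
    bad = []
    for index, item in enumerate(items):
        qty = parse_amount(item["quantity"])
        price = parse_amount(item["unit_price"])
        amount = parse_amount(item["amount"])
        if None in (qty, price, amount) or qty * price != amount:
            bad.append(index)
    return bad


items = [
    {"quantity": "2", "unit_price": "$1,500.00", "amount": "$3,000.00"},
    {"quantity": "1", "unit_price": "$1,S00.00", "amount": "$1,500.00"},
]
print(line_item_errors(items))          # [1]
print(is_ambiguous_date("03/04/2026"))  # True
```

The exact comparison assumes line amounts are not rounded. For invoices with discounts or per-line tax, a small tolerance would be a more realistic check. Flagged rows and ambiguous dates still need a look at the original document.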

Step 4: Export and Use

Once you have verified the extracted data, export it in the format your downstream system needs.

Excel (XLSX): Best for manual review and ad-hoc analysis. Supports multiple sheets — one for document header fields, one for line items, one for each table in the document. Excel also preserves data types (dates as dates, numbers as numbers) rather than converting everything to text.

CSV: Universal format for database import, accounting systems, and data pipelines. Simple and widely compatible, but lacks support for multiple sheets and may require attention to encoding for non-ASCII characters (like accented letters or currency symbols).

JSON: Ideal for developer workflows, APIs, and automated processing. JSON preserves the hierarchical structure of extracted data (document fields at the top level, line items as an array) without flattening it into rows and columns.

DOCX or PDF: Good for creating formatted reports from extracted data — expense reports, data summaries, or formatted records that need to be shared with others.
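The difference between hierarchical JSON and flat CSV is easiest to see side by side. This sketch uses a hypothetical extracted document (the keys `vendor`, `invoice_number`, and `line_items` are assumptions, not a real tool's schema) and flattens it into one CSV row per line item by repeating the header fields on each row.

```python
import csv
import io
import json

document = {
    "vendor": "Công ty Ánh Dương",
    "invoice_number": "INV-001",
    "line_items": [
        {"description": "Paper", "quantity": 2, "amount": "10.00"},
        {"description": "Toner", "quantity": 1, "amount": "45.50"},
    ],
}


def to_csv_rows(doc: dict) -> list[dict]:
    """Flatten the document: repeat header fields on every line-item row."""
    header = {key: value for key, value in doc.items() if key != "line_items"}
    return [{**header, **item} for item in doc["line_items"]]


def to_csv_text(rows: list[dict]) -> str:
    """Render rows as CSV text, using the first row's keys as column names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = to_csv_rows(document)
print(to_csv_text(rows))
# JSON keeps the nesting, with line items as an array under the document.
print(json.dumps(document, ensure_ascii=False, indent=2))
```

When saving the CSV to a file for Excel, opening it with `open(path, "w", newline="", encoding="utf-8-sig")` writes a byte order mark that helps Excel detect UTF-8, so accented characters such as the Vietnamese vendor name above display correctly.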

Handling Difficult Documents

Not all scanned documents extract cleanly. Here is how to handle common problem cases.

Low-quality scans: Very old documents, carbon copies, fax receipts, and thermal paper receipts often scan poorly due to fading or low original contrast. Increasing contrast and brightness during scanning helps. If you cannot improve the scan quality, try uploading to multiple extraction tools — different models handle degraded documents differently.

Documents with stamps and annotations: Government documents, customs forms, and heavily-used business documents often have stamps, handwritten additions, and corrections overlaid on printed text. AI extraction handles this better than template-based tools because it understands context, but very dense annotation can still cause confusion. If accuracy is critical for heavily annotated documents, verify extracted values against the original.

Multi-page documents with varying layouts: A contract where each page has different content, or a financial report where tables appear only on certain pages, processes best when you let the AI handle the entire multi-page document in one upload. The model maintains context across pages to understand running totals, continued tables, and document structure.

Handwritten content: Printed text extracts reliably. Handwritten text is harder. Modern AI models handle clear handwriting reasonably well, but messy or informal handwriting reduces accuracy significantly. For documents where handwritten fields are critical (signed amounts, handwritten corrections), verify those fields manually.

Common Pitfalls

Trusting extraction blindly: Always review results, especially for financial documents. Even the best AI makes mistakes on poor-quality scans. A 2% error rate sounds small until it means two incorrect values in every hundred fields you extract.

Over-compressing images: JPEG artifacts around text edges cause OCR errors. Use PNG for scans or use minimal JPEG compression. When in doubt, save at higher quality and reduce file size later if needed.

Ignoring confidence indicators: Good extraction tools provide confidence scores or flags. A field marked "NEEDS_REVIEW" means exactly that, so do not skip the review. These indicators tell you where the AI is uncertain, which is exactly where errors are most likely.

Processing too many pages at once: For large documents (50+ pages), consider processing in sections. Extraction quality can vary across pages with different layouts, and processing smaller batches makes review more manageable.

Not keeping originals: Always retain the original scanned document alongside the extracted data. The extraction is a convenience — the original scan is the authoritative record. If a discrepancy is discovered later, you need the original to verify.
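One way to act on confidence indicators is to build a review queue from them. The sketch below assumes each extracted field carries a numeric confidence between 0.0 and 1.0. The `Field` structure and the 0.9 threshold are hypothetical and would need to match whatever your tool actually reports.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # assumed scale: 0.0 (no confidence) to 1.0 (certain)


def needs_review(fields: list[Field], threshold: float = 0.9) -> list[Field]:
    """Return fields below the threshold, least confident first."""
    flagged = [field for field in fields if field.confidence < threshold]
    return sorted(flagged, key=lambda field: field.confidence)


fields = [
    Field("vendor", "Acme Ltd", 0.99),
    Field("total", "1,500.00", 0.72),
    Field("invoice_date", "2026-04-03", 0.88),
]
for field in needs_review(fields):
    print(field.name, field.value, field.confidence)
```

Sorting by confidence puts the fields most likely to be wrong at the top of the queue, so limited review time goes where it matters most.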

Scanning Best Practices for Regular Workflows

For businesses that regularly scan and process documents, establishing consistent scanning practices pays dividends over time.

Batch scanning: Process documents in batches rather than one at a time. Scanning 20 invoices at once and extracting them in a single session is more efficient than processing them individually as they arrive.

Naming conventions: Establish a consistent file naming scheme before uploading to extraction tools. Documents named YYYY-MM-DD_Vendor_InvoiceNumber.pdf are easier to manage than Scan001.pdf.

Scan before you need it: Scan documents when they arrive, not when you need to process them. This prevents the "urgent invoice" problem where poor scan quality delays payment because the document needs to be rescanned.

Quality check at scan time: Look at the scan preview before accepting it. If text is blurry, the page is skewed, or the contrast is poor, rescan immediately. It is much faster to rescan at the source than to deal with poor extraction quality later.
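The naming convention above is simple to enforce with a small helper. This is a sketch under the assumption that you already know the invoice date, vendor, and invoice number when you save the scan. The function name and the choice to replace awkward characters with hyphens are illustrative.

```python
import re
from datetime import date


def scan_filename(invoice_date: date, vendor: str, invoice_number: str) -> str:
    """Build a YYYY-MM-DD_Vendor_InvoiceNumber.pdf file name."""

    def clean(text: str) -> str:
        # Replace spaces, punctuation, and underscores with single hyphens so
        # the underscore stays reserved as the separator between the parts.
        return re.sub(r"[\W_]+", "-", text).strip("-")

    return f"{invoice_date.isoformat()}_{clean(vendor)}_{clean(invoice_number)}.pdf"


print(scan_filename(date(2026, 4, 3), "Acme Ltd.", "INV/0042"))
# 2026-04-03_Acme-Ltd_INV-0042.pdf
```

Putting the ISO date first means that an alphabetical sort of the folder is also a chronological sort.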

Try It Now

DocPrivy handles scanned documents natively. Upload a scanned PDF, JPEG, or PNG image, and the AI will extract structured data from it — even from low-contrast or slightly skewed scans. The AI identifies the document type automatically and extracts all relevant fields, tables, and line items.

Export to XLSX, CSV, DOCX, or JSON, all for free, with no account required. Your documents are processed in memory and immediately discarded — nothing is stored on our servers.
