March 19, 2026 · 10 min read

Your Data Is Not as Private as You Think (and Yes, That Includes Yours)

The harsh truth about what happens to your documents after you hit "upload." Spoiler: nothing good. Here is how to stop being the product.

privacy, security, data, humor

Let me paint a picture. It is 11 PM on a Tuesday. You have a stack of invoices to process. You Google "free PDF to Excel converter," click the first result, and upload your company's financial documents to a website you have never heard of, hosted in a country you could not point to on a map.

Congratulations. You just emailed your bank statements to a stranger. Except worse, because at least with email, you chose who to send them to.

The "Free" Tool Trap

Here is a universal truth of the internet: if you are not paying for the product, you are the product. Or more accurately, your data is the product.

That free PDF converter? It works great. It really does. But somewhere in the terms of service — you know, that novel-length document you scrolled past in 0.3 seconds — there is a paragraph that says something like "we may retain uploaded content to improve our services." Translation: your tax returns are now training data.

This is not hypothetical. In 2023, researchers found that several popular free document tools were storing uploaded files for weeks, sometimes months. Some were sharing aggregated data with advertising partners. One service was literally selling extracted text to data brokers. Your invoice from Dave's Plumbing Services is now part of a marketing dataset. Dave would not be happy about that.

The economics make this inevitable for advertising-supported tools. Data has value. Your financial documents contain extremely high-value signals — income level, spending patterns, business relationships, suppliers. An advertising platform would pay well for this information. A "free" OCR tool that processes millions of documents per month is sitting on a data goldmine, and the terms of service give them permission to mine it.

What Actually Happens When You Upload a Document

When you upload a document to a typical online tool, here is the journey it takes.

First, your file travels from your computer to a server. Hopefully over HTTPS — but not always. Some budget tools still use unencrypted connections, which means anyone sitting on the same coffee shop WiFi could theoretically intercept your file. Fun.

Once it reaches the server, your document is processed. So far, so good. But then what? In many cases, the file is stored, sometimes temporarily, sometimes permanently. It might be cached on a CDN. It might be backed up to another data center. It might be logged for debugging purposes. Your single document can now exist in three to five different locations, each with its own security posture.

And then there are the subprocessors. Your document might pass through an OCR service, a text extraction API, a translation service, and a storage provider. Each one is a separate company with its own privacy practices, its own security team (or lack thereof), and its own incentives. Your document is on a world tour, and you did not even pack its bags.

Even if the primary service has impeccable privacy practices, each subprocessor in the chain introduces additional risk. A breach at any point in that chain exposes your document. And a data breach at a subprocessor you have never heard of, processing documents from a service you used once, might not even make the news.

The Checkbox of Lies

We need to talk about consent. Specifically, the kind of consent that involves a checkbox and 47 pages of legal text.

"I agree to the Terms of Service and Privacy Policy." You have clicked this approximately 4,000 times in your life. You have read the full document approximately zero times. Nobody blames you — a study found that it would take roughly 76 working days per year to read every privacy policy you encounter. That is more than three months of full-time reading. Just reading.

So what are you actually agreeing to? Often, quite a lot. The right to store your data indefinitely. The right to share it with "trusted partners" (a term so vague it could include literally anyone). The right to use your content for "service improvement" (read: AI training). The right to transfer your data if the company is acquired (which happens all the time in the startup world — your documents could end up owned by a company that did not exist when you uploaded them).

You did not read any of this. You clicked the checkbox. We all did.

The most concerning use case is AI training. Millions of business documents are being used to train document understanding AI models without the document owners' meaningful knowledge or consent. Your invoice, your contract, your medical record — these are valuable training examples for AI companies. "Service improvement" clauses are how they get access to them.

Cloud Storage Is Not Private Storage

Here is a misconception that needs to die: putting something "in the cloud" does not mean it is floating safely in digital space, protected by the angels of encryption.

The cloud is just someone else's computer. More specifically, it is a lot of someone else's computers, in data centers scattered across the globe, managed by teams of people who have access to the systems your files live on.

Most major cloud providers encrypt your data "at rest" (when it is sitting on a disk) and "in transit" (when it is being sent somewhere). This is good. But it is not the same as end-to-end encryption. The provider typically holds the encryption keys, so it can still access your files if it wants to, or if a government asks nicely (or not so nicely, depending on the jurisdiction).

For personal photos and music playlists, this level of privacy is probably fine. For financial documents, medical records, contracts with confidential terms, and business data? You might want to think twice about where that data actually lives and who can see it.

Data residency matters too. If you are a European company uploading documents to a service that stores data in the US, your data is subject to US government access laws, not just EU data protection rules. GDPR restricts such transfers on paper, but your rights become much harder to enforce once the data sits in a jurisdiction without equivalent protections.

The GDPR Made Things Better (Sort Of)

The European Union's General Data Protection Regulation was supposed to fix everything. And to be fair, it helped. Companies now have to tell you what data they collect. You can request deletion. There are actual consequences for violations — we are talking fines in the hundreds of millions.

But here is the thing: GDPR works on the honor system, enforced by regulators who are chronically underfunded and overwhelmed. A small document processing startup in Southeast Asia is not losing sleep over GDPR compliance. And even companies that do comply often do so in the most technically-correct-but-practically-useless way possible. "We deleted your data from our production database!" Great. What about the backup from last Thursday? The log files? The analytics pipeline? The copy that the intern downloaded to their laptop for testing?

Regulation is necessary, but it is not sufficient. You still need to care about where your data goes, because nobody else is going to care as much as you do.

Similarly, security certifications and attestations (ISO 27001, SOC 2, etc.) indicate that a company has security processes in place, not that your specific documents are protected. They show that the company's controls were audited against a standard. They do not mean the company has not made a business decision to use your data for purposes you would find objectionable.

Real-World Consequences of Poor Document Privacy

This might seem like abstract concern until you consider what is actually in the documents most businesses upload to free tools.

Invoices contain: vendor bank account numbers, your own bank account or payment method, VAT/tax ID numbers, detailed purchasing patterns that reveal your suppliers and contract sizes.

Contracts contain: confidential business terms, client and vendor names and relationships, pricing information that is typically kept confidential, intellectual property assignments, non-disclosure obligations (ironic to violate those by uploading the NDA).

Payroll documents contain: employee names, salaries, tax IDs, bank account details for direct deposit.

Medical documents contain: diagnoses, treatment history, insurance information, personal identification.

Any of these, if exposed, causes real harm: competitive damage, regulatory penalties, identity theft risk, loss of client trust. The question is not whether your documents have sensitive information — they almost certainly do. The question is what level of risk you are accepting by uploading them to unknown services.

Five Things You Can Actually Do About It

Alright, enough doom and gloom. Here is the practical part.

First, check the privacy policy. I know, I know — I just said nobody reads them. But you do not have to read the whole thing. Search for keywords like "retain," "store," "share," "third party," and "training." If any of those words appear in a context that makes you uncomfortable, find a different tool.
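If you would rather not squint at the policy yourself, the keyword search above can be sketched in a few lines of Python. This is a rough illustration, not a legal review: the function name `find_red_flags` and the 60-character context window are assumptions made for this example, and a hit only tells you where to start reading.

```python
import re

# Keywords taken from the tip above; extend the list to taste.
KEYWORDS = ["retain", "store", "share", "third party", "training"]


def find_red_flags(policy_text: str, context: int = 60) -> dict[str, list[str]]:
    """Return each keyword that appears, with short snippets of surrounding text."""
    hits: dict[str, list[str]] = {}
    for keyword in KEYWORDS:
        # Let multi-word keywords match across spaces or hyphens ("third-party").
        pattern = r"[\s-]+".join(re.escape(part) for part in keyword.split())
        for match in re.finditer(pattern, policy_text, flags=re.IGNORECASE):
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(policy_text))
            # Collapse whitespace so snippets read cleanly on one line.
            snippet = " ".join(policy_text[start:end].split())
            hits.setdefault(keyword, []).append(snippet)
    return hits
```

Paste the policy text into a string, call `find_red_flags(text)`, and read the snippets. Plain substring matching is deliberately crude ("store" also matches "restore"), so treat the output as a reading list, not a verdict.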

Second, prefer tools that process locally or in-memory. If a service explicitly states that your documents are never stored and are processed in memory only, that is a massive green flag. No storage means no breach, no subpoena, and no surprise data sale.

Third, redact before you upload. If you only need the line items from an invoice, do you really need to upload the page with your bank account number on it? Crop, redact, or mask anything that is not strictly necessary.
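For text you have already copied out of a document, masking can be automated. The sketch below is a minimal example under stated assumptions: the two patterns (an IBAN-shaped string and any run of nine or more digits) are rough guesses chosen for illustration, and the names `IBAN_PATTERN`, `LONG_NUMBER_PATTERN`, and `redact` are hypothetical. Real documents will need broader rules, and a human should still check the result.

```python
import re

# Rough, illustrative patterns; they will miss some formats and
# occasionally over-match (for example, an uppercase word right after an IBAN).
IBAN_PATTERN = re.compile(
    r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
)
# Nine or more digits, optionally separated by single spaces or hyphens.
LONG_NUMBER_PATTERN = re.compile(r"\b(?:\d[ -]?){8,}\d\b")


def redact(text: str) -> str:
    """Mask IBAN-like strings and long digit runs before the text leaves your machine."""
    text = IBAN_PATTERN.sub("[REDACTED IBAN]", text)
    return LONG_NUMBER_PATTERN.sub("[REDACTED NUMBER]", text)
```

The nine-digit threshold is chosen so that ordinary dates and amounts survive while most account numbers do not. Shorter account numbers will slip through, so lower the threshold if your documents need it.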

Fourth, use different tools for different sensitivity levels. That free converter is fine for a restaurant menu you want in text form. It is not fine for your company's quarterly financials. Match the tool's privacy level to the document's sensitivity.

Fifth, check for security headers. This is a nerd move, but it works. Open your browser's developer tools, go to the Network tab, and look at the response headers from the site. If you see Content-Security-Policy, Strict-Transport-Security, and X-Content-Type-Options, the developers at least thought about security. If those headers are missing, the site was probably built in a weekend hackathon and privacy was an afterthought.
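If developer tools are not your thing, the same header check can be scripted with Python's standard library. Treat this as a sketch: the helper names `fetch_headers` and `missing_security_headers` are made up for this example, and some servers refuse HEAD requests, in which case the fetch will raise an error.

```python
from collections.abc import Mapping
from urllib.request import Request, urlopen

# The three headers named in the tip above.
EXPECTED_HEADERS = [
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
]


def missing_security_headers(headers: Mapping[str, str]) -> list[str]:
    """Return the expected headers that are absent (header names are case-insensitive)."""
    present = {name.lower() for name in headers}
    return [name for name in EXPECTED_HEADERS if name.lower() not in present]


def fetch_headers(url: str) -> dict[str, str]:
    """Fetch response headers with a HEAD request, so no page body is downloaded."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return dict(response.headers.items())
```

Calling `missing_security_headers(fetch_headers(url))` on the tool's address gives you the list of absent headers. An empty list is a decent sign, not a guarantee: headers show that someone thought about security, not what happens to your file afterwards.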

Why We Built DocPrivy This Way

Full disclosure: this is the part where we talk about ourselves. But it is relevant, promise.

DocPrivy processes your documents in memory and discards them immediately. We do not store your files. We do not log their contents. We do not use them for training. There is no database of uploaded documents sitting on a server somewhere. When the extraction is done, your data exists in exactly one place: your browser.

We do not require an account. We do not track what you upload. We enforce strict Content Security Policy headers to prevent cross-site attacks. We chose these constraints deliberately, even though they mean we cannot offer features like document history or saved templates.

Is this approach perfect? No. Your document still travels to our server for processing (that is how web-based extraction works). But the window of exposure is measured in seconds, not days or months. And once processing is complete, there is nothing to breach, nothing to subpoena, and nothing to sell.

The bar for document privacy should not be this low, but here we are. At least now you know what to look for.

The Bottom Line

Your data is valuable. Not in a vague, philosophical sense — in a literal, dollars-and-cents sense. Companies build entire business models around collecting, analyzing, and monetizing user data. The document you uploaded "for free" might generate more revenue for the service provider than if you had just paid five dollars for the tool.

The good news is that privacy-first tools exist. The bad news is that they are still the exception, not the rule. Until that changes, the responsibility falls on you to ask the right questions before clicking "upload."

Or you could just keep uploading your tax returns to random websites. Your call. But Dave's Plumbing Services would really prefer if you did not.

