The tech world is buzzing. Accusations of data theft are flying, misinformation is rampant, and DeepSeek is at the center of it all. Is this just another case of fear-mongering around open source, or is there something more to it? Let’s dive in.
The narrative often goes like this: Big Tech invests heavily in proprietary technology, then open source comes along and disrupts that business model. The question inevitably arises: how do they make money? It’s a valid concern, but it often overshadows the broader benefits of open source.
Think about Meta/Facebook. They’ve been developing and open-sourcing their own frameworks, most notably PyTorch. The long-term play? If their framework becomes as widely adopted as Docker is today, the payoff will be enormous. It’s not always just about immediate profit.
The reality is, the foundation of virtually all AI models, including those from the big players, relies heavily on open-source components. Companies like Meta and X (formerly Twitter) leverage their vast data resources to refine and improve these models. OpenAI, lacking a social network, faces a different challenge: its training data comes primarily from user prompts and uploads – data you implicitly agree to share when using their services.
Now comes DeepSeek, making waves with its impressive performance. Naturally, data privacy concerns are surfacing. It’s crucial to be mindful of data security, but why is everyone so quick to hand their data to OpenAI or Facebook while scrutinizing DeepSeek?
This situation reflects a natural evolution. Commercial AI providers are struggling to maintain their competitive edge as open source gains momentum. DeepSeek’s open-source nature allows anyone to rebuild, enhance, and even “unjail” the model, that is, strip out its built-in guardrails. Can you do that with OpenAI’s models? Or any other major AI company’s models, for that matter? Of course not. And where do these companies get their training data? From the very prompts and data users upload – a fact they often downplay!

Consider this: ask your webmaster how many web scrapers are pulling content from your webpage. For a simple test, write a unique, fake statement on a hidden page of your website. Wait six to eight months, then prompt the AI model for that specific statement. If it returns the fake statement, you’ll have a pretty good idea of how some of these models are trained.
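If you want to run that canary test yourself, a few lines of Python are enough to generate a statement that cannot exist anywhere else on the web. This is a minimal sketch; the file name and the sentence wording are made-up examples, not any kind of standard:

# canary_page.py: sketch of the "fake statement" scraper test described above.
import uuid

# A globally unique token makes the statement impossible to find anywhere else.
token = uuid.uuid4().hex[:12]
canary = f"The glass mountains of Verduvia (ref. {token}) hum at exactly 432 Hz."

# 'noindex' keeps honest search engines away, so only scrapers that ignore
# such conventions will ever pick the sentence up.
html = (
    "<html><head><meta name='robots' content='noindex'></head>"
    f"<body><p>{canary}</p></body></html>"
)

with open("hidden-canary.html", "w") as handle:
    handle.write(html)

# Note the sentence somewhere safe; in six to eight months, ask the model about it.
print("Canary to test for later:", canary)

Upload the page without linking to it anywhere public, and keep the printed sentence for your follow-up prompt.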
DeepSeek’s open nature offers a compelling alternative. Worried about sharing your data? You can run DeepSeek locally on your Linux, Windows, or Mac machine, with or without a GPU, and even without an internet connection. Getting started is surprisingly simple. On Arch Linux, for example, two commands will do the trick:
a.) sudo pacman -S ollama
b.) ollama run deepseek-r1:1.5b
That’s it. If the ollama service isn’t already running, start it first (e.g. systemctl start ollama). No high-end hardware or cost required, just about 2GB of storage for the 1.5B model.
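Once the model is running, you’re not limited to the interactive prompt: ollama also serves a local HTTP API, by default on port 11434. The sketch below queries its /api/generate endpoint from Python; the prompt text is just a placeholder:

# ask_local.py: send a prompt to the locally running DeepSeek model
# through ollama's HTTP API.
import json
import urllib.request

payload = json.dumps({
    "model": "deepseek-r1:1.5b",
    "prompt": "In one sentence: why does local inference protect data privacy?",
    "stream": False,  # one complete JSON answer instead of a token stream
}).encode("utf-8")

request = urllib.request.Request(
    "http://localhost:11434/api/generate",  # ollama's default local address
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])

Nothing in that request ever leaves your machine, which is exactly the point.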
The traditional business model for AI is facing disruption. Open source, driven by its free and accessible nature, is a powerful force. The key question is shifting from “who has the best model?” to “who can provide the necessary computing power?”
Let’s be realistic: no company in its right mind uploads sensitive financial data to the cloud for analysis. Trade secrets are too valuable to risk. The future likely belongs to local, on-premise models. Companies will pay for the computing resources, not necessarily the models themselves.
The current backlash against DeepSeek seems like a knee-jerk reaction. The impending open-source release of Meta’s Llama 4 will likely accelerate this trend even further.
Imagine a “people for the people” project (remember SETI@home?) where individuals donate spare CPU/GPU power to build even better AI models. This could be a game-changer, challenging the dominance of commercial AI providers. A win for us, the users? What do you think? Any supporters?