Anna's Blog: Backing up Spotify

In December 2025, Anna’s Archive, the world’s largest “shadow library,” shook the internet by announcing its latest conquest: a massive, nearly complete backup of Spotify.

Known primarily for preserving books and academic papers, the team at Anna’s Archive has officially expanded its mission of “preserving humanity’s knowledge and culture” into the realm of music. Here is a breakdown of what this backup includes and why it’s a milestone for digital preservation.

The Scale: 300TB of Human Heritage

The numbers behind the “Spotify Scrape” are staggering. The archive isn’t just a list of songs; it’s a meticulously organized database designed for long-term survival:

  • 256 Million Tracks: Metadata for 99.9% of all tracks on the platform.
  • 86 Million Music Files: Roughly 300 terabytes of audio, covering 99.6% of all listens on the service.
  • The Metadata King: It is now the largest publicly available music metadata database in existence, containing 186 million unique ISRCs (for comparison, MusicBrainz has around 5 million).

Why Backup Spotify?

The volunteer behind the post (identified as “ez”) explained that while music is generally “well-preserved” by enthusiasts, existing efforts have three major flaws:

  1. Popularity Bias: Most archives only focus on top-tier artists, leaving a “long tail” of obscure music at risk of disappearing.
  2. Storage Bloat: Audiophiles chase lossless FLAC files, which makes archiving everything prohibitively expensive in terms of storage.
  3. No Central Authority: There was previously no single, open-source list aiming to represent “all music ever produced.”

How the Archive is Structured

The release is divided into two parts to make it manageable for “data hoarders” and preservationists:

  • The Metadata (SQLite): A searchable database containing artists, albums, and tracks. It’s even designed to allow a “True Shuffle” (something Spotify users have complained about for years), letting you shuffle every single song in the archive without algorithmic bias.
  • The Files: Distributed in bulk torrents via the “Anna’s Archive Containers” (AAC) format.
    • High Popularity: Original OGG Vorbis quality (160kbit/s) with metadata added.
    • Zero Popularity: Re-encoded to OGG Opus (75kbit/s) to save space while maintaining decent quality, focusing on the millions of songs that get almost no streams.
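Because the metadata ships as a SQLite database, a “true shuffle” reduces to a random-order query. A minimal sketch using the sqlite3 CLI, with a toy `tracks` table standing in for the real dump (the actual schema may differ):

```shell
# Build a toy database standing in for the real metadata dump
sqlite3 demo.db "CREATE TABLE tracks (id INTEGER PRIMARY KEY, title TEXT);
INSERT INTO tracks (title) VALUES ('Song A'), ('Song B'), ('Song C');"

# A 'true shuffle': every row has equal probability, no algorithm involved
sqlite3 demo.db "SELECT title FROM tracks ORDER BY RANDOM() LIMIT 1;"
```

`ORDER BY RANDOM()` is expensive on 256 million rows, but for an offline archive there is no recommendation engine to bias the draw.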

The “Preservation” Ideology

Anna’s Archive frames this as a safeguard against “natural disasters, wars, budget cuts, and other catastrophes.” By making the archive “fully open,” they ensure that anyone with enough disk space can mirror the entire collection, making it nearly impossible for authorities or corporate entities to delete the history of digital music.

What’s Next?

While the project is currently a torrents-only archive intended for bulk backup, the blog post hints that if there is enough interest, they may eventually add functionality to download individual songs directly from the Anna’s Archive website.

Link: Backing up Spotify - Anna’s Blog


Disclaimer: Anna’s Archive operates in a legal gray area (and is often classified as a piracy site). While this backup is a landmark for digital preservation, it involves significant copyright complexities.

To manage 300 TB on Linux, you move away from desktop tools and into Enterprise Storage Administration. The standard choice for this is ZFS on Linux (ZoL).

Here is the architectural blueprint for setting up, managing, and backing up a 300 TB pool.


1. The Hardware Preparation

At this scale, you need a specialized HBA (Host Bus Adapter) like an LSI 9300-8i in “IT Mode” (Initiator Target). This allows Linux to see the drives directly without a hardware RAID controller interfering.

2. Implementation: The ZFS Pool

Do not create one giant RAID group. If you lose more than your parity allows, you lose 300 TB. Instead, use vdevs (Virtual Devices).

The Layout:
A pool made of two vdevs, each containing 10 drives in RAID-Z3 (triple parity).

  • Total Drives: 20 x 22 TB.
  • Redundancy: You can lose any 3 drives per group (6 total) without data loss.
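As a sanity check on the layout, shell arithmetic gives the raw usable capacity (before ZFS metadata overhead and the free-space headroom a pool needs to perform well):

```shell
VDEVS=2; DRIVES_PER_VDEV=10; PARITY=3; DRIVE_TB=22

# Usable TB = vdevs * (data drives per vdev) * drive size
echo $(( VDEVS * (DRIVES_PER_VDEV - PARITY) * DRIVE_TB ))   # → 308
```

308 TB raw usable is tight for a 300 TB archive once overhead is counted, so in practice you would want larger drives or a third vdev for comfortable margin.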

Command to create the pool:

# Install ZFS
sudo apt install zfsutils-linux

# Create the pool 'bigdata' with two RAID-Z3 vdevs
# ashift=12 aligns ZFS to the 4K physical sectors of modern large drives
sudo zpool create -o ashift=12 bigdata \
  raidz3 /dev/disk/by-id/drive1 /dev/disk/by-id/drive2 ... /dev/disk/by-id/drive10 \
  raidz3 /dev/disk/by-id/drive11 /dev/disk/by-id/drive12 ... /dev/disk/by-id/drive20

Note: Always use /dev/disk/by-id/ rather than /dev/sda to prevent drive letter shifting.


3. Backup Strategy: The “Snapshot” Method

On Linux, the fastest way to back up 300 TB is ZFS Send/Receive. It works at the block level: after an initial full send, incremental sends transfer only the blocks that changed between snapshots, not the whole 300 TB.

To a local backup server:

# Create a snapshot
zfs snapshot bigdata@backup_2026-01-16

# Send the snapshot to a second server over the network
zfs send bigdata@backup_2026-01-16 | ssh backup-server zfs recv backup_pool/bigdata
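The full send above only needs to happen once. Subsequent backups use an incremental send (`zfs send -i`), which ships just the delta between two snapshots; the snapshot names below are illustrative:

```shell
# Next day: take a new snapshot
zfs snapshot bigdata@backup_2026-01-17

# Send only the blocks changed since the previous snapshot
zfs send -i bigdata@backup_2026-01-16 bigdata@backup_2026-01-17 \
  | ssh backup-server zfs recv backup_pool/bigdata
```

For a mostly write-once archive like this, the daily delta is tiny, so incremental sends complete in minutes rather than days.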


4. Restoration: The “Resilver”

If a drive fails, you replace it and “resilver” (rebuild).

# Check status
zpool status

# Replace the failed drive
zpool replace bigdata /dev/disk/by-id/OLD_ID /dev/disk/by-id/NEW_ID

Warning: Rebuilding 300 TB can take days. During this time, the other drives are under heavy stress. This is why RAID-Z3 is mandatory at this scale; a second or third drive failing during rebuild is a real risk.


5. Automated Monitoring (Crucial)

You cannot “manually” watch 300 TB. You need a monitoring stack:

  1. ZED (ZFS Event Daemon): Configure it to email you immediately if a checksum error occurs.
  2. Scrubbing: Schedule a monthly “scrub” to verify data integrity:
    zpool scrub bigdata
  3. Prometheus + Grafana: To visualize disk temperatures and IOPS.
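The first two items can be wired up in a few lines. The email address is a placeholder; `ZED_EMAIL_ADDR` and `ZED_NOTIFY_VERBOSE` are stock settings in /etc/zfs/zed.d/zed.rc on most distributions:

```shell
# /etc/zfs/zed.d/zed.rc -- have ZED mail you on every fault/checksum event
ZED_EMAIL_ADDR="admin@example.com"
ZED_NOTIFY_VERBOSE=1

# root crontab -- scrub on the 1st of each month at 03:00
# 0 3 1 * * /sbin/zpool scrub bigdata
```

A monthly scrub of 300 TB itself takes a day or more, so schedule it for a low-traffic window.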

Summary Checklist for Linux 300 TB

Layer          Tool            Purpose
Kernel         ZFS on Linux    Data integrity and volume management.
Network        NFS / Samba     Sharing the 300 TB to your network.
Transfer       rclone          Moving data to/from cloud (S3/Backblaze).
Verification   sha256sum       Verifying Anna’s Archive hashes after download.
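The verification step is plain coreutils. Given a checksum manifest (the filenames below are hypothetical stand-ins for the real archive files), `sha256sum -c` checks every listed file and reports mismatches:

```shell
# Simulate a downloaded file plus its published checksum manifest
printf 'audio payload' > track.ogg
sha256sum track.ogg > checksums.sha256

# Verify: prints 'track.ogg: OK' and exits 0 on success
sha256sum -c checksums.sha256
```

Run this over every torrent payload before trusting it; a single flipped bit in transit will show up as `FAILED`.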

Important Note for Anna’s Archive:
Since the Spotify backup is distributed via BitTorrent, you should use transmission-daemon or qbittorrent-nox on your Linux server. Use the sparse file setting so it doesn’t try to allocate all 300 TB of space instantly before the data arrives.
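The effect of sparse allocation is easy to demonstrate with coreutils: `truncate` creates a file whose apparent size is large but which occupies almost no disk blocks, which is how a sparse-mode torrent client behaves before the data arrives (in transmission-daemon this corresponds to the "preallocation" setting in settings.json):

```shell
# Create a 1 GiB sparse file: the apparent size is a full gibibyte...
truncate -s 1G placeholder.bin
du --apparent-size -h placeholder.bin   # → 1.0G

# ...but almost no disk blocks are actually allocated yet
du -h placeholder.bin
```

Blocks are only allocated as pieces are written, so the pool fills gradually instead of being claimed all at once by 300 TB of empty placeholders.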