Columbia University has confirmed that a politically motivated intrusion between May and June 2025 resulted in the exfiltration of 460 gigabytes of sensitive data, exposing personal information on 868,969 individuals. The haul included up to 1.8 million Social Security numbers and 2.5 million applicant records, many tied to people who never attended or even applied to the institution.
What Happened
An unidentified attacker, who claimed to be motivated by a desire to expose post-affirmative action admissions practices, infiltrated Columbia University's internal systems and remained active across a multi-week window in mid-2025. Columbia detected the intrusion in late June 2025 but did not begin public notifications until July. The scope of the breach expanded dramatically when forensic teams discovered that a legacy recruitment database, believed to have been purged after 2012, still contained Social Security numbers harvested from third-party testing and scholarship pipelines a decade or more earlier.
What Was Taken
The attacker exfiltrated approximately 460 GB of data spanning:
- Up to 1.8 million Social Security numbers
- Over 2.5 million applicant records
- Dates of birth and contact details
- Financial aid records and academic histories stretching back decades
- Prospect data ingested from the College Board, ACT, and various scholarship programs
Columbia has confirmed 868,969 notifications, and a substantial portion of the recipients are individuals with no formal relationship to the university. Their SSNs were collected through pre-2012 recruitment pipelines in which testing organizations used SSNs as student identifiers and shared prospect lists with universities, relying on consent checkboxes buried in test registration forms.
Why It Matters
This incident is a defining case study in latent data risk. The most damaging records in this breach were not active student files; they were forgotten artifacts of a recruitment ecosystem that no longer exists. Electronic Frontier Foundation technologists have described Columbia's decades-long retention as "really indicting," and the case demonstrates how data minimization failures from the early 2000s continue to generate fresh victims in 2026. For any organization that has acquired data through marketing partners, brokers, or affiliate pipelines over the past 20 years, Columbia's breach is a preview of what discovery looks like during incident response.
The Attack Technique
Columbia has characterized the actor as politically motivated rather than financially driven, which is consistent with the actor's public framing around admissions transparency. The university has not yet disclosed the precise initial access vector or whether credential abuse, exploitation of an internet-facing application, or insider-adjacent access was involved. The 460 GB exfiltration volume and the multi-week dwell time between May and June 2025 suggest staged collection from multiple data stores rather than a single smash-and-grab, indicating the attacker had time to enumerate legacy systems that defenders had effectively forgotten.
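To illustrate why a slow, staged collection can evade per-day volume limits, here is a minimal Python sketch. Every number in it is hypothetical: the 23-day duration, the even 20 GB/day pace, the 50 GB per-day limit, and the 100 GB trailing-window threshold are illustrative assumptions, not details Columbia has disclosed. Only the 460 GB total comes from the reporting. The function name rolling_volume_alerts is likewise invented for this example.

```python
from collections import deque

GB = 10**9

def rolling_volume_alerts(daily_bytes, window_days=14, threshold_bytes=100 * GB):
    """Return the day indices on which the trailing-window outbound total
    exceeds threshold_bytes. All parameters are illustrative assumptions."""
    window = deque()
    total = 0
    alerts = []
    for day, sent in enumerate(daily_bytes):
        window.append(sent)
        total += sent
        # Drop the oldest day once the window is longer than window_days.
        if len(window) > window_days:
            total -= window.popleft()
        if total > threshold_bytes:
            alerts.append(day)
    return alerts

# Hypothetical pacing: 20 GB/day for 23 days adds up to 460 GB.
daily = [20 * GB] * 23

# A naive per-day limit (assumed 50 GB) never fires on this pattern.
per_day_limit = 50 * GB
single_day_hits = [d for d, b in enumerate(daily) if b > per_day_limit]

# A cumulative trailing window flags the same traffic within the first week.
window_hits = rolling_volume_alerts(daily)
```

Under these assumed numbers, the per-day check stays silent for the entire period while the trailing-window check starts firing once the cumulative total passes the threshold. The point is the shape of the comparison, not the specific thresholds, which would need tuning against an organization's own baseline.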
What Organizations Should Do
- Inventory every legacy database, archival store, and decommissioned application that may still hold SSNs or other regulated identifiers, including systems excluded from prior purge initiatives.
- Audit pre-2012 data acquired from third-party prospect pipelines, including testing services, scholarship platforms, and affiliate marketers, and document a legal basis for continued retention or destroy it.
- Deploy data discovery and classification tooling against file shares, backups, and orphaned database instances to surface SSNs and PII that current data maps do not reflect.
- Segment legacy and archival data stores away from production identity and authentication systems so that a single foothold cannot reach decades of historical records.
- Enable database-layer logging and exfiltration detection sized for bulk reads; 460 GB of structured data leaving a perimeter should produce a clear alert signature.
- Pre-stage breach notification workflows that can handle non-customer victim populations, including identity verification, call center capacity, and language for individuals who have no recollection of the relationship.
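As a rough illustration of the data discovery recommendation above, the following Python sketch flags SSN-shaped values in lines of text exported from a file share or database dump. It is a starting point under stated assumptions, not a substitute for dedicated classification tooling: it matches only the dashed NNN-NN-NNNN format, applies the standard structural exclusions (area 000, 666, or 900-999; group 00; serial 0000), and will miss undashed or obfuscated values. The function name find_ssn_candidates and the sample rows are invented for this example.

```python
import re

# Dashed SSN shape with structurally invalid ranges excluded.
# Assumption: only the NNN-NN-NNNN format is of interest here.
SSN_RE = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")

def find_ssn_candidates(lines):
    """Return (line_number, masked_value) pairs for SSN-shaped matches.

    Values are masked to the last four digits so the scan report
    does not become another copy of the sensitive data.
    """
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for match in SSN_RE.finditer(line):
            hits.append((lineno, "***-**-" + match.group()[-4:]))
    return hits

# Hypothetical rows standing in for a legacy export.
sample = [
    "id,name,ssn",
    "1,Test Person,123-45-6789",   # SSN-shaped, flagged
    "2,Bad Area,000-12-3456",      # area 000 is never issued, skipped
    "3,Phone,212-555-0100",        # phone number, wrong digit grouping
]

hits = find_ssn_candidates(sample)
```

In practice the same function could be pointed at each line of files found by walking backups and orphaned shares. Pattern hits are candidates only and still need review, since any nine-digit value in this format will match.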
Sources: Columbia's Data Breach Exposes Hidden Victims Who Never Attended the University - Gadget Review