Chinese researchers appear to have had crucial genetic sequence data from some of the earliest COVID-19 patients in Wuhan deleted from a U.S. database, for reasons that scientists are now trying to determine. A new report on Wednesday by Prof. Jesse Bloom, an American virologist at the Fred Hutchinson Cancer Research Center in Seattle, revealed the extent of the apparent coverup.
The report emerged amid growing scrutiny of the origins of the novel coronavirus, with attention now turning toward a lab-based origin despite a year of widespread dismissal of the possibility.
The “lab-leak hypothesis” was first advanced by Arkansas Sen. Tom Cotton very early on in the pandemic. The Washington Post immediately dismissed the speculation as a “debunked conspiracy theory,” but has since attempted to rewrite the narrative, explaining how the theory “suddenly became credible.”
In a study published on Tuesday titled “Recovery of deleted deep sequencing data sheds more light on the early Wuhan SARS-CoV-2 epidemic,” Bloom explained that he discovered a data set covering 45 samples taken from Wuhan outpatients with suspected COVID-19 early in the outbreak, which he says was deleted from the NIH’s Sequence Read Archive.
“I recover the deleted files from the Google Cloud, and reconstruct partial sequences of 13 early epidemic viruses,” Bloom wrote in the abstract. “Phylogenetic analysis of these sequences in the context of carefully annotated existing data suggests that the Huanan Seafood Market sequences that are the focus of the joint WHO-China report are not fully representative of the viruses in Wuhan early in the epidemic. Instead, the progenitor of known SARS-CoV-2 sequences likely contained three mutations relative to the market viruses that made it more similar to SARS-CoV-2’s bat coronavirus relatives.”
According to the Wall Street Journal on Thursday, the genetic data could have aided in pandemic research. The NIH confirmed that it deleted the sequences following a request from a Chinese researcher who submitted them three months earlier, and said it was standard practice to allow this.
The deletion of the genetic sequences raises concerns that scientists studying the origin of the pandemic may lack access to crucial information that would allow them to determine where the virus came from.
Bloom said that some of the deleted information is still available in a paper that was published in a specialized journal, but scientists typically depend on databases like the one maintained by the NIH for gene sequences. Bloom said he was able to find the missing data after looking for it elsewhere online.
Although the recovered data may not settle the question of the virus’ origins, Bloom said that the removal of the information casts doubt on China’s commitment to “transparency” in the continuing investigation into the origin of the pandemic.
On Twitter, Bloom detailed his discovery:
In a new study, I identify and recover a deleted set of #SARSCoV2 sequences that provide additional information about viruses from the early Wuhan outbreak. Specifically, NIH maintains the Sequence Read Archive, where scientists around world deposit deep sequencing data for others to analyze. I noted peerj.com/articles/9255 lists all #SARSCoV2 data in archive as of March 31, 2020. Most from a project by Wuhan University.
But when I went to Sequence Read Archive, I found entire project was gone! (Note that as detailed below, this does *not* imply malfeasance by NIH. Sequence Read Archive policy allows submitters to delete by e-mail request.) I was able to determine deleted data corresponded to a study that partially sequenced “45 nasopharyngeal samples from [Wuhan] outpatients with suspected COVID-19 early in the epidemic.”
I discovered that even though the files were deleted from archive itself, they could be recovered from the Google Cloud … Using this approach, I recovered files for the 34 early samples that were virus positive. I was able to use the data in the files to reconstruct partial viral sequences (from start of spike to end of ORF10) for 13 of these samples.
Crucially, Bloom says that the emergence of COVID-19 appeared to be somewhat strange, stating “everyone agrees deep ancestors are coronavirus from bats.”
“Therefore, we’d expect the first #SARSCoV2 sequences would be more similar to bat coronaviruses, and as #SARSCoV2 continued to evolve, it would become more divergent from these ancestors,” he added. “But that is not the case!”
“Early Huanan Seafood Market #SARSCoV2 viruses are more different from bat coronaviruses than #SARSCoV2 viruses collected later in China and even other countries.”
“The conundrum is easily seen by plotting the relative differences from the bat coronavirus RaTG13 outgroup versus collection date for early #SARSCoV2. If we include those sequences, and note 4 sequences from Guangdong are from two groups of people infected in Wuhan in late Dec / early Jan, we get plausible scenarios that resolve above problems. These two scenarios are plotted below. Each has a different ‘progenitor,’ which is the sequence that gave rise to all currently known #SARSCoV2 sequences …”
“Both progenitors suggest #SARSCoV2 was circulating in Wuhan before December outbreak at Huanan Seafood Market, which is corroborated by lots of other evidence, including news articles from China in early 2020. … There are also broader implications. First, fact this dataset was deleted should make us skeptical that all other relevant early Wuhan sequences have been shared,” he concluded.
Bloom noted that sharing genetic sequences with the scientific community is further hampered by China’s requirement that all scientists receive approval from the State Council before publishing their work.