Name Uploaded Size
3dbeacon.tar.index Tue, 18 Mar 2025 09:52:53 GMT 8.0 MB
3dbeacon.tar Tue, 18 Mar 2025 09:53:00 GMT 340.0 MB
bfvd.tar.gz Tue, 18 Mar 2025 09:57:17 GMT 8.5 GB
bfvd.version Tue, 18 Mar 2025 10:02:49 GMT 11 B
bfvd_foldcompdb.tar.gz Tue, 18 Mar 2025 09:53:20 GMT 982.9 MB
bfvd_foldseekdb.tar.gz Tue, 18 Mar 2025 09:53:08 GMT 533.8 MB
bfvd_indexed.tar.index Tue, 18 Mar 2025 09:52:53 GMT 9.1 MB
bfvd_indexed.tar Tue, 18 Mar 2025 09:58:09 GMT 8.7 GB
bfvd_metadata.tsv Tue, 18 Mar 2025 09:53:01 GMT 14.6 MB
bfvd_taxid.tsv Tue, 18 Mar 2025 09:53:13 GMT 24.5 MB
bfvd_taxid_rank_scientificname_lineage.tsv Tue, 18 Mar 2025 09:53:16 GMT 83.8 MB
cif.tar.index Tue, 18 Mar 2025 11:36:27 GMT 9.2 MB
cif.tar Tue, 18 Mar 2025 11:38:59 GMT 10.9 GB
msa.tar.index Tue, 18 Mar 2025 09:53:25 GMT 9.1 MB
msa.tar Tue, 18 Mar 2025 10:01:37 GMT 21.1 GB
pae.tar.index Wed, 26 Mar 2025 05:38:46 GMT 9.5 MB
pae.tar Wed, 26 Mar 2025 06:48:54 GMT 50.6 GB
uniref30_2302_virus-rep_mem.tsv Tue, 18 Mar 2025 09:53:39 GMT 60.8 MB
archived/2023_02_v1
archived/2023_02_v2

Readme

The Big Fantastic Virus Database (BFVD) is a repository of 351,242 protein structures predicted by applying ColabFold to the viral sequence representatives of the UniRef30 clusters. BFVD holds a unique repertoire of protein structures, spanning major viral clades.

Kim R, Levy Karin E, Steinegger M. BFVD - a large repository of predicted viral protein structures. Nucleic Acids Research (2024). doi.org/10.1093/nar/gkae1119


Updates

  • 2024-09-04: First distribution of BFVD.
  • 2024-11-01 (2023_02_v1): 175,454 Base-MSA & 175,788 Base+Logan-MSA
    Of the 351,242 BFVD entries initially predicted with a base multiple sequence alignment (base-MSA), 175,788 lacked detectable homologs. For these, we augmented the alignments using Logan’s large-scale assemblies, reinforcing about half of all BFVD entries.
  • 2025-03-17 (2023_02_v2): 205,681 Base-MSA & 37,296 Base+Logan-MSA & 108,265 Base+Logan-MSA + 12-recycles
    Of the 351,242 BFVD entries, 175,788 lacked identifiable homologs in their base MSAs. The remaining entries, which had sufficient homologs, were left unchanged. For the entries with insufficient homologs, we augmented the alignments using Logan-based data and additionally ran predictions with 12 recycles. This produced three candidate models for each affected entry: (1) Base-MSA, (2) Base+Logan-MSA, and (3) Base+Logan-MSA with 12 recycles. Finally, we kept the best-scoring model (by pLDDT) for each entry.
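
The final selection step for 2023_02_v2 (keep the candidate with the highest pLDDT) can be sketched as below. This is a minimal illustration and not the BFVD pipeline itself: the function name pick_best_model, the example scores, and the tie-breaking rule are assumptions made for the example. The version labels are the ones listed for the version column of bfvd_metadata.tsv.

```python
from typing import List, Tuple

# Version labels as listed for the `version` column of bfvd_metadata.tsv.
VERSIONS = ("BASE", "BASE+LOGAN", "BASE+LOGAN+12CY")


def pick_best_model(candidates: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Return the (version, avg_pLDDT) pair with the highest pLDDT.

    Hypothetical helper: on a tie, the earliest candidate in the list wins
    (an assumption, the source does not describe tie-breaking).
    """
    if not candidates:
        raise ValueError("at least one candidate model is required")
    for version, _ in candidates:
        if version not in VERSIONS:
            raise ValueError(f"unknown version label: {version}")
    # max() returns the first maximal element, which gives the tie rule above.
    return max(candidates, key=lambda c: c[1])


# Made-up scores for one affected entry, for illustration only.
example = [("BASE", 48.2), ("BASE+LOGAN", 61.7), ("BASE+LOGAN+12CY", 73.4)]
print(pick_best_model(example))  # ('BASE+LOGAN+12CY', 73.4)
```

Entries that already had sufficient homologs have a single Base-MSA candidate, so the same function returns that model unchanged.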

Availability

  • BFVD is browsable by UniProt accession through the website
  • BFVD is searchable through the Foldseek webserver
  • Scripts for BFVD analyses are available on Zenodo
  • PDB files of BFVD are also available on Zenodo

Data description

1-bfvd.tar.gz: 351,242 predicted structures of BFVD.
2-bfvd.version: version file.
3-bfvd_foldcompdb.tar.gz: Compressed version of the Foldseek database, created with Foldcomp.
Only the 347,481 structures that are not discontinuous are included.
4-bfvd_foldseekdb.tar.gz: Foldseek database of the 351,242 predicted structures of BFVD.
5-bfvd_metadata.tsv: General information about each model.

  1. UniRef100: UniRef100 identifier of the sequence
  2. model: File name of the predicted protein structure
  3. avg_pLDDT: Average pLDDT score of the predicted protein structure
  4. pTM: pTM score of the predicted protein structure
  5. splitted: Whether the protein sequence of the UniRef100 entry was split into multiple models
    Protein sequences longer than 1,500 residues were split. (0 = not split, 1 = split)
  6. version: Specifies the MSA/refinement pipeline used to produce the final BFVD structure. (BASE, BASE+LOGAN, or BASE+LOGAN+12CY)
6-msa.tar: MSAs for each BFVD entry.
7-bfvd_taxid.tsv: BFVD entries and their taxonomic identifiers.
  1. model: File name of the BFVD model.
  2. taxId: Taxonomy identifier of the protein.
    The protein ID (the portion of model before the first underscore) was used to retrieve the taxonomy ID.
8-bfvd_taxid_rank_scientificname_lineage.tsv: BFVD entries and their taxonomic information.
  1. model: File name of the BFVD model.
  2. taxId: Taxonomy identifier of the protein.
    The protein ID (the portion of model before the first underscore) was used to retrieve the taxonomy ID.
  3. rank: Taxonomic rank of the taxonomy identifier.
  4. scientific name: Scientific name corresponding to the taxonomy identifier.
  5. lineage: Taxonomic lineage of the taxonomy identifier.
9-uniref30_2302_virus-rep_mem.tsv: UniRef30 virus clusters.
  1. repId: Cluster representatives used for structure prediction
  2. memId: Member corresponding to the representative
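
As a rough guide to working with the column layouts above, the sketch below reads rows in the bfvd_metadata.tsv column order and derives the protein ID from the model file name using the "portion before the first underscore" rule given for the taxonomy files. It assumes the file is tab-separated with no header row, which the description does not state. The names read_metadata and protein_id and the sample row are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterator, TextIO, Union

# Column order as listed for bfvd_metadata.tsv. Assumption: the file is
# tab-separated and has no header row; adjust if your copy differs.
METADATA_COLUMNS = ["UniRef100", "model", "avg_pLDDT", "pTM", "splitted", "version"]

Record = Dict[str, Union[str, float, bool]]


def protein_id(model: str) -> str:
    """Return the portion of a model file name before the first underscore."""
    return model.split("_", 1)[0]


def read_metadata(handle: TextIO) -> Iterator[Record]:
    """Yield one typed record per non-empty metadata row."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != len(METADATA_COLUMNS):
            raise ValueError(
                f"expected {len(METADATA_COLUMNS)} columns, got {len(row)}"
            )
        record: Record = dict(zip(METADATA_COLUMNS, row))
        record["avg_pLDDT"] = float(row[2])
        record["pTM"] = float(row[3])
        record["splitted"] = row[4] == "1"  # 0 = not split, 1 = split
        yield record


# Made-up row for illustration only; not a real BFVD entry.
sample = "UniRef100_X0X0X0\tX0X0X0_example_model.pdb\t85.3\t0.71\t0\tBASE\n"
for rec in read_metadata(io.StringIO(sample)):
    print(protein_id(str(rec["model"])), rec["avg_pLDDT"], rec["version"])
```

To read the real file, pass an open file handle, for example open("bfvd_metadata.tsv", newline=""), instead of the in-memory sample.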

License

All files are available under a Creative Commons Attribution 4.0 International License.