References
1. National Human Genome Research Institute. DNA Sequencing Costs: Data.
https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data.
2. National Human Genome Research Institute. Genomic Data Science Fact Sheet.
https://www.genome.gov/about-genomics/fact-sheets/Genomic-Data-Science (2022).
3. Hernaez, M., Pavlichin, D., Weissman, T. & Ochoa, I. Genomic Data Compression.
Annu. Rev. Biomed. Data Sci. 2, (2019).
4. Koboldt, D. C. Best practices for variant calling in clinical sequencing. Genome Med. 12,
91 (2020).
5. Sheng, Q. et al. Multi-perspective quality control of Illumina RNA sequencing data
analysis. Brief. Funct. Genomics 16, 194–204 (2017).
6. Gailly, J. & Adler, M. gzip. https://www.gzip.org/.
7. Huffman, D. A Method for the Construction of Minimum-Redundancy Codes. Proc. IRE
40, 1098–1101 (1952).
8. Lempel, A. & Ziv, J. On the Complexity of Finite Sequences.
9. Kryukov, K., Jin, L. & Nakagawa, S. Efficient compression of SARS-CoV-2 genome data
using Nucleotide Archival Format. Patterns (N Y) 3, 100562 (2022).
10. Roguski, L., Ochoa, I., Hernaez, M. & Deorowicz, S. FaStore: a space-saving solution for
raw sequencing data. Bioinformatics 34, 2748–2756 (2018).
11. Bonfield, J. K. & Mahoney, M. V. Compression of FASTQ and SAM format sequencing
data. PLoS ONE 8, e59190 (2013).
12. Benoit, G. et al. Reference-free compression of high throughput sequencing data with a
probabilistic de Bruijn graph. BMC Bioinformatics 16, 288 (2015).
13. Hach, F., Numanagic, I., Alkan, C. & Sahinalp, S. C. SCALCE: boosting sequence
compression algorithms using locally consistent encoding. Bioinformatics 28, 3051–3057
(2012).
14. Chandak, S., Tatwawadi, K., Ochoa, I., Hernaez, M. & Weissman, T. SPRING: a
next-generation compressor for FASTQ data. Bioinformatics 35, 2674–2676 (2019).
15. Deorowicz, S. FQSqueezer: k-mer-based compression of sequencing data. Sci. Rep. 10,
578 (2020).
16. Petagene. Petagene. https://www.petagene.com/.
17. Lan, D., Tobler, R., Souilmi, Y. & Llamas, B. Genozip: a universal extensible genomic
data compressor. Bioinformatics 37, 2225–2230 (2021).
18. Kokot, M., Gudyś, A., Li, H. & Deorowicz, S. CoLoRd: compressing long reads. Nat.
Methods 19, 441–444 (2022).
19. Chen, S. et al. Efficient sequencing data compression and FPGA acceleration based on
a two-step framework. Front. Genet. 14, 1260531 (2023).
20. El Allali, A. & Arshad, M. MZPAQ: a FASTQ data compression tool. Source Code Biol.
Med. 14, 3 (2019).
21. Chandak, S., Tatwawadi, K. & Weissman, T. Compression of genomic sequencing reads
via hash-based reordering: algorithm and analysis. Bioinformatics 34, 558–567 (2018).
22. Rivest, R. The MD5 Message-Digest Algorithm. https://www.ietf.org/rfc/rfc1321.txt
(1992).
23. Cock, P. J. A., Fields, C. J., Goto, N., Heuer, M. L. & Rice, P. M. The Sanger FASTQ file
format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.
Nucleic Acids Res. 38, 1767–1771 (2010).
24. Grebnov, I. High performance data compression library. http://libbsc.com/.
25. Chandak, S. Spring v1.1.1. https://github.com/shubhamchandak94/Spring/tree/v1.1.1.
bioRxiv preprint doi: https://doi.org/10.1101/2024.03.21.586111; this version posted March 25, 2024.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
It is made available under a CC-BY-NC-ND 4.0 International license.