1 GeneSqueeze: A Novel Lossless, Reference-Free Compression

GeneSqueeze: A Novel Lossless, Reference-Free Compression Algorithm for FASTQ/A

Files

Foad Nazari

, Sneh Patel

, Melissa LaRocca

, Ryan Czarny

, Giana Schena

, Emma K. Murray

Rajant Health Incorporated, Malvern, PA, United States. email: [email protected]

Correspondence:

Emma Murray, PhD

Senior Vice President of Operations

Rajant Health Incorporated

200 Chesterfield Parkway

Malvern, PA 19355

rajanthealth.com

Trovomics Division

trovomics.com

[email protected]

Keywords:

Genomic data compression, FASTQ, FASTA, Lossless compression, k-mer sequence, DNA,

RNA, Next-Generation Sequencing, Storage

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

Abstract

As sequencing becomes more accessible, there is an acute need for novel compression

methods to efficiently store this data. Omics technologies can enhance biomedical research and

individualize patient care, but they demand immense storage capabilities, especially when

applied to longitudinal studies. Addressing the storage challenges posed by these technologies

is crucial for omics technologies to achieve their full potential. We present a novel lossless,

reference-free compression algorithm, GeneSqueeze, that leverages the patterns inherent in

the underlying components of FASTQ files (i.e., nucleotide sequences, quality scores and read

identifiers). GeneSqueeze provides several benefits, including an auto-tuning compression

protocol based on each sample's distribution, lossless preservation of IUPAC nucleotides and

read identifiers, and unrestricted FASTQ/A file attributes (i.e., read length, read depth, or read

identifier format). We compared GeneSqueeze to the general-purpose compressor, gzip, and to

the domain-specific compressor, SPRING. GeneSqueeze achieved up to three times higher

compression ratios as compared to gzip, regardless of read length, read depth, or file size.

GeneSqueeze achieved 100% lossless compression, with the original and decompressed files

perfectly matching for all tested samples, preserving read identifiers, quality scores, and IUPAC

nucleotides, in contrast to SPRING. Overall, GeneSqueeze represents a competitive and

specialized compression method optimized for FASTQ/A files containing nucleotide sequences

that has the potential to significantly reduce the storage and transmission costs associated with

large omics datasets without sacrificing data integrity.

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

Introduction

The field of genomics, due to advancements in high-throughput sequencing technologies, has

experienced a revolution in the past two decades, with almost a million-fold reduction (from 1

billion to hundreds of dollars) in the cost of sequencing.

The increase in accessibility has led to

an increase in the quantity of omics data, with estimations predicting that 2–40 exabytes of data

will be generated within the next decade.

Unfortunately, storage technology is advancing at a

much slower pace, leading to critical technical and economic bottlenecks for omics-based

discovery and their applications in biomedical or clinical research. Omics data is essential in

personalized medicine which relies on accurate omics data to extract biomarkers or signatures

that can individualize a patient’s preventative measures, diagnoses, treatments, and

monitoring.

To be practically used in patient care, omics data needs to be stored, retrieved, and

transmitted in a manner that is cost-effective, time-efficient, and lossless for analysis and

dissemination. Therefore, specialized compressors for omics data are needed to continue to

support advancements in medicine.

Nucleotide-based omics data (i.e., genomics, transcriptomics, epigenomics) are commonly

stored in FASTQ or FASTA text-based formats. FASTA is a simple format that stores only the

read identifier lines followed by nucleotide or amino acid sequences. FASTQ files are more

complex, storing information on nucleotide sequences as well as the quality scores for each

base. The addition of quality scores makes FASTQ the preferred format for sequencing

platforms (e.g., Illumina) and downstream applications as they reflect the level of confidence in

the read of each base in a sequence.

Quality scores are critical for analyses such as variant

detection, where the accuracy and reliability of the base calls are foundational.

4,5

Both FASTA

and FASTQ files can be large, particularly for high-coverage sequencing data, which makes

them difficult to store and process efficiently. Therefore, FASTQ/A files are routinely

compressed for more practical storage, management, and transmission.

Presently, biologists are employing general-purpose compression tools. These tools are not

tailored specifically for biological data and thus do not fully leverage the inherent repetitive

characteristics of FASTQ files, leading to lower compression efficiency. Gzip

, a general-

purpose algorithm that combines Huffman encoding

and LZ77

to create a dictionary tailored to

the frequency of word repetitions in the data, performs relatively poorly on genomic data

compression. Despite the inefficiencies, gzip remains as the de facto standard in the biology

domain due to its stability and popularity. Gzipped files are accepted as input by various

sequencing analysis tools and are used by public repositories of genomic data.

Alternatively,

domain-specific algorithms that leverage the redundancy in FASTQ files, and explicitly utilize

the underlying structures present in the nucleotide reads, quality scores and read identifiers of

genomic data, have shown promise in genomic data compression, achieving high compression

ratios while maintaining high accuracy for downstream analysis.

These domain-specific FASTQ compressors utilize underlying compression methodologies such

as classical, read-mapping-based, read-assembly-based, or read-reordering-based.

Among

these, classical compressors are notably less efficient. Unlike the other three categories, which

necessitate auxiliary information such as a reference sequence or preprocessing steps like

sequencing reordering prior to compression, classical compression methods operate solely on

reads in their original form and lack the ability to leverage read redundancies.

Read-mapping-

based methods map nucleotide sequences to a provided reference genome, and encode the

alignment position and the probable mismatches.

10,11

Read-assembly-based compressors

encode nucleotide sequences by using built-in long contiguous sequences (contigs), as

opposed to a reference genome.

10,12

Alternatively, read-reordering-based compression methods

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

are reference-free and involve arranging overlapping nucleotide sequences near each other, to

encode reads more efficently.

10,13

Various domain-specific genomic data compressors such as FaStore

, SPRING

FQSqueezer

, PetaGene

, Genozip

, ColoRd

, repaq

, MZPAQ

have shown significant

promise in addressing data size challenges within the domain. In our study we compare our

algorithm to the domain specific compressor, SPRING, a compressor which combines read-

assembly-based and read-reordering-based methods. SPRING is based on the HARC algorithm

and supports pair-preserving compression, lossless compression (with the exception of certain

cases such as FASTQ/A files containing rare IUPAC nucleotides or specific read identifier

formats), and lossy compression of quality values, among other features.

14,21

Unfortunately,

each of the existing domain-specific compression methods, including SPRING, still have

disadvantages, including poor compression ratio, lossy compression, nucleotide sequence -

quality score-encoding dependency, limitations on long reads, poor performance for nucleotide

sequences that include N bases, lossiness for non-ACTGN IUPAC bases, and/or rigid

protocols.

With the varying weaknesses of existing compression methods, there is a need for efficient,

domain-specific methods for next-generation sequencing (NGS) data compression. We present

a novel reference-free compressor, GeneSqueeze, which uses read-reordering-based

compression methodology. Notably, the reordering mechanism within GeneSqueeze is

temporally confined to specific algorithmic blocks, facilitating targeted operations on the

reordered sequences. Upon completion of these operations, the residual sequences are

reconstituted to their original order, thereby preserving losslessness without necessitating the

retention of the original index. GeneSqueeze presents a dynamic protocol for maximally

compressing diverse FASTQ/A files containing any IUPAC nucleotides while maintaining

complete data integrity of all components, verified by MD5

matching.

Methods

FASTQ/A format

The representation of data obtained from a sequencing experiment follows the FASTQ format

introduced by Cock et al.

Each read and its corresponding metadata are encapsulated in a

block of four lines. The read identifier is present on the first line and provides information on the

sequencing run. The second line contains the nucleotide sequences (A, C, T, G, N) that are

called by the sequencer. In most FASTQ/A files, among non-ACGT (irregular) bases, only N is

common, and the other irregular nucleotides identified in IUPAC standards presented in Table

1. are rarely observed. The fourth line contains the quality score for each nucleotide in the read,

which indicate the probability that each base in the read is incorrect. The quality scores are

encoded in ASCII using Phred scores.

The third line contains the symbol '+' to separate the

nucleotide sequences and the quality values, and may also include metadata or comments. An

example of the first eight lines of a FASTQ file is shown in Fig. 1.

GeneSqueeze Algorithm

In this section, the proposed algorithm for compression of FASTQ/A data is described. The

algorithm separates the nucleotide, quality score, and read identifier sequences and then

passes each sequence type through a specific branch of the GeneSqueeze algorithm, as

presented in the flow diagram in Fig. 2. The way this algorithm both branches and blocks

function is explained in this section.

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

FASTQ/A Reader

The FASTQ/A Reader block loads and reads the FASTQ/A files to memory before splitting the

file into nucleotide sequences, read identifiers, quality score sequences (for FASTQ only).

Sequence Segmenter

The Sequence Segmenter splits long nucleotide sequences and quality score sequences into

smaller segments to improve overall compression efficiency. Our study found that there exists

an ideal sequence length ("#$%&'&$() to maximize GeneSequeeze compression efficiency. In

this version of GeneSqueeze, a fixed value for ideal length was assumed for all

samples/datasets. The number and length of created segments from each nucleotide and

quality score sequence is determined using equations 1 and 2, respectively. The floor function,

denoted as )x* and employed in equations 1 and 2, yields the greatest integer less than or equal

to x.

(+,'-$./)

-$0'&$(

"#$%&'&$(

(1)

-$.'&$(/)

-$0'&$(

(+,'-$.

(2)

Where -$0'&$(, -$.'&$(!and (+,'-$. are the length of input sequences, length of created

segments, and number of segments, respectively, and the "#$%&'&$( is a hyperparameter that is

used to find the -$.'&$(.

The next block of the algorithm, Duplication Removal, necessitates equal segment lengths are

present for processing of nucleotide and quality score sequences, and thus, if 1-$0'&$(!2

!(+,'-$.3-$.'&$(45 the remainder ($678%'&$( bases4!of reads are separated and concatenated

together to make a long sequence to be processed separately.

$678%'&$(/-$0'&$(9(+,'-$.!3!-$.'&$(

(3)

Following is a numerical example:

!!:

-$0'&$(/;<=

!!"#$%&'&$(!/>?!

!!@!A

!(+,'-$./B

!-$.'&$(/>B

!!$678%'&$(/;

If the input to the Sequence Segmenter is a nucleotide sequence, the remainder sequence is

directed to the Nucleotide Sequence Regulator block; if the input is quality score sequence, the

remainder goes to the General-Purpose Compressor block.

Nucleotide Compression Branch

The segmented nucleotide sequences are passed to the nucleotide sequence compression

branch of the algorithm, which consists of the following blocks: Duplication Removal (DR),

Sequence Reversion, Random Sequence Selector, k-mer Dictionary Optimizer, k-mer Removal,

Semi-Duplication Removal (SDR), Nucleotide Sequence Regulator, and Nucleotide Sequence

Binary Encoding.

Duplication Removal (Nucleotide Branch)

The main objective of the Duplication Removal block is to find, group, encode and remove all

duplicate reads in an efficient manner (example in Fig 3). The block first creates a temporary

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

index, called original index, for each sequence to preserve a reference to the original sequence

order, during the sorting of the sequences alphabetically. Identical sequences are then grouped.

The first instance of the nucleotide sequence is labeled as the parent sequence and all

subsequent instances are designated as child sequences. An identifier label (dr_identifier)

stores information on parent and child sequences. If the sequence is denoted as a parent

sequence, the algorithm then identifies if the parent sequence has any child sequences, before

labeling them accordingly. If the sequence is a child sequence the identifier stores the index of

its parent sequence so the child’s sequence can be retrieved during decompression. The child

sequences are then removed and the parent sequences are re-sorted using their original index,

and the temporary original index value is removed.

The pseudocode presented below shows the duplication removal process:

FUNCTION duplication_removal(sequences)

1. Convert sequences to an array

2. Sort the sequences and keep track of their original indices

3. Group duplicate sequences and maintain original index values

4. Identify ‘parent’ sequence as first sequence of each group

5. Identify ‘parent’ index as index of each group

6. Prepare two data structures to store information about sequences and groups

One for the parent sequences

One for identifier label (dr_identifier)

7. Parent sequences are placed into a DataFrame and sorted based on original index

8. Each original index is assigned an identifier in dr_identifier

Index is associated with a parent sequence with at least one child THEN equals ‘M’

ELSE IF

Index is associated with a parent sequence with no child THEN equals ‘S’

ELSE IF

Index is associated with a child sequence THEN equals its parent index

9. Return the sorted sequences and the dr_identifier

The parent sequences are passed to the Sequence Reversion block and the dr_identifier is

encoded as a string and directed to the General-Purpose Compressor.

Semi-Duplication Removal

The Semi-Duplication Removal block finds similar, but not identical, nucleotide sequences, and

groups the sequences, before encoding and removing the similar part of the sequences within

each group in an efficient manner to reduce the compression ratio (example of workflow in Fig.

4). In this context, two sequences are considered semi-identical if their initial non-identical

character fall within a certain number of characters from their respective endings, defined as the

sdr_threshold. While the Semi-Duplication Removal block focuses on similarities at the

beginning of the sequence, there can also be similarities at the end of the sequence. To ensure

removal of similar sequences at the end of a sequence, the GeneSqueeze algorithm sends

sequences to the Sequence Reversion block prior to inputting the sequences into another

instance of the SDR block.

In SDR operation, the sequences are first ordered alphabetically with their original indices being

tracked. Once sorted, semi-duplicate nucleotide sequences are referred to as groups. For each

group of nucleotide sequences, only the first instance of the sequence is designated as the

parent sequence. The remaining semi-duplicate sequences within the group are designated as

the child sequences. Each child sequence is assigned a child index that indicates which parent

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

each child sequence belongs to. GeneSqueeze examines the non-identical nucleotides between

two sequences. We specifically encode the subsequence beginning from the first non-identical

character and extending to the last character of the sequence which we define as “noise”. The

algorithm identifies noise between the first child and its parent and between any subsequent

child with its preceding child. The length of the “noise sequence” is also encoded as “noise

length”. Noise length along with child identifier indices are passed to the General-Purpose

Compressor (noise length, child index). Meanwhile, the noise sequences are directed to

Nucleotide Sequence Regulator block. Finally, the parent sequences are resorted using their

original indices and directed to Sequence Reversion or Nucleotide Sequence Regulator, if it is

the first or the second SDR operation, respectively.

The pseudocode presented below shows the semi-duplication removal process:

FUNCTION semi_duplication_removal(sequences, sdr_threshold)

1. Convert sequences to an array

2. Sort the sequences and keep track of their original indices

3. Calculate the differences between adjacent sorted sequences

4. Prepare data structure to store information about sequences, groups, noise

sequence, and noise lengths

5. Create groups of sequences based on the cut points

6. Identify the first sequence in each group as parent sequence and rest of the

sequences as child sequences

7. Calculate the noises between each sequence and its previous sequence for each

group

8. Calculate the length of obtained noises

9. Build a DataFrame to organize the information for the child indices

10. Sort sequences by original indices

11. Extract the indices of the child sequences as child indices

12. Join noise sequences as a single string

13. Return the child indices, noise lengths, noise sequences, and the parent sequences

Sequence Reversion

As mentioned in the Semi-Duplication Removal block, this algorithm is crafted to identify and

eliminate sequences that exhibit semi-duplication either at their beginning or end. As the SDR

block only identifies duplicates at the beginning of a sequence, we have to reverse the

sequence to remove duplicates found at the end of the sequence. These reversed sequences

are then sent through the SDR block, depicted in Fig. 2. After passing through SDR, the output

parent sequences are reverted to their original orientation using the Sequence Reversion block.

k-mer Dictionary Optimizer

k-mers denote subsequences of length ‘k’ extracted from longer nucleotide sequences. k-mers

serve as ideal units for partitioning nucleotides to enhance algorithmic efficiency. The frequently

occurring k-mers can be identified, assigned short indices, and subsequently eliminated to

conserve space. Each selected k-mer and its associated index is stored as a key-value pair in a

dictionary. There exists a trade-off between the length and frequency of k-mers. The elimination

of longer k-mers can result in larger deletions overall, but only if these k-mers are sufficiently

high in frequency. Removing shorter k-mers results in removing less information per sequence;

however, given their higher likelihood of occurrence, removal of high frequency short k-mers

can lead to a greater number of nucleotides being eliminated.

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

This dichotomy indicates that it is imperative to ascertain the optimal k-mers to maximize the

compression ratio. As there are many k-mers within a dataset we can store the optimal k-mers

in a k-mer dictionary which can then be used for compression. The k-mer dictionary can contain

many k-mers of numerous k lengths; given the myriad of combinations, the creation and storage

of this dictionary can be an expensive process. To enhance cost effectiveness of the algorithm,

GeneSqueeze creates an initial k-mer dictionary using a small random subset of nucleotide

sequences, assuming similar distribution of k-mers across the whole dataset. The created k-mer

dictionary can then be used to reduce redundancy across all remaining nucleotide sequences.

In theory, the storage space required for both the k-mer dictionary and the additional encoding

information after k-mer removal can exceed the space saved by removing k-mers. A potentially

confounding factor is that the removal of each k-mer from the remaining sequences may affect

the frequency of other k-mers being found within the sequences. Thus, we need an algorithm to

iteratively find the ‘best’ remaining k-mer and assess the worth of adding that k-mer to the

dictionary, considering both space savings and encoding costs. Therefore, a dictionary with both

an optimal count of k-mer – index pairs and of k-mers of optimal sizes is desirable and thus

these two optimization problems are inherently coupled together and require simultaneous

computation. The optimal k-mer dictionary is specific to each sample’s sequences, and can vary

significantly between samples. Thus, in this block, an optimization strategy is presented to solve

the optimization problems and find an optimal k-mer dictionary for each sample so that in k-mer

Removal Block we encode the k-mers for each sample using its ad-hoc k-mer dictionary (Fig.

5).

Given the fact that nucleotide sequences will align to specific locations in a reference genome,

they are not inherently randomly distributed. This indicates there is a high level of imbalance for

the substrings prior to or next to each k-mer, which is exploited to achieve higher levels of

compression in our algorithm. GeneSqueeze finds high frequency k-mers and filters their prefix

and/or suffix based on given thresholds. During encoding, it looks for an exact match for the

selected optimal k-mers and then checks if the substrings before and after them match with the

identified prefix and suffix, respectively. If not, the difference between them is encoded as noise.

Among the hyperparameters of this algorithm, the k values for each k-mer in the k-mer

dictionary play a significant role. These k values are not limited to a single number but can

consist of multiple values. A pre-selected list of k values, referred to as k_list, is considered for

the k-mer selection process. As previously noted, there exists a trade-off between the length

and frequency of k-mers, making it crucial to ascertain the optimal k-mers. In deciding which k-

mer is more qualified to be first added to the k-mer dictionary, the algorithm takes into account

the storage saved by removing the k-mer from sequences. A simplistic example which includes

a small k_list and is without affix: if k_list=[k1, k2] where k2 > k1, and if frequency of a selected

k2-mer is higher than a selected k1-mer, the selection of k2 will lead to larger improvements in

compression ratio. Conversely, if the frequency of the selected k2-mer is lower than that of the

selected k1-mer, a trade-off between their length and frequency is taken into account. The

algorithm outlined in this section streamlines the process of selecting k-mers using these

calculations.

As can be seen in the diagram of Fig. 5, for every k value within k_list, internal competition

arises among the k-mers at each iteration. Subsequently, winners of these internal

competitions, one for each k in k_list, engage in an external competition to determine the overall

external winner. The victor of the external competition joins the optimal k-mer dictionary, while

the external losers revert to their respective internal competitions. The internal and external

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

competitions are done based on internal competition score (CDE4 and external competition score

(FDE4, and are presented in equations 4 and 5, respectively.

CDE/G

!"#$

1=H&$(1%II"64JK4

(4)

FDE/;G

!"#$

KH&$(

%II"6

%#&'()*

3N O PQC

+,--./

(#&

+,--./

%23

H O PQC

%$#-./

(#&

%$#-./

423

(5)

In this study, G represent the frequency of k-mer. Furthermore, M

%#&'()*

is the penalty factor for

the fraction of nucleotides that does not match the prefix-suffix. This penalty factor helps to

include the cost of prefix-suffix mismatch in calculation of FDE and the cost function presented in

equation 9. It is imperative during encoding, as elucidated in the k-mer removal block, to store

not only the correct type of nucleotides but also the position and quantity of noises. As is shown

in equation 6, the existence index for position

character)

of affix, FC

5 is defined as the total

frequency of all bases in position

, U

5 over the frequency of the k-mer itself, G. !

!JG!

(6)!

Also, as indicated in equation 7, the dominance index!

!!for position

of affix is the frequency of

the dominant base position

, X

5 over the total frequency of all bases in position

of the k-mer

affix, U

!J!U

(7)!

The non-dominance index!

YVW

!!for position

of affix is presented in equation 8.!

PQC

/=9QC

(8)!

An example for the prefix-suffix identification process, including calculation of the dominance

and existence indices, is presented in Fig. 6.

ZW'[\]^_\`ab!

and

VW'[\]^_\`ab!

are the considered

thresholds for

and

, respectively. As depicted in Fig. 6., the frequency of the dominant

nucleotide at each specific position before and after the selected k-mer is shown in blue. Among

these, the green nucleotides surpass EI and DI thresholds (

ZW'[\]^_\`ab!

and

VW'[\]^_\`ab

) and

are selected as the k-mer prefix and suffix.

This process persists until one of the following stopping criteria is fulfilled:

1- The number of consecutive cycles with an increase in cost function score exceeds the

“tolerance_threshold."

The algorithm allows for temporary increases in the cost function score over a set

number of cycles, denoted as “tolerance_threshold”. If the number of cycles proceeds

beyond the set threshold then the process is stopped. The algorithm then removes the

last tolerance_threshold elements (i.e. the last k-mer that was added to the k-mer

dictionary) added to the dictionary, and passes the finalized dictionary to the subsequent

stage. The tolerance_threshold value remains constant throughout the process.

2- The length of the k-mer dictionary reaches the max_dict_len limit.

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

10!

In this study, the “data size” is denoted by c. By using the following formulas (equations 9-19),

which have been derived based on the GeneSqueeze algorithm, without compressing the data,

we can have a rough estimation of c

67"%$#++#89+#4+

in bits as a function of #"d7'&$( for the

existing distribution of k-mers in the FASTQ/A file. #"d7'&$( denotes the number of k-mers

added to the dictionary. To calculate c

67"%$#++#89+#4+

in equation 9, c

7$.:.&'(9 ;# 4+

is calculated

based on equation 10, c

!"#$ 9'((

based on equations 11 and 13-17, and c

'--./9'((

based on

equations 12, 18 and 19. In equation 9, it is expected that c

!"#$ 9'((

'--./9'((

will be negative.

As such, it follows that k-mer removal is a better option than simply computing binary encoding

on all nucleotide sequences. It is worth noting that in the cost function, which is designed to

provide an approximate estimation of actual compressed data, all bases are assumed to be

regular, forming the basis of equation 10. In this equation -$0'#$e7f and -$0'&$(.7f are the

number and length of the input nucleotide sequences of this block, respectively.

67"%$#++#89+#4+

7$.:.&'(9 ;# 4+

!"#$ 9'((

'--./9'((

(9)

7$.:.&'(9 +# 4+

/;3-$0'#$e7f3-$0'&$(.7f

(10)

!"#$ 9'((

/9c

$#"7<#89!"#$9+#4+

!"#$ 9%7+.).7&

!"#$ 9.& 8 # /

!"#$ 9#/.+)#&6#

!"#$ 98.6).7&'$*

(11)

'--./9'((

/19c

$#"7<#89'--./ 9;# 4+

'--./9&7.+#9%7+.).7&

'--./9&,"=#$9&7.+#

'--./9&7.+#9)*%#

!4Hc

'--./98.6).7&'$*

'--./9+#4+

'--./98.6).7&'$*

(12)

$#"7<#89!"#$9+#4+

/ O G

!"#$

8.6)9(#&>3

.2?

3g;3K

!"#$

(13)

!"#$ 9%7.).7&

/ O G

!"#$

8.6)9(#&>3

.2?

3i1-$0'&$(.7f9K

!"#$

4!j

(14)

!"#$ 9.& 8 # /

/ O NG

!"#$

3&$(1kC

!"#$

8.6)

!"#

.2?

(15)

!"#$ 9#/.+)#&6#

/-$0'#$e7f

(16)

!"#$ 98.6).7&'$*

/ O Ng;3K

!"#$

hHi&l.

A(#&

!$%&

!jH&$(1kC

!"#$

8.6)

!"#

.2?

(17)

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

11!

'--./9+#4+

/ O ;G

!"#$

8.6)

!"#

.2?

&$(g%II"6

!"#$

h9M

%#&'()*

3N O PQC

(#&A+,- -./

'(")

%23

H O PQC

(#&A%$#-./

'(")

423

(18)

'--./98.6).7&'$*

;3;3&$(

%II"6

!"#$

8.6)

!"#

.2?

i&l.

+#4 9(# & : )5>!

'(")*$+

i&l.

+#4 9(# & : )5>!

'(")

>(#&0+,--./

'(")

(19)

where, HI stands for binary code length of Huffman encoded k-mer in k-mer dictionary.

F6"-7$(d$ is a binary vector of length -$0'#$e7f to show which nucleotide sequence include any

k-mer of the final dictionary.

The dictionary optimizer obtains the optimum #"d7'&$( value to have the c

67"%$#++#89+#4+

minimized.

Random Sequence Selector

As we explained, k-mer dictionary creation is an expensive process. To reduce computational

load we utilized the Random Sequence Selector block, which randomly selects a subset of

nucleotide sequences to be used in k-mer Dictionary Optimizer block. The percentage of

sequences selected is defined by random_subset %.

k-mer Removal

The k-mer removal block removes duplicative k-mers in the dataset. The function first checks

each nucleotide sequence of the input sequences to determine if there is a full match for the k-

mers in the finalized dictionary. If there is a match, it removes the k-mer and its affix and

encodes its existence (one bit per nucleotide sequence), k-mer index, k-mer position in

nucleotide sequence and its index (dictionary value), as well as the affix noises. After removing

the k-mer and its affix, the subsequences before and after the removed part of the sequence are

concatenated and sent to the Nucleotide Sequence Regulator block. The untouched nucleotide

sequences (nucleotide sequences with no identified k-mers) are passed to the Semi-Duplication

Removal block. The process is shown in Fig. 7.

Nucleotide Sequence Regulator

This block encodes any bases presented in Table 1 other than A, C, G and T. As is shown in

Fig. 2, the input for this block is the output of all other blocks that are in the nucleotide sequence

form. Given that the predominant characters in nucleotide sequences are usually A, C, G, and

T, and considering that four characters can be encoded using only two bits in binary, this block

encodes the information of non-ACGT characters separately and then replaces them with A in

the nucleotide sequence. Therefore, the output of this block is the nucleotide sequences that

only include A, C, G and T, as well as the position of non-ACGT characters. They are directed

to Nucleotide Sequence Binary Encoding and General-Purpose Compressor blocks,

respectively.

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

12!

This block assumes by default that all irregular (non-ACGT) nucleotides characters are N (which

is the most common irregular nucleotide) and encodes their nucleotide sequence position. In

cases where any irregular non-N nucleotide is observed, as well as the position, its actual

character is also encoded.

Nucleotide Sequence Binary Encoding

The nucleotide sequences are then encoded in binary using the hash table presented in Table

2. As mentioned in the Nucleotide Sequence Regulator block, irregular nucleotides are encoded

as A: [0 0] since their information is recorded separately.

Quality Score Compression Branch

The segmented quality scores are passed to the quality score compression branch of the

algorithm, which has one block dedicated to duplication removal.

Duplication Removal (Quality Score)

After quality scores are segmented in Sequence Segmenter block, they proceed through the

Duplication Removal block. This block searches across the quality score sequences to find and

group identical sequences and encodes only the first instance of a specific quality score

sequence of each group, designated as the parent sequence. The other sequences of each

group are designated as child sequences and are referred back to the parent in the same

fashion as nucleotide sequences. The encoded files are sent to the General-Purpose

Compressor.

Read Identifier Compression Branch

The segmented read identifiers are passed through the read identifier compression branch of

the algorithm which has one block in which the read identifiers are encoded.

Read Identifier Encoder

The Read Identifier Encoder block stores the first read identifier. The block then identifies which

portions of the identifier are fixed or variable and subsequently encodes only the change in

variables for each ensuing read.

General-Purpose Compressor

All the generated files are then sent to a general-purpose compressor to achieve a higher

compression ratio. GeneSqueeze uses BSC, which is a general-purpose compressor built

based on the Burrows-Wheeler Transform (BWT).

Results and Discussion

We conducted a case study where we compared GeneSqueeze against two other algorithms for

the compression of nucleotide-containing FASTQ files: (1) the prevalent general-purpose

compressor, gzip (version 1.9) and (2) the domain-specific compressor, SPRING (version

1.1.1).

Dataset

We tested the performance of each compression algorithm on a collection of datasets that

totaled 1572 GB in uncompressed size. It contained 283 FASTQ files, ranging in length (Table

3), depth (Table 4), and size (Table 5). The full list of the datasets used for experiments is

found in Table S1. In these experiments, each FASTQ file is compressed independently. The

characteristics of each dataset can directly affect the performance of a compression algorithm,

thus algorithm performance was analyzed by characteristics. Natively, these datasets contained

N but no non-N irregular nucleotides. So, to test the losslessness of non-N irregular IUPAC

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

13!

nucleotides, two N characters were replaced with non-ACGTN IUPAC characters in a FASTQ

file. Additionally, to alleviate memory consumption, four 50+ GB files were split into four chunks

each during GeneSqueeze encoding and recombined after decoding.

Hyperparameters

The values of the hyperparameters for GeneSqueeze are presented in Table 6. GeneSqueeze’s

k-mer Dictionary Optimizer block selected the optimal length of the k-mer dictionary and the k-

mers via the optimization process presented in the Methods section (k-mer Dictionary

Optimization block). In the current version of GeneSqueeze, we employed a static value for the

rest of the hyperparameters, which is not optimal. Our future directions include upgrading the

algorithm to dynamically determine and adapt the optimal values of the hyperparameters for

each dataset or sample to further improve our compression ratio.

Evaluation metrics

We measured the performance of GeneSqueeze against gzip and SPRING using the evaluation

metrics presented in equations 20 and 21:

67"%$#++#8

7$.:.&'(

.2?

r!!

(20)

(7++(#++

DEF9"')65#8

(21)

where

!and s

(7++(#++

are average compression ratio (%) and losslessness ratio (%),

respectively. Here, , represents the total number of FASTQ files in the test set, while

tQ<',%7df$# denotes the number of samples with matched MD5 sums. The metric

compression ratio was chosen to assess the total reduction in file size after compression, and

losslessness ratio was chosen to understand if total data integrity was preserved after

compression. These represent important factors in the overall efficiency and integrity of each

compression algorithm.

Compression results

All three algorithms were able to significantly compress the datasets. The average compression

ratio of SPRING and GeneSqueeze were similar (7.61% vs 7.70%), and both were significantly

lower than gzip (21.20%) (Table 7). FASTQ files can vary in size, read depth and read length,

characteristics which we hypothesized could lead to changes in algorithmic efficiency. To

understand which FASTQ files GeneSqueeze excelled at compressing, we examined

compression ratio amongst FASTQ files with differing dataset sizes (Table 8), read depths

(Table 9), and read lengths (Table 10). We found that both GeneSqueeze and SPRING

outperformed gzip in all categories. We found that GeneSqueeze compressed files of size 8-

12GB most efficiently and was least efficient at compressing large files (54-57 GB). Notably,

GeneSqueeze compressed smaller files (0-4GB) better than SPRING (Table 8). When

examining GeneSqueeze performance for differing read depths we found that GeneSqueeze

was best at compressing files with read depths of 35-65 million reads and performed worst

when compressing files with large read depths of 215-224 million reads. GeneSqueeze

outperformed SPRING on read depths of 1-20 and 20-35 million reads (Table 9). Next, we

examined compression ratio based on read length. We found that GeneSqueeze had the

greatest compression on files with read lengths of 50-51 nucleotides and least compression on

reads of 100-101 nucleotides. GeneSqueeze outperformed SPRING on files with read lengths

of 35-36, 50-51, and 202 nucleotides (Table 10).

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

14!

Quality scores are typically more challenging to compress than nucleotide sequences.

Thus,

quality score compression represents an area to be improved in the field. GeneSqueeze

uniquely pre-compresses quality scores with a domain-specific compression process and then

feeds that to the general-purpose compressor. Our method allows us to further reduce the file

size dedicated to quality scores without any loss or impact on downstream analysis. The pie

charts in Fig. 8 indicate how much - on average - each component of a FASTQ file contributes

to the total size, before and after GeneSqueeze compression. The difficulty in compressing

quality scores due to its aforementioned characteristics is indicated by 60% of GeneSqueezed

FASTQ file sizes, on average, being allotted to quality scores (Fig. 8). We observed that read

identifiers were the easiest to compress due to their fixed template.

We have also performed algorithm development and implementation for decompression of

GeneSqueezed FASTQ files. To examine data accuracy and integrity of each compression

method, we generated MD5 checksums for the original and decoded FASTQ files for all 283

files in the datasets. We observed 100% losslessness for GeneSqueeze and gzip, with 283/283

files having an MD5 match. However, SPRING experienced some loss during decoding.

SPRING had an MD5 match for 10/283 files for 3.5% losslessness. As has been mentioned in

the SPRING Supplementary Data, SPRING discards any metadata or comments that follow the

"+" in the second read identifier line (the line after nucleotide sequence and before quality

score).

As our datasets did not include non-N irregularities, we changed two nucleotides in a FASTQ

file to non-N irregularities in a separate experiment. We observed that GeneSqueeze and gzip

losslessly encoded and decoded the file and that the MD5 hashes for the original and decoded

files matched. However, SPRING was not able to decode non-ACGTN letters that exist in the

IUPAC nucleotide code, meaning it was not IUPAC lossless. Overall, GeneSqueeze

outperformed competitors by compressing file sizes with a compression ratio on par with that of

current state-of-the-art compressors, all while maintaining losslessness.

Conclusion

Nucleotide sequencing data contains intricate patterns of redundancy and variation that require

specialized data compression techniques. The unique features of nucleotide sequences, such

as hierarchical structure and redundancy, can be exploited to achieve high compression ratios

while minimizing the loss of information. Developing effective compression algorithms for

sequencing data requires a deep understanding of the data and the biological processes that

generate it. The GeneSqueeze algorithm is capable of compressing FASTQ and FASTA data

containing nucleotide sequences and is designed to be lossless for all parts of the FASTQ

format, including the read identifier, quality score, and nucleotide sequence. GeneSqueeze

relies on reducing the dimension and redundancy in genomic data in a unique and optimal way,

and prior to storage in binary format.

The results of our comparison demonstrate the effectiveness and efficiency of our novel data

compressor, with high and accurate data compression in our experiments. GeneSqueeze

implements a number of functionalities that not only enhance compression ratios, but allow it to

handle sequencing data losslessly.

This algorithm’s unique key features are:

1. Presenting a dynamic ad-hoc protocol for optimally encoding high frequency k-mers

2. Exploiting the redundancy in quality scores before passing them to the general

compressor

3. Encoding and losslessly decoding all IUPAC characters

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

15!

4. Expressing no limitation in read length / depth, allowing for flexibility in long-read / high-

throughput FASTQ/As compression.

5. Compatible with all FASTQ/As identifier formats.

Overall, we observed that GeneSqueeze performed competitively against SPRING, with both

performing better than gzip for compression ratios. GeneSqueeze and gzip maintained

losslessness, whereas SPRING experienced loss in the majority of samples, based on MD5

matches. Due to GeneSqueeze’s novel approach, competitive performance, and complete

preservation of genomic data, we believe GeneSqueeze is able to support the growing

application of omics technologies in biomedical or clinical research. Our forthcoming emphasis

will be on implementing GeneSqueeze in a more efficient programming language and in a

manner that optimizes memory usage. GeneSqueeze's current Python implementation reveals

certain drawbacks in speed and memory usage when juxtaposed with SPRING and gzip, which

are predominantly implemented in C/C++. Additionally, it is imperative to upgrade the algorithm

to dynamically optimize the values of hyperparameters across multiple blocks, ensuring optimal

adaptation for each dataset or sample.

Acknowledgments

The authors would like to express their gratitude to individuals from Rajant Health Incorporated

for valuable discussions and technical contributions, and to Rajant Corporation for providing

funding support for this project.

The SPRING software used by the authors, was developed at the University of Illinois at

Urbana-Champaign and Stanford University.

Author contributions

Algorithm development: F.N.

Implementation: F.N., S.P.

Figure Generation: F.N., M.L.

Pseudocode Generation: F.N., R.C.

Conceptualization: F.N., S.P., E.K.M., G.S.

Manuscript Writing: F.N., M.L., R.C., E.K.M., G.S.

Review: F.N., S.P., E.K.M., G.S., M.L. R.C.

All authors revised and approved the manuscript.

Competing interests

Authors F.N., S.P., M.L., R.C., G.S., and E.K.M. were employed by Rajant Health Incorporated.

F.N., S.P., G.S., and E.K.M. have a patent pending for the GeneSqueeze compressor:

“Fastq/fasta compression systems and methods,” 2022-11-18: Application filed by Rajant Health

Incorporated. 2023-05-25: Publication of WO2023092086A2. F.N., S.P., M. L., G.S., and

E.K.M. have a provisional patent submitted for the GeneSqueeze compressor: “GeneSqueeze:

A Novel Method for Compression of FASTQ/A Files,” Application filed by Rajant Health

Incorporated. 2024-02-16: Application No. 63/554,788.

Data Availability

Datasets are sourced from the public databases listed in the Supplementary data.

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

16!

References

1. National Human Genome Research Institute. DNA Sequencing Costs: Data.

https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data.

2. National Human Genome Research Institute. Genomic Data Science Fact Sheet.

https://www.genome.gov/about-genomics/fact-sheets/Genomic-Data-Science (2022).

3. Hernaez, M., Pavlichin, D., Weissman, T. & Ochoa, I. Genomic Data Compression.

Annu. Rev. Biomed. Data Sci. 2, (2019).

4. Koboldt, D. C. Best practices for variant calling in clinical sequencing. Genome Med. 12,

91 (2020).

5. Sheng, Q. et al. Multi-perspective quality control of Illumina RNA sequencing data

analysis. Brief. Funct. Genomics 16, 194–204 (2017).

6. Gailly, J. & Adler, M. gzip. https://www.gzip.org/.

7. Huffman, D. A Method for the Construction of Minimum-Redundancy Codes. Proc. IRE

40, 1098–1101 (1952).

8. LEMPEL, A., MEMBER, ZIV, J. & FELLOW. On the Complexity o f Finite Sequences.

9. Kryukov, K., Jin, L. & Nakagawa, S. Efficient compression of SARS-CoV-2 genome data

using Nucleotide Archival Format. Patterns (N Y) 3, 100562 (2022).

10. Roguski, L., Ochoa, I., Hernaez, M. & Deorowicz, S. FaStore: a space-saving solution for

raw sequencing data. Bioinformatics 34, 2748–2756 (2018).

11. Bonfield, J. K. & Mahoney, M. V. Compression of FASTQ and SAM format sequencing

data. PLoS ONE 8, e59190 (2013).

12. Benoit, G. et al. Reference-free compression of high throughput sequencing data with a

probabilistic de Bruijn graph. BMC Bioinformatics 16, 288 (2015).

13. Hach, F., Numanagic, I., Alkan, C. & Sahinalp, S. C. SCALCE: boosting sequence

compression algorithms using locally consistent encoding. Bioinformatics 28, 3051–3057

(2012).

14. Chandak, S., Tatwawadi, K., Ochoa, I., Hernaez, M. & Weissman, T. SPRING: a next-

generation compressor for FASTQ data. Bioinformatics 35, 2674–2676 (2019).

15. Deorowicz, S. FQSqueezer: k-mer-based compression of sequencing data. Sci. Rep. 10,

578 (2020).

16. Petagene. Petagene. https://www.petagene.com/.

17. Lan, D., Tobler, R., Souilmi, Y. & Llamas, B. Genozip: a universal extensible genomic

data compressor. Bioinformatics 37, 2225–2230 (2021).

18. Kokot, M., Gudyś, A., Li, H. & Deorowicz, S. CoLoRd: compressing long reads. Nat.

Methods 19, 441–444 (2022).

19. Chen, S. et al. Efficient sequencing data compression and FPGA acceleration based on

a two-step framework. Front. Genet. 14, 1260531 (2023).

20. El Allali, A. & Arshad, M. MZPAQ: a FASTQ data compression tool. Source Code Biol.

Med. 14, 3 (2019).

21. Chandak, S., Tatwawadi, K. & Weissman, T. Compression of genomic sequencing reads

via hash-based reordering: algorithm and analysis. Bioinformatics 34, 558–567 (2018).

22. Rivest, R. The MD5 Message-Digest Algorithm. https://www.ietf.org/rfc/rfc1321.txt

(1992).

23. Cock, P. J. A., Fields, C. J., Goto, N., Heuer, M. L. & Rice, P. M. The Sanger FASTQ file

format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.

Nucleic Acids Res. 38, 1767–1771 (2010).

24. Grebnov, I. High performance data compression library. http://libbsc.com/.

25. Chandak, S. Spring v1.1.1. https://github.com/shubhamchandak94/Spring/tree/v1.1.1.

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

17!

Figure Legends

Figure 1. An example FASTQ file

Figure 2. GeneSqueeze Compression Flow-Diagram

Figure 3. GeneSqueeze Duplication removal process

Figure 4. GeneSqueeze Semi-duplication removal process

Figure 5. GeneSqueeze k-mer dictionary optimizer process

Figure 6. GeneSqueeze Prefix-suffix identification process for EI_threshold = 0.6 and

DI_threshold = 0.5

Figure 7. GeneSqueeze k-mer removal process

Figure 8. Average contribution of each of three parts of FASTQ files in a) uncompressed and b)

GeneSqueezed files

Tables

Table 1. IUPAC (International Union of Pure and Applied Chemistry) single letter codes for

DNA/RNA.

Symbols

Name

Remarks

Adenine

Purine

Guanine

Purine

Cytosine

Pyrimidine

Thymine

Pyrimidine

Uracil

Pyrimidine

Purine

A or G

Pyrimidine

C or T/U

A or C

G or T

Strong

C or G

Weak

A or T

Not G

A or C or T

Not A

C or G or T

Not U/T

A or C or G

Not C

A or G or T

Ambiguous

A or C or G or T

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

18!

Table 2. Nucleotides binary encoding hash table

Nucleotide

Binary code

Table 3. FASTQ file size (GB) of the datasets used for experiments.

FASTQ File Size (GB)

Number of Files

0-4

156

4-8

8-12

54-57

Table 4. Read depth (million reads per sample) of the datasets used for experiments.

Read Depth (million reads per sample)

Number of Files

1-20

172

20-35

35-65

215-224

Table 5. Read length (nucleotide sequence) of the datasets used for experiments.

Read Length (nucleotide sequence)

Number of Files

35-36

106

50-51

107

100-101

202

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

19!

Table 6. Hyperparameter values used by the GeneSqueeze algorithm.

Hyperparameter

Value

k_list

[10, 25]

random_subset

EI_threshold

0.1

DI_threshold

0.5

tolerance_threshold

max_dict_len

top_check

%#&'()*

"#$%&'&$(

sdr_threshold (1st SDR block)

sdr_threshold (2nd SDR block)

Table 7. Average compression ratio (%) of samples uncompressed and compressed with gzip,

SPRING, or GeneSqueeze compression algorithms.

Uncompressed

Compressed

gzip

SPRING

GeneSqueeze

100%

21.20%

7.61%

7.70%

Table 8. Average compression ratio (%) of samples by FASTQ file size (GB).

Compressed

FASTQ File Size (GB)

gzip

SPRING

GeneSqueeze

0-4

20.50%

8.01%

7.90%

4-8

23.17%

8.10%

8.35%

8-12

19.42%

4.96%

5.12%

54-57

32.63%

14.56%

18.76%

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

20!

Table 9. Average compression ratio (%) of samples by read depth (million reads per sample).

Compressed

Read Depth

gzip

SPRING

GeneSqueeze

1-20

22.11%

9.01%

8.99%

20-35

20.33%

6.43%

6.25%

35-65

18.69%

4.31%

4.57%

215-224

32.63%

14.56%

18.76%

Table 10. Average compression ratio (%) of samples by read length.

Compressed

Read Length

gzip

SPRING

GeneSqueeze

35-36

20.59%

7.69%

7.56%

50-51

17.82%

5.36%

5.25%

100-101

27.16%

10.60%

11.64%

202

28.39%

13.93%

11.62%

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

Figure 1.

@SRR6380737.1 1 length=51

GCCTGTTCCATCTCCTCGTCCTTCTCTGCCAGCTTCCGCTCGATCTCTGCC

+SRR6380737.1 1 length=51

BBBFFBFBBBF<FBBFFB00<0BFFF<B7BB''B<FBFI<FB'7<F'7'0B

@SRR6380737.2 2 length=51

TGGAGGTGGGGGCTGAGGCTGGGGAGTCACTTTTGGGATTCAAGGGACCTT

+SRR6380737.2 2 length=51

BBBBFFB0FFFBBFIIBFFIFBB00B0<FB77BFFFI0<BBFBBBFFBBB<

Sample FASTQ File:

Content:

Identifier

Nucleotide Sequence

Quality Scores

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

Duplication

Removal

General- Purpose

Compressor

Semi- Duplication

Removal

Random Sequence

Selector

Sequence

Segmenter

k- mer Dictionary

Optimizer

k- mer Removal 

Semi- Duplication

Removal

Nucleotide Seq Regulator

Nucleotide Seq

Binary Encoding

k- mer Dictionary

Auxiliary Info

Compressed 

FASTQ/A

nt sequences

Random Subset of nt

sequences

k- mer Dictionary

nt sequences

Binary nt sequences

SDR Extra

nt sequences

k- mer Removal

Remaining nt Sequences

QS sequences

encoded template

and variables

nt Seq DR Auxiliary Info

SDR Auxiliary Info

SDR Auxiliary

Info

k- mer Removal

Auxiliary Info

FASTQ/A File

Final Compressed ﬁles

nt Sequence

Regularizer

Auxiliary Info

FASTQ/A Reader

nt sequences

SDR Extra

nt

sequences

nt sequences

QS DR Auxiliary Info

Read Identiﬁer

Encoder

QS sequences

Sequence Reversion

nt sequences

QS sequences

Read Identiﬁers

nt sequences

Sequence Reversion

nt sequences

split reminder - nt Sequencesplit reminder - QS Sequence

Figure 2.

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

Original Index

Original Read

ACCGAC

TCACGA

CCGTAG

ACCGAC

TCACGA

CCGTAT

ACCGAC

TCACGA

CCGTAA

Original Index

Original Read (A→Z)

ACCGAC

CCGTAA

CCGTAG

CCGTAT

TCACGA

Original Index

Parent Read

1,4,7

2,5,8

ACCGAC

CCGTAA

CCGTAG

CCGTAT

TCACGA

Parent Identifier

Parent’s Original Index

Parent’s Read

ACCGAC

TCACGA

CCGTAG

CCGTAT

CCGTAA

Original Index

DR_Identifier

Duplication Removal:

and

Figure 3.

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

Figure 4.

Original Index Original Read

ACCCCCC

AAAAAAA

TCGGTAC

ACCCCCA

ACCCCTT

TGGTTAC

Original Index

Original Read (A→Z)

AAAAAAA

ACCCCCA

ACCCCCC

ACCCCTT

TCGGTAC

TGGTTAC

Semi-Duplication Removal:

Original Index

Original Parents

AAAAAAA

TCGGTAC

ACCCCCA

TGGTTAC

concatenated

as a string

Child Index

Noise

Noise Length

and

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

Figure 5.

count k- mers

frequencies

k=k_list(i)

Random Subset

k_list

i=0

sort k- mer based on

frequency

count nucleotides

frequencies in kmers

aﬃx positions

trim aﬃx of each kmer

based on

existence_index and

dominance_index ﬁlters

calculate the initial ﬁxed

length for aﬃx

trim aﬃx using

calculated initial length

kmer & aﬃx

initial aﬃx

length

i = len(k_list) ?

i=i+1

Yes

Calculate initial ICS for

top_check number of k-

mers of all k values

used_read_list= []

update the

used_read_list with

adding the index of

reads that include the

external winner

update kmer frequencies

trim aﬃx of each kmer

based on

existence_index and

dominance_index ﬁlters

update IC Score of ﬁrst

ranked k- mer

ICS (1st rnaked kmer) >=

ICS (2nd ranked kmer)

Update ranking of 1st

ranked kmer according

its ICS

add 1st ranked kmer as

winner of internal

competition for k

k_list

i=0

k=k_list(i)

i = len(k_list) ?

i=i+1

Yes

calculate ECS for all

internal winner kmers

add kmer with lowest

ECS as external winner

and add it to the kmer

dictionary

stopping

creteria met?

Yes

Optimal kmer Dictionary

Yes

calculate cost function

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

Figure 6.

Prefix

K-mer Suffix

Dominant Frequency

Total Frequency

K-mer Frequency

Existance Index

Dominance Index

Final Prefix, K-mer, Suffix

ACGTACCT

Preﬁx - Sufﬁx Identiﬁcation:

0.55

0.43

0.63

0.43

0.77

0.71

0.82

0.67

102

110

0.93

0.69

0.86

0.74

0.70

0.78

0.69

0.46

0.57

0.79

0.46

0.49

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

Figure 7.

Read # Read Sequences

ACCGGACGCAAG

TCACGATACGAT

ATATTGCGTAGC

ACGGACGCAACG

TCACGAAACGAA

CCGGGACGCATT

ACCGACTCGACT

TCACGGACGCAC

CCGTAAGGTAAG

Index

Prefix

K-mer

GACGC

TATTG

Suffix

CGT

K-mer Removal:

and

K-mer Dictionary

Read #

K-mer Existence

K-mer Index

K-mer Location

Comp Change Location

1,4

Comp Change Type

Untouched Reads

TCAGCATACGAT

TCACGAAACGAA

ACCGACTCGACT

CCGTAAGGTAAG

K-mer Removal

Remaining Nucleotides

ACG

AGC

ACG

CCT

TCA

and

Ecoded K-mers

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint

Figure 8.

Quality Scores

27.3%

45.4%

Sequences

Identiﬁers

Quality Scores

60%

5.1%

Identiﬁers

34.8%

Sequences

.CC-BY-NC-ND 4.0 International licenseavailable under a

(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprintthis version posted March 25, 2024. ; https://doi.org/10.1101/2024.03.21.586111doi: bioRxiv preprint