Digital Sequence Information on Genetic Resources: Concept, Scope and Current Use
Authors: Wael Houssen (1,2), Rodrigo Sara (3), Marcel Jaspars (1)
(1) Marine Biodiscovery Centre, Department of Chemistry, University of Aberdeen, Old Aberdeen AB24 3UE, Scotland, UK
(2) Institute of Medical Sciences, University of Aberdeen, Aberdeen AB25 2ZD, Scotland, UK
(3) Consultant to the Secretariat of the Convention on Biological Diversity
Table of contents
List of Figures
List of Tables
List of Abbreviations
1. Executive Summary
2. Introduction
3. Scientific Background
3.1. Discovery of DNA and its composition
3.2. RNA transcription, protein translation and biosynthesis of metabolites
3.3. Natural and synthetic modifications to DNA, RNA and proteins
3.3.1. DNA sequence modifications
3.3.2. Nucleotide modifications
3.3.3. Epigenetic modifications
3.3.4. Protein modifications
3.4. DNA sequencing technologies
3.5. DNA Sequencing
3.6. Genetic engineering
3.7. Synthetic biology
3.8. Techniques and databases used to study RNA, proteins and metabolites
3.8.1. Transcriptomics
3.8.2. Proteomics
3.8.3. Metabolomics
3.8.4. Databases
4. Sectors that rely on DSI and technologies/techniques enabled by DSI
4.1. Introduction
4.2. Taxonomy and conservation
4.2.1. Overview of sector
4.2.2. Key trends and examples
4.3. Agriculture and food security
4.3.1. Overview of sector
4.3.2. Key trends and examples
4.4. Industrial and synthetic biology
4.4.1. Overview of sector
4.4.2. Key trends and examples
4.5. Healthcare applications and discovery of pharmaceuticals
4.5.1. Overview of sector
4.5.2. Key trends and examples
4.6. Extent of reliance on DSI and technologies/techniques enabled by DSI
5. DSI: Scope and terminology
5.1. Introduction
5.2. Understanding the flow of data and information
5.3. New logical groupings & alternative terminology
5.3.1. Broad scope of subject matter: information associated with biological processing and subsidiary information
5.3.2. Intermediate scope: information associated with biological processing involving transcription, translation and biosynthesis
5.3.3. Intermediate scope: data/information associated with biological processes involving transcription and translation
5.3.4. Narrow scope: limited to nucleic acid sequence data associated with translation
5.4. Digital Sequence Information
5.4.1. Digital (OED)
5.4.2. Sequence (OED)
5.4.3. Information (OED)
5.5. Modifications DNA, RNA and protein sequences and their subunits
6. Conclusions and implications for future discussions concerning DSI
6.1. Subject matter groupings
6.2. Priority issues to clarify the concept of DSI
6.3. Subject matter groupings and life-science sectors
7. References
List of Figures
Figure 1. Digital sequence information on genetic resources and derivatives
Figure 2. Structures of DNA, RNA and nucleotides
Figure 3. The ‘Central Dogma of Molecular Biology’
Figure 4. Different types of modification that can be made to DNA and protein sequences
Figure 5. The significant reduction in the cost of genome sequencing over time
Figure 6. The flow of data/information from genetic resource through DNA, RNA and proteins to metabolites showing the limits/boundaries of some alternative terms used to refer to DSI
Figure 7. Proposed subject matter groupings to facilitate discussions concerning DSI scope and terminology
Figure 8. Main terminologies proposed to replace ‘DSI’ and different ways that the intermediate subject matter grouping could be interpreted
Figure 9. The relationship between nucleotide sequence, chemical structure and SMILES string of the same DNA sequence
List of Tables
Table 1. The genetic code
Table 2. Comparison of currently available NGS platforms
Table 3. Use of DSI-related technologies in different sectors
Table 4. Scope of the different current terminologies showing the subject matter groupings
Table 5. Applying the proposed DSI subject matter groupings to the different life-sciences sectors
List of Abbreviations
BBNJ: A process to develop an international legally binding instrument on the conservation and sustainable use of marine biological diversity of areas beyond national jurisdiction under the United Nations Convention on the Law of the Sea
CBD: Convention on Biological Diversity
CRISPR: Clustered regularly interspaced short palindromic repeats
DGR: Dematerialised genetic resources
DNA: Deoxyribonucleic acid
DSD: Digital sequence data
DSI: Digital sequence information
EVD: Ebola virus disease
GI: Genetic information
GMOs: Genetically modified organisms
GROs: Genomically recoded organisms
GRSD: Genetic resource sequence data
GS: Genetic sequences
GSD: Genetic sequence data
GSI: Genetic sequence information
ICC: International Chamber of Commerce
INSDC: International Nucleotide Sequence Database Collaboration
IPR: Intellectual property rights
ITPGRFA: International Treaty on Plant Genetic Resources for Food and Agriculture
IUCN: International Union for Conservation of Nature
NCBI: National Center for Biotechnology Information
NGS: Next generation sequencing
NP: Nagoya Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from their Utilization to the Convention on Biological Diversity
NSD: Nucleotide sequence data
OED: Oxford English Dictionary
OTUs: Operational taxonomic units
PIP: Pandemic Influenza Preparedness Framework of the WHO
PNA: Peptide nucleic acids
POC: Point-of-care
RNA: Ribonucleic acid
SI: Subsidiary information
SMILES: Simplified molecular input line entry specification
WHO: World Health Organisation
1. Executive Summary
At the 14th Conference of the Parties to the Convention on Biological Diversity, four studies related to Digital Sequence Information on Genetic Resources were requested pursuant to decision 14/20, paragraph 11 (b) to (e). This study is the first of those requested: “a science-based peer-reviewed fact-finding study on the concept and scope of digital sequence information on genetic resources and how digital sequence information on genetic resources is currently used building on the existing [Laird and Wynberg] fact-finding and scoping study”.
“Digital Sequence Information” (DSI) is widely acknowledged as a placeholder term for which no consensus on a replacement or precise definition exists to date. This study seeks, firstly, to ensure sufficient technical grounding with which to consider the concept of DSI by explaining the different types of information that can be understood to constitute DSI and providing context as to how this information is generated and used. The flow of information derived from genetic resources is shown in Figure 1, which is a key reference for the reader to understand the technical basis of this study. It builds on the ‘central dogma of molecular biology’ (i.e. the processes in which DNA is transcribed into RNA, which in turn is translated into protein) to explain how the DNA of a genetic resource, whether obtained from a natural source or developed artificially, is processed biologically into metabolites that carry out the tasks and processes within organisms that we understand to be life. It also depicts different types of data that may be associated with a genetic resource and its derivatives, including genomic, transcriptomic, metabolomic, epigenomic and metadata.
We also consider technical improvements and cost reductions in DNA sequencing which have led to next generation technologies that facilitate the sequencing of genomes from a single cell and of entire ecosystems from environmental samples. As a result of these advances, the volume of DNA sequences and related information deposited in large open access databases (collectively, DSI) has grown at an unprecedented pace. Once a genome is sequenced and deposited, its genes can be compared for similarities and differences with hundreds of other genes, which helps in understanding their function and importance. Thus, the power of DSI is in the assembled data, not a single DNA sequence. These capabilities are the driving force behind applications such as gene editing and synthetic biology.
This revolution in genomics has led to greater understanding of the tree of life, the function of genes and the metabolic processes they are associated with. For example, epigenetics provides insights into heritable changes that occur without alteration of the DNA sequence. Transcriptomics reveals which genes are active in organisms and communities of organisms, leading to greater understanding of interactions between organisms. Proteomics shows which proteins are expressed and how they are modified. Metabolomics shows the complement of small molecules in organisms and provides a useful profile of the metabolic activity and health status of organisms. These ‘omics technologies, which are primarily aimed at the detection of genes (genomics), mRNA (transcriptomics), proteins (proteomics) and metabolites (metabolomics) in biological and environmental samples, yield vast amounts of information associated with the underlying genetic resource, as depicted in Figure 1. Additional techniques such as protein structure determination, codon optimization and gene editing rely on this information to enable modification of DNA, RNA and protein sequences in order to optimize expression, function or activity.
Technologies which are enabled by DSI are becoming ubiquitous in life-science related research and industry. Understanding how different types of information are generated and used in this context is essential in clarifying the concept of DSI, so we have chosen the following sectors to highlight: taxonomy and conservation; agriculture and food security; industrial and synthetic biology; healthcare applications and discovery of pharmaceuticals. For each sector we provide a brief overview accompanied by key trends and examples of the application of techniques/technologies which are enabled by DSI in that particular sector. We show the reliance of each of these sectors on DSI, the disruptive nature of technologies/techniques enabled by DSI, and the significant economic footprint of several sectors. The same will be true for many other sectors in the life sciences, and these factors should be considered in the discussions regarding the scope of DSI and in assessing implications arising from the inclusion/exclusion of certain types of information from DSI subject matter.
Having established a technical grounding, the study seeks to clarify subject matter scope and terminology associated with DSI by proposing new logical groups to assist in evaluating the concept of DSI, and by identifying priority issues that will help determine whether certain types of information should be included or excluded from DSI subject matter. During the 2017-2018 inter-sessional period, parties to the CBD and Nagoya Protocol undertook a number of steps to attempt to clarify the concept of DSI. This process did not yield consensus on the appropriateness of the term ‘DSI’ nor what it refers to. These challenges are not unique to the CBD and its Nagoya Protocol, as evidenced by comparable discussions underway in various other UN processes. To help clarify the concept of DSI we consider the flow of information from the utilization of a genetic resource, as depicted in Figure 1. It is evident that at each step the data/information it yields becomes progressively further removed from the original genetic resource. This proximity to the underlying genetic resource and information associated with each step provides a logical basis to group information that may comprise DSI. This gives rise to four alternative groups proposed to define the scope of DSI, summarized as follows:
Group 1 - Narrow: DNA and RNA
Group 2 - Intermediate: (DNA and RNA) + proteins
Group 3 - Intermediate: (DNA, RNA and proteins) + metabolites
Group 4 - Broad: (DNA, RNA, protein, metabolites) + traditional knowledge, ecological interactions, etc.
Group 1 has a narrow scope, with the closest proximity to the genetic resource, and is limited to nucleotide sequence data associated with transcription. Group 2 has an intermediate scope and proximity to the genetic resource and extends to protein sequences, thus comprising information associated with transcription and translation. Two interpretations of the scope of this group are possible: either subject matter is strictly limited to nucleotide and protein sequence data, or it includes information associated with transcription and translation more broadly, for instance functional annotations of genes, gene expression information, epigenetic data and molecular structures of proteins. Group 3 has a wider intermediate scope and proximity to the genetic resource and extends to metabolites and biochemical pathways, thus comprising information associated with transcription, translation and biosynthesis. Group 4 has the broadest scope and the weakest proximity to the underlying genetic resource and extends to behavioural data, information on ecological relationships and traditional knowledge, thus comprising information associated with transcription, translation and biosynthesis, as well as downstream subsidiary information.
We use these four groups to evaluate a broad list of subject matter potentially comprising DSI as proposed in 2018 by the Ad Hoc Technical Expert Group (AHTEG) on Digital Sequence Information on Genetic Resources. We also use these groups to evaluate a range of terms proposed to replace DSI, as shown in Table 4, which is a key reference for the reader to understand the different groups proposed to evaluate the concept of DSI in this study. It is evident from these evaluations that terminology is readily available to describe DSI with narrow subject matter as proposed in Group 1. These terms include Genetic Resource Sequence Data (GRSD); Genetic Sequences (GS); Genetic Sequence Data/Information (GSD/GSI); and Nucleotide Sequence Data (NSD). Consequently, the terms Digital Sequence Data (DSD), Genetic Resource Sequence Data (GRSD) or Genetic Resource Sequence Data and Information (GRSDI) could be used to describe subject matter of intermediate scope as proposed in Group 2, depending on the interpretation adopted. None of the terms proposed to date appears to adequately capture an intermediate range comprising information associated with the additional biosynthesis of a genetic resource as proposed by Group 3. Finally, terminology is also readily available to describe subject matter with broad scope as proposed in Group 4. Overall, the four logical groups proposed in this study provide a nuanced alternative to the 2018 AHTEG list and so may better assist in clarifying the concept and scope of DSI; however, appropriate terminology will need to be evaluated, particularly for the intermediate groups.
The proximity of information to the underlying genetic resource determines whether it is possible to accurately identify or infer the source from which it is derived. This is possible to differing degrees in the case of RNA and protein sequences; however, it becomes much more challenging with biosynthetic information and impossible with subsidiary information. Accordingly, the proximity of data/information has significant implications for traceability to a particular genetic resource and also for identifying the source of information, including whether it has been generated through the utilization of a genetic resource or independently. In a system in which the traceability of DSI is important, a narrow scope of DSI subject matter appears better suited given the technical difficulties in identifying or inferring origin, whereas if traceability is not important a broader scope of subject matter may be accommodated.
The study identifies several key issues, as well as potential solutions, which should be considered and resolved as a priority in order to help clarify the concept of DSI. The first two questions are: 1) how far along the flow from genetic resource onwards to DNA, RNA, protein sequences and metabolites ‘DSI’ can be considered to extend; and 2) whether DSI includes both data and information, and the extent to which data must have been processed before it can be considered information. Both questions can be resolved by adopting one of the four clearly defined proposed groups to clarify the scope of DSI subject matter. The third question is whether certain sequence information should be excluded from the scope of DSI subject matter, including sequences below a certain length, non-coding DNA, epigenetic heritable factors and modified DNA/RNA/proteins. A sequence below 30 nucleotides may not be unique, and this may provide a lower cut-off for sequence length. Non-coding DNA, epigenetic heritable factors and naturally modified DNA/RNA/proteins all have functions, suggesting it might be logical to consider them for inclusion in the DSI subject matter. Conversely, synthetically modified DNA, RNA or proteins cannot be said to have a natural functional role and so on this basis could be considered not to be an inherent part of the underlying genetic resource. Irrespective of whether the logical groups proposed in this study are adopted, it is anticipated that the priority issues identified in this study, and a greater appreciation of the extent to which DSI is used across a range of sectors in the life sciences, will assist the deliberations and the recommendations of the new Ad Hoc Technical Expert Group (AHTEG) on Digital Sequence Information on Genetic Resources, which will consider the studies commissioned pursuant to decision 14/20 at the 14th Conference of the Parties to the Convention on Biological Diversity.
2. Introduction
The 14th Conference of the Parties to the Convention on Biological Diversity requested four studies related to Digital Sequence Information on Genetic Resources.[a] “Digital Sequence Information” (DSI) is widely acknowledged as a placeholder term for which no consensus on a replacement or precise definition exists to date. This study is the first of those requested: a science-based peer-reviewed fact-finding study on the concept and scope of digital sequence information on genetic resources and how digital sequence information on genetic resources is currently used building on the existing fact-finding and scoping study.
The existing fact-finding study referred to in the decision is that by Laird and Wynberg published in 2018 [1]. The Executive Secretary of the Convention on Biological Diversity commissioned the present study with the following two aims: 1) to explain in greater detail what types of information could be understood as DSI and how these are generated in order to help the process of determining what would be the most appropriate term and what it would cover; and 2) to explain how such information is used in different technological applications and life-sciences sectors in order to provide insights into how these might be affected by determinations regarding scope and the inclusion/exclusion of certain types of information from DSI subject matter. These inquiries regarding scope, terminology and the generation/application of different types of information that may potentially comprise DSI[b] will contribute to broader deliberations regarding different approaches for addressing DSI on genetic resources within the framework for access and benefit sharing established under the CBD/Nagoya Protocol.
This study is scientific in scope and does not cover associated policy implications. The work by the project team on this project took over 6 months and included a review of the primary and secondary literature, product documentation and websites belonging to institutes, research projects and companies.
To understand the subsequent discussion on scope and terminology, and to appreciate how this impacts sectors which use information that may comprise DSI, an understanding of the fundamentals of molecular biology and of key developments associated with DNA sequencing and related technologies is essential. Accordingly, this study commences with a scientific background (Section 3) before evaluating different sectors in the life sciences that rely on DSI and technologies/techniques enabled by DSI (Section 4). The study then considers the flow of data and information from a genetic resource and suggests new logical groupings for DSI subject matter, as well as evaluating alternative terminology to replace DSI and identifying priority questions/issues that need to be addressed in order to clarify the concept of DSI (Section 5). Finally, implications for future discussions concerning scope and terminology which arise from this study are considered (Section 6).
[a] Decision 14/20, paragraph 11 (b) to (e), accessible at www.cbd.int/doc/decisions/cop-14/cop-14-dec-20-en.pdf
[b] The scope of DSI is of course yet to be determined; however, for convenience “information that may potentially comprise DSI” shall hereafter be used interchangeably with “DSI”.
3. Scientific Background
The ‘central dogma of molecular biology’ represented in Figure 3 provides a basis for us to explain the structure of DNA and its copying mechanism, followed by the way in which DNA is transcribed into RNA, RNA is translated into proteins, and proteins give rise to metabolites through biosynthesis, including modifications that can occur at different stages of this process. We consider the relentless pace of technological advancement in this field in recent decades starting with DNA sequencing, followed by the ability to edit and engineer genes, the rise of synthetic biology, expansion of the genetic code and the emergence of ‘omics’ technologies, all of which generate or rely on information which may potentially comprise ‘DSI’.
Throughout this section, the reader should refer to Figure 1, which provides a scheme showing the information flow from genetic resource to DNA, RNA, proteins and onwards to derivatives, using different techniques and approaches. This is used to explain how the DNA of a genetic resource, whether obtained from a natural source or developed artificially, is processed biologically, as well as the different types of information that may be associated with a genetic resource and its derivatives, including genomic, transcriptomic, metabolomic, epigenomic and metadata. It builds on the ‘central dogma of molecular biology’ (as explained in Section 3.2 below) to depict the production of metabolites that carry out the tasks and processes within organisms that we understand to be life.
Figure 1. Digital sequence information on genetic resources and derivatives. This figure shows the flow of information derived from genetic resources using different techniques and approaches. This figure is provided as a separate A3 document.
3.1. Discovery of DNA and its composition
The idea that a chemical structure could carry genetic information was suggested in 1944 [2] and in the same year deoxyribonucleic acid (DNA) was identified as the substance responsible for heredity [3]. Subsequently it was discovered that: 1) DNA contains 4 nitrogenous bases, and in any double-stranded DNA, the number of guanine (G) bases is equal to the number of cytosine (C) bases and the number of adenine (A) bases is equal to that of thymine (T) bases; 2) the composition of DNA varies between species [4]. DNA is a double-stranded helical structure (Figure 2). The two DNA strands are also known as polynucleotides as they are composed of simpler monomeric units called nucleotides. Each nucleotide is composed of a deoxyribose sugar, a phosphate group and one of the four nitrogenous bases (A, C, T, G). The deoxyribose sugar and phosphate groups form the backbone of each strand, which resembles the sides of a ladder to which nitrogenous bases are connected. The bases face the centre and each base is connected to and complements the base facing it in the opposite strand to constitute the “rungs” of the ladder: adenine in one strand is always paired with and complements thymine in the other, whereas guanine is always paired with and complements cytosine [5].
Figure 2. Structures of DNA, RNA and nucleotides (modified from reference 6).
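The base-pairing rule described in Section 3.1 can be illustrated with a short Python sketch. The strand below is a made-up example and the function names are illustrative assumptions, not part of any library. The sketch builds the opposite strand from the pairing rule and counts the bases across both strands, so that the equalities A = T and G = C can be checked directly.

```python
from collections import Counter

# Pairing rule from the text: A pairs with T, and G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def opposite_strand(strand: str) -> str:
    """Return the base facing each position of `strand`, position by position."""
    return "".join(PAIRING[base] for base in strand)


def duplex_base_counts(strand: str) -> Counter:
    """Count the bases across both strands of the double helix."""
    return Counter(strand + opposite_strand(strand))


example_strand = "AAGCTTGGA"  # hypothetical sequence, for illustration only
counts = duplex_base_counts(example_strand)
print(opposite_strand(example_strand))  # TTCGAACCT
print(counts["A"], counts["T"], counts["G"], counts["C"])  # 5 5 4 4
```

Because every base on one strand contributes its partner on the other, the two equalities hold for any input strand, whereas the overall proportion of G/C to A/T depends on the sequence and therefore can differ between species.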
3.2. RNA transcription, protein translation and biosynthesis of metabolites
The structure of ribonucleic acid (RNA, Figure 2) is similar to that of DNA but differs in that RNA is a single-stranded molecule, its sugar-phosphate backbone contains a ribose sugar instead of deoxyribose, and it contains uracil [U] instead of thymine [T] [7]. According to the central dogma of molecular biology [8] (Figure 3), DNA directs the formation of RNA, which in turn directs the synthesis of proteins. Those proteins in turn carry out the tasks and processes within organisms that we understand to be life. Many proteins are involved in the production of metabolites, the end products of metabolism.
Transcription. The process by which DNA is copied into RNA is called transcription and is carried out by an enzyme called RNA polymerase. During transcription, a DNA sequence is read by RNA polymerase, which produces a complementary RNA strand (Figure 3). The amino acid sequence in a protein is correlated with the sequence of nitrogenous bases in RNA, and each of the 20 natural amino acids is specified by a three-base sequence of the RNA called a codon. For instance, the three-base codon CCC encodes the amino acid proline while the codon AAA encodes the amino acid lysine [9,10]. Scientists have deciphered all 64 codons of the genetic code, as shown in Table 1. Transcription is the first step in gene expression, and in eukaryotic cells (cells with a nucleus) it is followed by splicing, in which introns (non-coding regions) are removed and exons (coding regions) are joined together. In prokaryotic cells (cells without a membrane-bound nucleus), splicing does not occur. The study of which genes are expressed under given conditions in an organism is called ‘transcriptomics’ (Figure 1).
Figure 3. The ‘Central Dogma of Molecular Biology’ focuses on the processes of transcription and translation.
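As a minimal sketch of the transcription step, the following Python example produces the RNA strand complementary to a DNA template and splits it into codons. The template sequence and the function name `transcribe` are illustrative assumptions; the template is assumed to be written 3'->5' so that the RNA can be produced base by base without reversing the string. Splicing is not modelled.

```python
# RNA base paired against each DNA base of the template strand (U replaces T in RNA).
DNA_TO_RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_strand: str) -> str:
    """Return the RNA complementary to a DNA template strand.

    Assumption: the template is written 3'->5', so the RNA comes out
    5'->3' position by position.
    """
    return "".join(DNA_TO_RNA_COMPLEMENT[base] for base in template_strand)


template = "TACGGGTTTATT"  # hypothetical template strand
rna = transcribe(template)
codons = [rna[i:i + 3] for i in range(0, len(rna), 3)]
print(rna)     # AUGCCCAAAUAA
print(codons)  # ['AUG', 'CCC', 'AAA', 'UAA']
```

The resulting RNA contains the two codons used as examples in the text, CCC (proline) and AAA (lysine), framed by a start codon and a stop codon.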
Translation. The process by which the base sequence information in RNA is converted into an amino acid sequence in proteins is called translation. This process takes place on the ribosomes, which are large complexes of RNA molecules and proteins. Although the codon ‘AUG’ encodes the amino acid methionine, it also activates the ribosome to start the process of making a protein and is thus known as the start codon. Similarly, there are stop codons which signal the termination of translation into proteins. Many amino acids can be encoded by different codons and because of such redundancy, the genetic code is described as degenerate. Translation of a DNA sequence to a protein sequence is straightforward and can be carried out automatically using the standard codon triplets. The reverse process is not: because several codons can encode the same amino acid, it is difficult or impossible to trace a protein sequence back to the original DNA sequence.
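The automatic translation mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not a production tool: the table is built from the codon ordering of Table 1 using one-letter amino acid codes (with '*' for stop), and the helper names `STANDARD_CODE` and `translate` are assumptions made for this example.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes in the order of Table 1; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(rna: str, code: dict = STANDARD_CODE) -> str:
    """Translate from the first AUG start codon until a stop codon is reached."""
    start = rna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = code[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Degeneracy: group the codons by the amino acid they encode.
synonyms = {}
for codon, amino_acid in STANDARD_CODE.items():
    synonyms.setdefault(amino_acid, []).append(codon)

print(translate("AUGCCCAAAUAA"))  # MPK (methionine, proline, lysine)
print(synonyms["L"])              # the six codons that encode leucine
```

Going forwards, each codon maps to exactly one amino acid. Going backwards, an amino acid such as leucine maps to six possible codons, which is why the original nucleotide sequence cannot be recovered from the protein alone.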
As noted above, different codons can code for the same amino acid, so different DNA sequences can lead to the same protein. Different organisms have a ‘preference’ for the use of a particular codon for a specific amino acid. When researchers want to express a specific amino acid, they usually choose, from the available codons, the one preferred by the organism they are studying (or from which a desired substance was obtained); this process is called codon optimisation. Finally, it should be mentioned that some organisms do not use the standard triplet codons; for instance, Tetrahymena uses TAA to encode the amino acid glutamine, whereas in the ‘universal code’ TAA is assigned as a ‘stop’ codon [11].
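Codon optimisation can be sketched as follows. The codon usage values are invented for an imaginary host organism and serve only to illustrate the selection step; real work would use a measured codon usage table for the organism in question. The names `HYPOTHETICAL_USAGE`, `codon_optimise` and `TETRAHYMENA_LIKE_CODE` are assumptions for this example.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Hypothetical relative codon usage for an imaginary host organism.
HYPOTHETICAL_USAGE = {"CCG": 0.6, "CCA": 0.2, "CCU": 0.1, "CCC": 0.1,
                      "AAA": 0.7, "AAG": 0.3, "AUG": 1.0}


def translate_codons(rna, code=STANDARD_CODE):
    """Translate every complete codon in frame; stop codons appear as '*'."""
    return "".join(code[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def codon_optimise(protein, usage, code=STANDARD_CODE):
    """For each amino acid, pick the synonymous codon with the highest usage."""
    chosen = []
    for amino_acid in protein:
        candidates = [codon for codon, aa in code.items() if aa == amino_acid]
        chosen.append(max(candidates, key=lambda codon: usage.get(codon, 0.0)))
    return "".join(chosen)


original = "AUGCCCAAG"
protein = translate_codons(original)                     # MPK
optimised = codon_optimise(protein, HYPOTHETICAL_USAGE)  # AUGCCGAAA
print(original, optimised, translate_codons(optimised))

# Variant code: per the text, Tetrahymena reads UAA (TAA in DNA) as glutamine (Q).
TETRAHYMENA_LIKE_CODE = dict(STANDARD_CODE, UAA="Q")
print(translate_codons("AUGUAACCC"))                         # M*P
print(translate_codons("AUGUAACCC", TETRAHYMENA_LIKE_CODE))  # MQP
```

The optimised sequence differs from the original at the nucleotide level yet encodes the same protein. The last two lines show that the same nucleotide sequence can yield different proteins depending on which genetic code the organism uses.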
Biosynthesis. The process by which proteins give rise to metabolites is called biosynthesis. Biosynthetic enzymes are protein catalysts directing the synthesis of ‘primary metabolites’, which are directly involved in the normal growth, development and reproduction of all organisms (e.g. carbohydrates, proteins, lipids and nucleic acids), as well as ‘secondary metabolites’ or ‘natural products’, which are made by biosynthetic pathways specific to certain species (e.g. venoms, toxins and antibacterial agents). Metabolites can be simple (e.g. the commonly consumed alcohol ethanol, a primary metabolite) or complex (e.g. the plant-derived anti-cancer agent paclitaxel, a secondary metabolite/natural product). In many cases genes encoding the biosynthetic enzymes of specific metabolites are clustered together to ensure the optimum production of a metabolite.
Table 1. The genetic code. Amino acids may have more than one triplet codon. Some codons are assigned ‘start’ and ‘stop’ functions which start/stop the process of translation by the ribosome to generate a protein from an RNA sequence. Rows give the first base and columns the second base; within each cell the third base runs in the order U, C, A, G.
First base | Second base U | Second base C | Second base A | Second base G
U | UUU, UUC: Phenylalanine; UUA, UUG: Leucine | UCU, UCC, UCA, UCG: Serine | UAU, UAC: Tyrosine; UAA, UAG: Stop | UGU, UGC: Cysteine; UGA: Stop; UGG: Tryptophan
C | CUU, CUC, CUA, CUG: Leucine | CCU, CCC, CCA, CCG: Proline | CAU, CAC: Histidine; CAA, CAG: Glutamine | CGU, CGC, CGA, CGG: Arginine
A | AUU, AUC, AUA: Isoleucine; AUG: Methionine / Start | ACU, ACC, ACA, ACG: Threonine | AAU, AAC: Asparagine; AAA, AAG: Lysine | AGU, AGC: Serine; AGA, AGG: Arginine
G | GUU, GUC, GUA, GUG: Valine | GCU, GCC, GCA, GCG: Alanine | GAU, GAC: Aspartic acid; GAA, GAG: Glutamic acid | GGU, GGC, GGA, GGG: Glycine
3.3. Natural and synthetic modifications to DNA, RNA and proteins
DNA, RNA and proteins are frequently modified in nature to allow them to carry out a range of different functions. The types of possible modifications are summarised in Figure 4. Firstly, a natural (‘wild type’) DNA sequence can be substituted with a non-natural (‘recombinant’) one while maintaining function. Secondly, the range of nucleotides and amino acids in DNA and proteins can be expanded to include non-natural ones, and it is also possible for the subunits to be modified so extensively that they are no longer regarded as nucleotides or amino acids. Finally, modifications to natural DNA and proteins may be made after they have been formed, for example DNA that has been subjected to epigenetic modification. We will now consider each of these possibilities in turn.
Figure 4. Different types of modification that can be made to DNA and protein sequences.
3.3.1. DNA sequence modifications. As discussed in Section 3.2, DNA sequences can be codon optimised to allow efficient protein expression in another organism (Figure 4 a.). A codon optimised DNA sequence can express a natural protein despite differing from the ‘wild-type’ DNA on which it is based. The codon optimised DNA sequence is now unnatural, and tracing it back to the originating sequence will be complex, if not impossible, because most amino acids are encoded by multiple different codons. However, the protein sequence derived from these sequences will still be traceable as it remains the same, no matter which codons were used in its translation. In addition, DNA sequences can be designed and synthesised to generate wholly new proteins with known or novel functions.
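The difficulty of tracing a codon optimised sequence back to its origin can be made concrete by counting how many distinct DNA coding sequences translate to the same protein. The sketch below assumes the standard genetic code of Table 1 (written with T instead of U for DNA); the peptides are toy examples and the function names are illustrative.

```python
from itertools import product
from math import prod

BASES = "TCAG"  # DNA alphabet, same ordering as Table 1 with T in place of U
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
DNA_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Group codons by the amino acid they encode.
CODONS_FOR = {}
for codon, amino_acid in DNA_CODE.items():
    CODONS_FOR.setdefault(amino_acid, []).append(codon)


def count_encodings(protein: str) -> int:
    """Number of distinct DNA coding sequences that translate to `protein`."""
    return prod(len(CODONS_FOR[aa]) for aa in protein)


def all_encodings(protein: str) -> list:
    """Enumerate every coding sequence (only practical for short peptides)."""
    return ["".join(codons) for codons in product(*(CODONS_FOR[aa] for aa in protein))]


def translate(dna: str) -> str:
    """Translate every complete codon in frame."""
    return "".join(DNA_CODE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


peptide = "MPK"  # methionine, proline, lysine: a toy example
variants = all_encodings(peptide)
print(count_encodings(peptide))          # 8
print({translate(v) for v in variants})  # {'MPK'}
print(count_encodings("MKLSR"))          # 432
```

Even a five-residue peptide can have hundreds of possible coding sequences, and the number grows multiplicatively with length. All of them translate to the same protein, which is why the protein remains traceable while the nucleotide sequence does not.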
3.3.2. Nucleotide modifications. A nucleotide is composed of 3 elements: a phosphate, a sugar and a base, each of which can be modified (Figure 4 b.). There are several ways in which this has been achieved, for instance by replacing the phosphate or sugar units with modified versions [12]. More radical is the complete replacement of the phosphate-sugar backbone with one made from an amino acid chain, resulting in ‘peptide nucleic acids’ (PNA). These are no longer nucleotides but form a double helix and can carry and transfer the genetic code, like DNA [13]. It has also recently been possible to expand the number of bases from two pairs (G/C & A/T) to four pairs of complementary bases using synthetic nucleotides, and to develop a test-tube system that allows these to be transcribed to RNA, thus expanding the density of information that DNA can encode [14]. The recent World Intellectual Property Organization Standard ST.26 ‘Recommended standard for the presentation of nucleotide and amino acid sequence listings using XML’ uses a similar definition for nucleotides and their modifications.[c]
3.3.3 Epigenetic modifications. These allow heritable changes to be made to DNA without altering the original sequence (Figure 4 c.). There are multiple ways in which DNA can be prevented from being transcribed to RNA, such as via repressor proteins that attach themselves to specific regions of DNA, or the modification of histones, around which DNA winds when packed in chromosomes, thus preventing the unwinding and expression of a gene. Most relevant here is the methylation of DNA: if this happens in a gene promoter, it suppresses gene transcription. Methylation data are not straightforward to obtain, yet methylation controls aspects of gene expression and hence phenotype. As science advances, it may become possible to rapidly predict epigenetic methylation patterns. The study of all the epigenetic modifications in an organism is termed ‘epigenomics’ (Figure 1).
3.3.4 Protein modifications. Proteins are often modified through metabolic processes, with the simplest modification being the addition of phosphate (‘phosphorylation’), which commonly acts as a switch that turns a protein’s activity on or off; there are many others, such as acetylation and glycosylation (Figure 4 d.). Such ‘post-translational’ modifications are common and often happen at predictable sites, such as a particular amino acid, but at other times are hard to predict.
Smaller proteins up to ~50 amino acids long are termed ‘peptides’, and these can contain heavily modified natural amino acids, with more than 200 reported compared to the 20 natural (‘canonical’) amino acids. There are two important routes by which such peptides can be formed: the so-called RiPPs (ribosomally produced and post-translationally modified peptides) and NRPS (non-ribosomal peptide synthesis). RiPPs rely on a series of enzymes that modify amino acid sequences produced by translation of RNA, whereas NRPS products are generated by a complex of enzymes that generate non-canonical amino acids and combine them into metabolites.
3.4 DNA sequencing technologies
Sanger sequencing, introduced in the 1970s, allowed stretches of DNA (100-1000 base pairs) to be accurately sequenced.[15] Longer strands of DNA are subdivided into fragments that are sequenced separately, and these sequences are then assembled to give the overall sequence. Methods of DNA sequencing that involve randomly breaking up DNA into many small pieces and then reassembling the sequence by looking for regions of overlap using bioinformatics are sometimes referred to as ‘shotgun sequencing’.[16] This method was successfully used in 1982 to sequence the genome of the bacterial virus bacteriophage λ (48,502 base pairs).[17] The first commercial sequencer was introduced in 1987, which enabled the Human Genome Project to be launched in 1990. This project catalysed the development of cheaper, higher-throughput and more accurate platforms, known as next generation sequencers. These new platforms have increased the speed of sequencing remarkably. They differ in read length, data output, quality and cost; Table 2 compares the most widely used techniques today.
[c] https://www.wipo.int/export/sites/www/standards/en/pdf/03-26-01.pdf
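The reassembly step of shotgun sequencing, in which fragments are joined by looking for regions of overlap, can be sketched as follows. This is a toy greedy merger written for illustration only: the reads are invented, the minimum overlap of three bases is an arbitrary assumption, and real assemblers use far more sophisticated methods that also handle sequencing errors and repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads

# Invented fragments of a 15-base sequence, supplied in arbitrary order.
fragments = ["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]
print(greedy_assemble(fragments))
```

With these three fragments the overlaps are unambiguous, so the sketch recovers a single contiguous sequence; with repetitive DNA a greedy strategy can join the wrong pieces, which is one reason read length matters when comparing platforms.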
The error rate of DNA sequencing (Table 2) may make it difficult to distinguish whether a change in a DNA sequence is due to an error in sequencing or a consequence of natural variation. In some cases, errors can be corrected using bioinformatic tools; once this has been done, the remaining differences are likely to be due to natural variation. Errors in sequencing can also be reduced by ensuring adequate or increased coverage or depth.[d][18]
[d] Depth is the average number of times that a particular nucleotide location is represented in a collection of random raw sequences.
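As a rough illustration of depth as defined in footnote d, the average depth can be estimated from the number of reads, the read length and the genome size. The sketch below assumes reads are spread uniformly across the genome; the function names and the example figures (150-base reads, 30-fold depth) are illustrative assumptions, not values taken from this study.

```python
import math

def sequencing_depth(n_reads, read_length, genome_size):
    """Average number of times each base is covered, assuming uniform reads."""
    return n_reads * read_length / genome_size

def reads_needed(target_depth, read_length, genome_size):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_depth * genome_size / read_length)

# One million 150-base reads over a 5-megabase bacterial genome.
print(sequencing_depth(1_000_000, 150, 5_000_000))

# Reads required for 30-fold depth of a 3-billion-base genome.
print(reads_needed(30, 150, 3_000_000_000))
```

Because depth scales with read length, platforms producing short reads need many more reads than long-read platforms to reach the same depth.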
Table 2. Comparison of currently available NGS platforms: method of sequencing, typical read length, accuracy of a single read, reads per run, time per run, cost per million bases and typical applications. Data are taken from Ref. 19.
| Method | Read length (bp) | Accuracy of single read | Reads per run | Time per run | Cost per 1 million bases ($) | Applications |
|---|---|---|---|---|---|---|
| Single molecule real-time | 10,000-15,000 | 87% | 50,000 | 30 min to 4 hrs | 0.06 | Small scale, specific research questions / resolving high GC regions |
| Ion Torrent | < 400 | 98% | > 80 million | 2 hours | 1 | Bacterial genomes / large sequencing projects |
| SOLiD | 80-100 | 99.9% | 1.2-1.4 billion | 1-2 weeks | 0.13 | Metagenomics |
| MiSeq | 50-600 | 99.9% | 1-25 million | 1-11 days | 0.05-0.15 | Bacterial genomes / large sequencing projects |
| Pyro 454 | 700 | 99.9% | 1 million | 24 hours | 10 | Metagenomics |
| Sanger | 400-900 | 99.9% | N/A | 20 min to 3 hrs | 2400 | Small scale, specific research questions / resolving high GC regions |
The new advances in sequencing technologies were also associated with a sharp decrease in the cost of sequencing. This can be seen clearly in the Cost per Genome graph generated by the National Human Genome Research Institute (NHGRI) (https://www.genome.gov/), which has tracked the cost of genome sequencing at the sequencing centres it funds since 2001 (Figure 5). In this graph, two parameters were considered: 1) the size of the genome, assumed to be 3 billion base pairs (i.e. the size of the human genome); and 2) the required ‘sequence coverage’, which is the number of reads that include a given base and is needed to overcome errors in the assembly of the genome. The latter differs among sequencing platforms depending on the average sequence read length for each platform.
This lowering of costs has made sequencers accessible to many researchers, either in their own lab or via larger-scale sequencing facilities. This has led to a large increase in the amount of sequence data available, and to an increased need to interpret these data using bioinformatics. The latter is an interdisciplinary field which uses computer programming to analyse biological data. Common uses include searching for specific sequences/genes, or aligning homologous sequences to identify mutations and/or predict gene function. Bioinformatics is also used for comparative genomic studies, in which the genomic features of different organisms are compared in order to trace the evolutionary processes responsible for the divergence of the genomes.
A single gene sequence cannot be analysed without reference to databases in which function has been ascribed to genes through a great deal of laboratory-based research. It is the collection of all DNA sequence data in the databases that has value and can influence the direction of ongoing research, as researchers can compare sequences from different organisms and predict functions for genes which may be utilised commercially. The analysis of sequencing data is known as annotation, in which researchers use different techniques to identify the locations and functions of genes and other coding regions in the genome.
Figure 5. The significant reduction in the cost of genome sequencing over time. Moore’s law is an observation and projection of a historical trend. It asserts that the number of transistors on a microchip doubles about every two years while the cost of computers is halved. The cost of sequencing has clearly decreased at a much faster rate than that seen with computers. (Source: https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost).
3.5 DNA Sequencing
Continuous improvements in sequencing technology made it possible to sequence whole genomes of increasingly complex organisms, starting with Haemophilus influenzae Rd (1995, 1.8 megabase pairs),[20] followed by the fruit fly, Drosophila melanogaster (2000, 120 megabase pairs),[21] the first mammal, the mouse (2002, 2,700 megabase pairs),[22] and plant genomes such as rice, Oryza sativa indica and Oryza sativa japonica (2002, 430 megabase pairs).[23,24] However, our knowledge of the genomes of the world’s eukaryotic biodiversity is very limited: in 2017 only 2,534 unique species in the NCBI database had sequenced genomes, representing less than 0.2% of the known species. Of these, only 25 species have genomes at the highest level of quality proposed for reference genomes.[25]
Shorter DNA sequences termed ‘DNA barcodes’ are used to identify a given species by comparing nucleotide sequences in its DNA with those of the same regions/genes in other species. When barcoding is used to identify organisms from a sample containing DNA from more than one organism, the term ‘DNA metabarcoding’ is used.
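The comparison underlying barcoding can be sketched in a few lines of Python. The reference library below is entirely invented (three hypothetical species with made-up 20-base barcodes of equal length), and the sketch scores simple position-by-position identity; real barcode identification relies on curated reference databases and proper sequence alignment.

```python
# Hypothetical reference library: species name -> barcode sequence.
REFERENCE_BARCODES = {
    "species_a": "ACGTACGTTAGCCGATAGCT",
    "species_b": "ACGTTCGTTAGGCGATCGCT",
    "species_c": "TCGAACGATAGCCGTTAGCA",
}

def percent_identity(a, b):
    """Percentage of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length in this sketch")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def identify(query, reference):
    """Return the best-matching reference species and its percent identity."""
    best = max(reference, key=lambda name: percent_identity(query, reference[name]))
    return best, percent_identity(query, reference[best])

# Barcoding: a single query sequence from one organism.
print(identify("ACGTACGTTAGCCGATAGCA", REFERENCE_BARCODES))

# Metabarcoding: the same comparison applied to every read in a mixed sample.
mixed_sample = ["ACGTACGTTAGCCGATAGCT", "TCGAACGATAGCCGTTAGCA"]
print(sorted({identify(read, REFERENCE_BARCODES)[0] for read in mixed_sample}))
```

The quality of the identification depends on the reference library: a query from a species that is absent from the library will still be assigned to its nearest neighbour, so the identity score needs to be inspected as well as the name.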
In some cases, the sample from the organism may contain a heterogeneous mixture of cells, as is the case with marine sponges that contain many uncultured microbial symbionts. In these cases, the DNA extracted is a mixture of the genomes derived from the sponge and all the symbionts and is called a metagenome. Analysis of metagenomic data involves a process called binning, in which sequencing reads are grouped and assigned to a group of organisms. Metagenomics is also applied to DNA samples recovered directly from the environment, known as environmental DNA or eDNA. eDNA is collected from a variety of environmental samples such as soil, seawater, snow or even air, rather than sampled directly from an individual organism. As various organisms interact with the environment, DNA is expelled and accumulates in their surroundings. An example is the DNA fragments left behind by marine organisms in sea water.[26,27]
Recent technical improvements in next generation sequencing (NGS) technologies have allowed the genome of a single cell to be sequenced, providing access to previously inaccessible and extremely valuable information about the function of an individual cell in the context of its microenvironment.[28] Single cell sequencing has been particularly useful to the field of metagenomics.[29]
The volume of published DNA sequences has continued to grow at an unprecedented pace. These data are deposited and maintained in large open access databases and collectively constitute the digital sequence information (DSI) on genetic resources. For a better understanding of sequence databases, see studies 2/3 commissioned by the CBD Secretariat in parallel to this study.
3.6 Genetic engineering
Sequences deposited in databases are not limited to the information obtained from the different genome/metagenome sequencing projects. The possibility of engineering the genomes of organisms led to the inclusion of unnatural sequences in these databases. The advent of genetic engineering in the 1970s made it possible to transfer genes within and across species boundaries and to introduce mutations into DNA sequences, producing organisms with improved characteristics, e.g. crops that tolerate herbicides or resist pests. The resulting entities are widely known as genetically modified organisms (GMOs). However, under the Convention on Biological Diversity and in this report, the term Living Modified Organisms (LMOs) is used instead to refer to these entities. Genes can be readily copied using the polymerase chain reaction, and editing can be achieved using techniques such as site-directed mutagenesis, which modifies a single base in a sequence, or by cutting and splicing larger DNA sequences using editing enzymes.
A recently discovered and very accurate form of gene editing is provided by CRISPR (clustered regularly interspaced short palindromic repeats), a family of DNA sequences found within the genomes of bacteria, which form part of the bacterial defence system against invading viruses.[30] This defence system involves two critical components. The first is the “seeker”, an RNA encoded in the bacterial genome that matches and complements the DNA of the viruses and is thus able to recognise and bind to the viral DNA during an attack. The second element is the “hitman”: once the viral DNA is recognised as foreign, a bacterial nuclease named Cas9 is deployed to cut the DNA of the virus. This system was found to be programmable: by substituting the recognition element, it can be redirected to cut other genes and genomes at specific sites. The system has been further manipulated to make edits in the genome at the cut site. This was based on knowledge of how a gene is repaired after being cut. Typically, the cell tries to recover any information lost from a cut gene by using another copy of the gene in the cell. If the cell is given a DNA fragment that has a slightly different sequence from the gene, there is a high probability that the information written on this fragment is copied permanently into the genome.[31]
3.7 Synthetic biology
Synthetic biology is a further development and new dimension of modern biotechnology that combines science, technology and engineering to facilitate and accelerate the understanding, design, redesign, manufacture and/or modification of genetic materials, living organisms and biological systems. For example, synthetic biology can transform a biological cell into an industrial biofactory[36] using complex biological systems or circuits built from standard interchangeable DNA parts that have defined functions such as regulating transcription, regulating translation, binding small molecules, coding proteins etc. The BioBricks Foundation (https://biobricks.org) maintains a registry of standard parts that can be freely used by synthetic biology researchers, who co-opt these parts and engineer them for use in applications outside of their natural settings.[32] Several enabling tools and technologies for synthetic biology have been identified, including genomic databases, public and private registries of biological parts, standard methods for physical assembly of DNA sequences, commercial services for DNA synthesis and sequencing, and advances in bioinformatics.[16] DNA synthesis can readily be used to generate stretches of up to 5,000 base pairs; longer ones can be created by splicing together shorter sections using gene splicing techniques.
Technologies have been developed to expand the genetic code and to allow the incorporation of unnatural amino acids into proteins. Genetic code expansion offers the possibility to directly encode these modifications and to produce a modified protein.[33] One strategy, called codon assignment, involves using genetic engineering to reallocate one or more of the specific redundant natural codons (Table 1) to encode an unnatural amino acid. The resulting organisms are called genomically recoded organisms (GROs).[34] Another strategy involves engineering the ribosome to incorporate unnatural amino acids into proteins in response to a four-base ‘quadruplet’ codon.[35] It should be noted that the term GRO could be considered to fall under the much broader term LMO.
3.8 Techniques and databases used to study RNA, proteins and metabolites
RNA transcripts are an indication of which genes are active and which are dormant at any given time or under any given set of conditions in an organism, and are studied using ‘transcriptomics’. Proteins and metabolites are downstream products of translation and biosynthesis respectively (Figure 1). They fulfil important roles in an organism’s metabolism and can be studied using a huge variety of techniques, too many to cover in detail in this report. These techniques are grouped under ‘proteomics’ and ‘metabolomics’ below, each of which could fill an entire textbook. Below, we give a brief overview of the aspects of these three ‘omics’ technologies relevant to this report.
3.8.1 Transcriptomics. The transcriptome of an organism is a measure of which genes are expressed under any given set of conditions. As with DNA sequencing, high-throughput RNA sequencing can be used to determine the total population of RNAs in a sample. Alternatively, microarrays (DNA/RNA chips, biochips) can be used; these are microscope slides printed with thousands of tiny spots in defined positions, with each spot containing a unique, known DNA sequence. These oligonucleotides act as probes to detect thousands of different transcripts simultaneously, relying on quantitative fluorescence from labelled cDNA synthesized from the sample RNA. The output from these techniques is therefore RNA sequences or an indication of which genes are transcriptionally activated.
3.8.2 Proteomics. The proteome of an organism is the totality of all proteins that are produced or modified by an organism. In principle, protein sequences produced by an organism can be predicted from the genome, but the proteome of an organism changes under different growth conditions and stresses, amongst other factors. It is therefore important to measure the actual proteome in an organism using experimental techniques to enable a full understanding of the organism’s metabolism and expression. While there are many techniques available to achieve this, including antibody-based techniques and protein microarrays, most relevant here is the use of mass spectrometric techniques. Mass spectrometry enables the rapid sequencing of single proteins and peptides by measuring the mass of the intact protein/peptide and its fragments. These masses can be searched in online databases (e.g. https://www.uniprot.org/) to determine the sequence of amino acids in a protein, and to compare its similarity to other proteins in the database. Protein masses in the database can be obtained by experiment, or by calculation using the masses of each individual amino acid present in a protein/fragment sequence. The large number of proteins present in a proteome means that some sort of separation technique is necessary prior to mass spectrometry.
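The calculation of a peptide mass from the masses of its individual amino acids can be sketched as below. The residue masses are standard monoisotopic values rounded to five decimal places; the function name and example peptides are illustrative, and the sketch covers unmodified peptides only (no post-translational modifications, and no fragment masses).

```python
# Monoisotopic masses (Da) of the 20 canonical amino acid residues.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # one water per peptide accounts for the free N- and C-termini

def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

print(round(peptide_mass("GG"), 4))       # glycylglycine
print(round(peptide_mass("PEPTIDE"), 4))  # arbitrary example sequence

# Leucine and isoleucine have identical masses, so mass alone cannot
# distinguish peptides that differ only by an L/I substitution.
print(peptide_mass("AL") == peptide_mass("AI"))
```

This also hints at why fragment masses and database comparison are needed in practice: different sequences can share the same intact mass, so a single mass rarely identifies a protein unambiguously.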
Some definitions of proteomics, sometimes termed ‘structural proteomics’, also include the three-dimensional structures of the individual proteins. The structure of a protein is determined by its amino acid sequence alone,[36] without the need for additional genetic information. A protein may adopt a ‘native’, active, correctly folded structure or alternative, inactive ‘misfolded’ structures. In principle, therefore, it is possible to predict the structure of a protein given only its amino acid sequence, although it is currently still very difficult to do this reliably.[37] For this reason most protein molecular structures are determined using x-ray crystallography, acquisition of spectroscopic data or cryo-electron microscopy, all of which rely on complex and expensive infrastructure and require considerable computational power. Structural data on DNA, RNA and proteins can be found in freely accessible databases such as the Protein Data Bank (https://www.rcsb.org/), which give atom coordinates for these structures and associated metadata, and link out to papers describing their function.
3.8.3 Metabolomics. This is the study of the full complement of small molecule metabolites produced by an organism’s metabolism under a certain set of conditions. The metabolome contains a large range of different types of molecule with varying characteristics, produced at widely different concentrations. Profiling the metabolome is therefore very different from DNA/RNA sequencing (Section 3.4) and proteomics, and relies on high-resolution separation techniques to separate each metabolite before measuring its mass and fragments using mass spectrometry. The measured data can be analysed using a database such as METLIN (https://metlin.scripps.edu), which uses these mass spectrometric data to identify each metabolite present, with a stated degree of certainty. Statistical techniques are important for studying how the metabolome changes when conditions change. For instance, they can be used to measure the influence of a toxin or the effect of a gene modification on a metabolome. Metabolites that are not present in the database are more complicated to identify: the metabolite will need to be obtained in its pure form and its chemical structure defined using spectroscopic or other techniques.[38]
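The database lookup step can be pictured as matching a measured mass against reference masses within a tolerance. The sketch below is a simplified stand-in and not the METLIN interface: the three-entry reference table, the 5 ppm tolerance and the function names are assumptions made for illustration, and it compares neutral monoisotopic masses, whereas instruments measure ions and real searches also use fragment data.

```python
# Tiny illustrative reference table: metabolite -> neutral monoisotopic mass (Da).
REFERENCE_MASSES = {
    "glucose": 180.06339,
    "citric acid": 192.02700,
    "caffeine": 194.08038,
}

def ppm_error(measured, theoretical):
    """Signed mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6

def match_metabolite(measured, reference, tolerance_ppm=5.0):
    """Return (name, ppm error) for all entries within tolerance, best first."""
    hits = [
        (name, ppm_error(measured, mass))
        for name, mass in reference.items()
        if abs(ppm_error(measured, mass)) <= tolerance_ppm
    ]
    return sorted(hits, key=lambda hit: abs(hit[1]))

print(match_metabolite(194.0805, REFERENCE_MASSES))  # close to caffeine
print(match_metabolite(250.0000, REFERENCE_MASSES))  # nothing in the table
```

An empty result corresponds to the situation described above, where a metabolite is absent from the database and has to be purified and characterised by other means.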
3.8.4 Databases. Once generated, data derived from genetic resources are deposited in many databases. These data include: the metadata of the genetic resource (e.g. the taxonomy of the organism and its geographical origin); the nucleotide sequence data of DNA and RNA (genomic/transcriptomic data); the amino acid sequences of proteins (proteomic data), complemented by structural data on different proteins as determined by x-ray crystallography, cryo-electron microscopy or nuclear magnetic resonance; the data on metabolites isolated and identified from any organism using different spectroscopic techniques and mass spectrometry (metabolomic data); and the epigenomic data, which include, for example, the pattern of DNA methylation or histone acetylation.
4. Sectors that rely on DSI and technologies/techniques enabled by DSI
4.1 Introduction
The 2018 Laird and Wynberg study[1], together with the synthesis of views[e] and accompanying case studies prepared by the Secretariat to the CBD, addressed the potential implications of DSI on the objectives of the CBD.[f] These provided comprehensive examples of the different contexts and purposes for which DSI can be used. Whereas Section 3 of this study complements these efforts by providing greater technical context concerning the generation and use of DSI, this section focuses on specific sectors in the life sciences which have been (and continue to be) transformed or enabled by this relatively recent revolution.
[e] Synthesis of views and information on the potential implications of the use of digital sequence information on genetic resources for the three objectives of the Convention and the objective of the Nagoya Protocol (CBD/DSI/AHTEG/2018/1/2); available at https://www.cbd.int/doc/c/49c9/06a7/0127fe7bc6f3bc5a8073a286/dsi-ahteg-2018-01-02-en.pdf
[f] Case studies and examples of the use of digital sequence information in relation to the objectives of the Convention and the Nagoya Protocol (CBD/DSI/AHTEG/2018/1/2/ADD1); available at https://www.cbd.int/doc/c/7a1d/3057/f5fa0ecb0734a54aadd82c01/dsi-ahteg-2018-01-02-add1-en.pdf
As considered in Section 3.8, emerging technologies enabled by DSI are becoming ubiquitous in life-science related research and industry, particularly ‘omic’ technologies, which are primarily aimed at the detection of genes (genomics), mRNA (transcriptomics), proteins (proteomics) and metabolites (metabolomics) in biological and environmental samples. These technologies have a broad range of applications across scientific disciplines, and we have chosen the following sectors to highlight: taxonomy and conservation; agriculture and food security; industrial and synthetic biology; and healthcare applications and discovery of pharmaceuticals. A comprehensive analysis of each sector is beyond the mandate for this Study, so for each sector we provide a brief overview accompanied by coverage of key trends and examples highlighting the use of DSI and technologies which are enabled by DSI in that sector. For convenience, Table 3 facilitates a sectorial comparison of the application of the different techniques outlined in Figure 1 and discussed in Section 3.8 (genomics, transcriptomics, proteomics, metabolomics and epigenomics) in each sector.
4.2 Taxonomy and conservation
4.2.1 Overview of sector. DNA barcodes and longer DNA sequences, together with DNA sequence databases, allow rapid identification of unknown species. This process can now take days, whereas species identification using morphological methodology can take many months and relies on type specimens held in national collections as well as taxonomic expertise, which is becoming rarer. In addition to assisting in the discovery of new species, DNA barcodes have a broad range of applications including biodiversity conservation, observing seasonal effects and effects of climate change on species distributions, as well as correcting mistaken identification and labelling on foods and plant-based medicines. A global effort is underway to catalogue all life on earth,[25] which aims to improve our understanding of ecosystems, evolution, ecosystem services and biological assets. The project is complemented by the CBD’s Global Taxonomy Initiative (https://www.cbd.int/gti/), a cross-cutting effort coordinated by the CBD to ensure that taxonomic information and expertise is available to CBD parties.
4.2.2 Key trends and examples
Evaluating biodiversity loss. The greater availability of DNA barcodes for many species could assist biodiversity surveys, enable effective conservation measures to be implemented[39] and new species to be identified.[40]
Evaluating biodiversity response to climatic events.[41] Species richness and assemblage changes in response to climatic events can be measured using metabarcoding. This requires sequencing of environmental DNA present in the sample, followed by a comparison to available DNA barcodes for a range of species.
Species identity and labelling. Fish sold can be mislabelled, either through accidental misidentification or wilful mislabelling of species. Misidentification can occur when two species are superficially similar, and can be rectified using DNA barcodes and the construction of a phylogeny to show relatedness.[42] DNA barcodes are used on fish sold to ensure that lower value or endangered species are not substituted.[43]
4.3 Agriculture and food security
4.3.1 Overview of sector
Agrifood is an $8,000 billion per year industry, with early-stage investment in agrifood tech startups reaching $10.1 billion in 2017. Agrifood can be split into two parts. ‘Agritech’ refers to technologies that target farmers. ‘Foodtech’, by contrast, targets manufacturers, retailers, restaurants and consumers. Jointly, the two have enough reach to impact every part of the production line, from farm to fork.[44] Agriculture uses a broad range of ‘omic’ techniques, principally to modify crops, create new varieties, and manage agricultural practices. Genetic modification of crops and livestock gives these unique traits such as herbicide resistance. Other methods of optimising productivity can be developed using techniques such as marker-assisted selection.
4.3.2 Key trends and examples
Selective breeding. Marker-assisted selection can be used to select traits such as pathogen resistance in crops or parasite resistance in livestock. A high-density map of molecular markers for the tomato contains 40 resistance markers, which allowed rapid selection of resistant breeds.[45] Complete genetic maps identifying parasite resistance traits in dairy cattle are a first step to breeding cattle resistant to parasites.[46]
Development and characterization of LMOs. In 2014, half of all LMO crops planted were soybeans modified for herbicide tolerance or insect resistance. A bacterial gene incorporated into the soybean plant confers tolerance to the herbicide glyphosate. Thus, producers can chemically control weed species during the growing season. Near-future LMO varieties are being developed with data from transcriptomic, proteomic, epigenomic and metabolomic experiments.[47-51] In addition, gene editing techniques rely on genomic sequences to create minute changes, conferring traits similar to ‘traditional’ LMOs or for use in rapid or de novo domestication.[52-54]
Soil metagenomics. Understanding the soil microbial communities that carry out key ecosystem services may be achieved by metagenomic analysis, which identifies the composition and diversity of these communities. A new frontier, ‘metaphenomics’, looks into the actual functions carried out by viable and active cells under given environmental conditions.[66]
4.4 Industrial and synthetic biology
4.4.1 Overview of sector
Industrial biotechnology provides alternative methods to generate industrial products via processes that can be carried out in water, at ambient temperatures, without producing large volumes of waste. The global market is growing at 9% per year and is expected to reach $576.9 billion by 2026, with almost 40% attributed to bioenergy and a large proportion of the remainder to renewable chemicals, such as solvents and biodegradable plastics. Biotechnology is heavily reliant on genetic resources for the discovery of new products and processes.[56]
Synthetic biology is a novel area of research that amalgamates multiple disciplines such as molecular biology, biotechnology, biophysics and genetic engineering, amongst others (see Section 3.7). The global synthetic biology market was valued at $5.25 billion in 2015 and can be segmented as indicated below, with applications across many industries, including pharmaceuticals, diagnostics, energy, bioplastics and environment[57]:
o by products: Synthetic DNA/genes; Software tools; Chassis/host organisms; Synthetic clones; Synthetic cells
o by technology: Nucleotide synthesis and sequencing; Bioinformatics; Microfluidics; Genetic engineering
o by application: Pharmaceuticals and diagnostics; Chemicals; Biofuels; Bioplastics; Others (Environment, agriculture & aquaculture)
4.4.2 Key trends and examples
3
Laundry Detergents. Low temperature laundry detergent enzymes (proteins) are developed by
4
analyzing and modifying genes from low temperature adapted microorganisms.
58,59
The three-
5
dimensional structure of the enzyme is used to identify ‘hotspots’ where amino acid modifications
6
may have the greatest effect. The gene encoding this enzyme can then be modified, resulting in the
7
desired change.
8
Production of Bioethanol. Related genes from different organisms can be ‘shuffled’ to produce
9
‘chimeric’ enzymes. These can be tested to determine if they have increased productivity, in this
10
case the production of bioethanol.
60
These genes can be reshuffled until enzyme activity is
11
optimized. Shuffled genes that express chimeric enzymes are difficult to trace back to an originating
12
DNA sequence as this is a product of the gene families used and the shuffling process.
13
Production of Therapeutic and High-value Compounds. Bacterial, fungal and plant systems are now modified to produce therapeutic and high-value compounds through the introduction of multistep biosynthetic pathways.[61-64] For example, a precursor to the antimalarial artemisinin can now be produced using a synthetic biology process.[65] Process development relied on detailed knowledge of the DNA sequence directing the production of artemisinin in the plant, related genes in other organisms and whole genome sequences of alternative hosts, which were engineered for this purpose. Codon optimization was used extensively.
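Codon optimization can be sketched in a few lines of Python. The example below is a simplified illustration, not the method used in the artemisinin work: it swaps each codon for a synonymous codon from a preference table. The preference table is hypothetical and invented for this sketch; in practice such tables are derived from the codon usage of the chosen host. Only a small slice of the standard genetic code is included.

```python
# Slice of the standard genetic code covering Met, Lys, Leu, Ser and stop.
CODON_TO_AMINO_ACID = {
    "ATG": "M",
    "AAA": "K", "AAG": "K",
    "TTA": "L", "TTG": "L", "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S", "AGT": "S", "AGC": "S",
    "TAA": "*", "TAG": "*", "TGA": "*",
}

# HYPOTHETICAL host preferences (assumption, for illustration only).
PREFERRED_CODON = {"M": "ATG", "K": "AAA", "L": "CTG", "S": "AGC", "*": "TAA"}


def split_codons(dna):
    """Split a coding sequence into consecutive three-letter codons."""
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def translate(dna):
    """Translate a coding sequence using the table slice above."""
    return "".join(CODON_TO_AMINO_ACID[codon] for codon in split_codons(dna))


def codon_optimize(dna):
    """Replace every codon with the host's preferred synonymous codon."""
    return "".join(
        PREFERRED_CODON[CODON_TO_AMINO_ACID[codon]] for codon in split_codons(dna)
    )


original = "ATGAAGTTATCTTAG"
optimized = codon_optimize(original)
```

The DNA sequence changes while the encoded protein does not, which is why an optimized construct can differ from the sequence found in the source organism.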
4.5 Healthcare applications and discovery of pharmaceuticals
4.5.1 Overview of sector
Worldwide spending on pharmaceuticals reached $1.2 trillion in 2018 and is projected to grow at 4-6% per year, reaching $1.5 trillion by 2023.[66] However, this market is dominated by high-cost antibody-based drugs and biologics[67] for inflammatory diseases. Genetic resources are commonly used in the discovery of small molecule pharmaceuticals, and several of these can be found in the list of most-prescribed pharmaceuticals[68], some of which are based on natural product chemicals. Estimates indicate that 20-25% of this market is derived from genetic resources[56], with nearly 2 out of 3 antibacterial agents deriving from genetic resources.[69] Also of major importance in healthcare is the prevention of disease, such as food-borne illnesses, and early diagnosis so that appropriate treatment can be provided. The global infectious disease diagnostics market is predicted to grow at 5.1% per year, from $18.7 billion in 2018 to $26.5 billion in 2025.[70]
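The market figures quoted in this overview can be cross-checked with a short calculation. The sketch below assumes that the quoted rates are compound annual growth rates and that 2018 is the baseline year for both projections; neither assumption is stated in the cited sources.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Return the constant yearly growth rate linking two values."""
    return (end_value / start_value) ** (1 / years) - 1


# Infectious disease diagnostics: $18.7 billion (2018) to $26.5 billion (2025)
diagnostics_rate = compound_annual_growth_rate(18.7, 26.5, 2025 - 2018)

# Pharmaceutical spending: $1.2 trillion (2018) to $1.5 trillion (2023)
pharma_rate = compound_annual_growth_rate(1.2, 1.5, 2023 - 2018)

print(f"Diagnostics: {diagnostics_rate:.1%} per year")
print(f"Pharmaceuticals: {pharma_rate:.1%} per year")
```

Under these assumptions the implied rates are roughly 5.1% and 4.6% per year, which is consistent with the quoted 5.1% and 4-6%.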
4.5.2 Key trends and examples
The design of diagnostic tests for infectious disease agents. Design involves the analysis of many sequences to identify highly conserved target regions within the pathogen genome that have no homology to other DNA or RNA sequences in the test sample.[71] These can then be used as markers for the presence of the pathogen. For example, diagnosis of Ebola virus infection could take as long as 3-10 days, but detecting the pathogen at 1 day reduces viral infection to almost 0%. A recent study[72] employed a CRISPR-associated RNA-guided RNA editing enzyme to detect the RNA genome of the Ebola virus in blood samples in under 5 minutes.
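The target-selection step for diagnostic design can be sketched as a set operation on short subsequences (k-mers). This is a deliberately simplified stand-in: it treats "conserved" as an exact k-mer shared by every pathogen sequence and "no homology" as exact absence from the background sequences, whereas real pipelines use alignment and tolerate mismatches. All sequences and names below are hypothetical.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def candidate_targets(pathogen_seqs, background_seqs, k):
    """k-mers found in every pathogen sequence and in no background sequence."""
    if not pathogen_seqs:
        return []
    conserved = set.intersection(*(kmers(s, k) for s in pathogen_seqs))
    background = set().union(*(kmers(s, k) for s in background_seqs))
    return sorted(conserved - background)


# Toy data: three pathogen variants sharing the region ACGTTGCAA,
# and one background (e.g. host) sequence that overlaps part of it.
pathogen = ["ACGTTGCAAGGCT", "TTACGTTGCAACC", "GGACGTTGCAATA"]
background = ["CCCGTTGCAAGG"]

targets = candidate_targets(pathogen, background, k=6)
```

Here four 6-mers are conserved across the pathogen variants, but three of them also occur in the background sequence, leaving a single candidate marker.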
Detection of pathogens in contaminated food for disease prevention. Rapid detection of food-borne pathogens helps ensure food safety. The National Center for Biotechnology Information (NCBI) hosts a ‘Pathogen Detection’ website that shares data on gene sequences for these pathogens.[73] It quickly clusters and identifies related sequences to uncover potential food contamination sources.
Discovery of new drugs. Bacteria produce a range of important pharmaceuticals.[69] Comparative genomic analysis of microbes can uncover new pharmaceutical compounds. For example, whole genome analysis of the bacterium Staphylococcus lugdunensis indicated that it contained a biosynthetic pathway for the previously unknown metabolite lugdunin, which is effective against antibiotic-resistant infections in a mouse model.
4.6 Extent of reliance on DSI and technologies/techniques enabled by DSI
It is evident from the coverage of each sector and the sectoral comparison in Table 3 that all sectors considered in this study use different types of information that potentially constitute DSI, as well as technologies/techniques enabled by DSI. In particular, genomic data appear to be highly utilized in all sectors. Similar trends are expected for other sectors in the life sciences. Thus, these sectors can be considered when discussing the scope and concept of DSI and assessing the implications of including or excluding particular types of information associated with the underlying genetic resource.
Table 3. Use of DSI-related technologies in different sectors. Relevance of each ‘omic’ technology is shown in the column headed ‘Use’ and indicated as High (H), Medium (M) or Low (L).
Technology | Sector | Use | Comment
Genomics | Taxonomy & Conservation | H | DNA barcode database incomplete for many branches of life and relies on established taxonomy and systematics with variable coverage for different divisions of life. Taxonomic organisation subject to change.
Genomics | Agriculture & Food Security | H | Reference genomes and identifying natural variation and trait loci. Metagenomic analysis of soil micro-organisms to understand crop health.
Genomics | Industrial & Synthetic Biology | H | Accurate/reliable annotations to assign functions to genes. Unknown genes require additional lab work. Proteins with same function may have different DNA sequences.
Genomics | Healthcare & Pharmaceuticals | H | Analysis of disease targets and pathogen DNA and RNA sequences to develop treatments and diagnostics.
Epigenomics | Taxonomy & Conservation | L |
Epigenomics | Agriculture & Food Security | M | Understanding heredity in livestock.
Epigenomics | Industrial & Synthetic Biology | M | Understanding of phenotypic changes in LMO produced using synthetic biology.
Epigenomics | Healthcare & Pharmaceuticals | M | Microarray data are now the main source for identifying new therapeutic targets with the current shift to personalised medicine.
Transcriptomics | Taxonomy & Conservation | M | Identification of metabolically active species in environmental samples.
Transcriptomics | Agriculture & Food Security | M | Understanding function of different micro-organisms in the soil microbiome in maintaining crop health.
Transcriptomics | Industrial & Synthetic Biology | M | Determination of genes which are being transcribed allows up or down regulation of pathways to increase production.
Transcriptomics | Healthcare & Pharmaceuticals | M | RNA silencing and gene therapy rely on these data.
Proteomics | Taxonomy & Conservation | L |
Proteomics | Agriculture & Food Security | M | Determine if LMOs are expressing desired proteins.
Proteomics | Industrial & Synthetic Biology | H | Determination of which proteins are being expressed. Used to identify genes encoding metabolites.
Proteomics | Healthcare & Pharmaceuticals | M | Understanding of proteins involved in production of potential natural product pharmaceuticals.
Metabolomics | Taxonomy & Conservation | M | Profiling of plant metabolites to identify correct phenotype (chemotaxonomy).
Metabolomics | Agriculture & Food Security | M | Determine if LMOs are producing desired metabolites.
Metabolomics | Industrial & Synthetic Biology | M | Identification and quantification of small molecules being produced, used to redirect metabolic flux to increase production of these small molecules.
Metabolomics | Healthcare & Pharmaceuticals | M | Analysis of metabolites produced by organisms studied for potential natural product pharmaceuticals.
Other | Taxonomy & Conservation | |
Other | Agriculture & Food Security | |
Other | Industrial & Synthetic Biology | H | Codon optimization to achieve expression of modified gene constructs in alternative hosts. Gene editing tools. Molecular structures of proteins.
Other | Healthcare & Pharmaceuticals | M | Development of pharmaceuticals and disease diagnostics using gene editing tools.
5 DSI: Scope and terminology
5.1 Introduction
During the 2017-2018 inter-sessional period, Parties to the CBD and the Nagoya Protocol undertook a number of steps to attempt to clarify the concept of DSI.[g] This process did not yield consensus on the appropriateness of the term ‘DSI’ or on what it refers to: whether it is limited to DNA and RNA sequences or whether it also covers the amino acid sequences of proteins and the metabolites produced by biosynthetic enzymes, among other types of information.[74] These challenges are not unique to the CBD and its Nagoya Protocol, as evidenced by comparable discussions underway in various other UN processes such as the International Treaty on Plant Genetic Resources for Food and Agriculture (ITPGRFA), the Pandemic Influenza Preparedness Framework (PIP) and the process on the Conservation and Sustainable Use of Biodiversity of Areas Beyond National Jurisdiction (BBNJ). Various definitions for ‘DSI’ and equivalent terminology have been published or proposed by organisations, trade bodies and learned societies involved in the discussions across these domains.
In 2018 the Ad Hoc Technical Expert Group (AHTEG) on Digital Sequence Information on Genetic Resources, established under the CBD and its Nagoya Protocol, proposed a broad list of subject matter that may potentially comprise DSI.[h] This list is useful as it is the most comprehensive breakdown relevant to the utilisation of genetic resources that has emerged from the CBD’s efforts to date. Accordingly, we reproduce the AHTEG list here for convenience and use it as the starting point for our observations in this study:
(a) “The nucleic acid sequence reads and the associated data
(b) Information on the sequence assembly, its annotation and genetic mapping. This information may describe whole genomes, individual genes or fragments thereof, barcodes, organelle genomes or single nucleotide polymorphisms.
(c) Information on gene expression
(d) Data on macromolecules and cellular metabolites
(e) Information on ecological relationships, and abiotic factors of the environment
(f) Function, such as behavioural data
(g) Structure, including morphological data and phenotype
(h) Information related to taxonomy
(i) Modalities of use”
This list indicates the types of information that may be relevant to the utilization of genetic resources; however, some elements were not clearly defined, such as ‘associated data’ under category (a). Some of the categories, in particular (e)-(i), were not considered in detail by the AHTEG or in the views on DSI submitted to the SCBD in the 2017-2018 inter-sessional period. The broad scope of the AHTEG list reflects the differences of opinion that exist regarding DSI subject matter, and this is also reflected in the different terminology proposed to describe the concept of DSI. Building on these previous efforts and reflecting on the terminology being considered in the various UN processes described above, this study attempts to further clarify the concept of DSI by introducing new logical groupings (‘broad’, ‘intermediate’ and ‘narrow’), which may be better suited than the AHTEG list to facilitating discussions regarding the scope and terminology associated with DSI, and by posing certain priority questions/issues which need to be resolved if a suitable terminology and scope are to be found.
[g] Parties and relevant stakeholders were invited to submit their views on potential implications of the use of digital sequence information on genetic resources for the three objectives of the Convention, and a fact-finding and scoping study addressing similar issues was commissioned (Laird and Wynberg study). A synthesis of the submissions received, including case studies and examples of the use of DSI, and the Laird and Wynberg study were considered by an Ad Hoc Technical Expert Group on Digital Sequence Information on Genetic Resources, whose report and recommendations were subsequently submitted to COP14 and its Subsidiary Body on Scientific, Technical and Technological Advice.
[h] The report of the AHTEG on Digital Sequence Information on Genetic Resources is available at https://www.cbd.int/doc/c/f99e/e90a/71f19b77945c76423f1da805/dsi-ahteg-2018-01-04-en.pdf
We commence by drawing a conceptual distinction between data and information and evaluating their flow from the utilization of a genetic resource (Section 5.2). We use this as a basis to propose the new logical groupings for DSI subject matter, which are mapped against the AHTEG list and against alternative terminology proposed to replace DSI, in order to help clarify the subject matter and boundaries of these groupings (Section 5.3). We then identify priority questions/issues that need to be addressed in order to clarify the concept of DSI by considering the meaning of the terms ‘digital’, ‘sequence’ and ‘information’ in turn (Section 5.4), and by considering the effect of modifications to DNA, RNA and protein subunits (Section 5.5).
5.2 Understanding the flow of data and information
A common challenge faced at the CBD and in other UN processes in clarifying the subject matter and terminology associated with DSI, or its equivalent terms, appears to be deciding what counts as data and the circumstances in which data is embedded with value and transformed into information (knowledge). Data is essentially a means of communicating and facilitating exchanges about the material world. Data describes inherent characteristics of material artefacts, as distinguished from research outputs or other value-adding steps that generate knowledge, the dissemination of which constitutes the sharing of information (knowledge, claims, models, theories, communities, and so on) as distinct from the underlying data itself.
In the context of a genetic resource, the question arises as to whether ‘DSI’ should be confined to representational data (such as a DNA sequence ‘GTACCTGA …’, methylation patterns, and so on) and, if not, to what extent it should include processing activities performed on that data by data producers, curators, users, and so on, to generate information in whatever format, medium or shape. Conceived this way, a key challenge faced across the various UN processes is to determine whether ‘DSI’, howsoever called, is limited to DNA and RNA sequences or whether it also captures the amino acid sequences of proteins and/or information generated by cognitive processes applied to such data.
To address this challenge, it is useful to consider the flow of data/information from a genetic resource onwards to DNA, RNA, protein sequences and metabolites, as depicted in Figure 6, which also integrates terminology and subject matter components that may assist in clarifying the concept of DSI. It is evident that at each step the data/information yielded becomes progressively further removed from the original genetic resource.
Figure 6. The flow of data/information from genetic resource through DNA, RNA and proteins to metabolites, showing the limits/boundaries of some alternative terms used to refer to DSI. Subsidiary information on the genetic resource includes sample metadata, taxonomy, biotic/abiotic environmental factors and behavioural data, amongst others.
5.3 New logical groupings & alternative terminology
To help clarify the concept of DSI, we use the flow of information from a genetic resource, particularly the degree of biological processing and proximity to the underlying genetic resource, to provide a logical basis for grouping information that may comprise DSI. This gives rise to four proposed groups: one narrow/defined group, two intermediate groups and one broad/inclusive group, as depicted in Figure 7 and further described below. They are summarized as follows:
Group 1 - Narrow: DNA and RNA
Group 2 - Intermediate: (DNA and RNA) + proteins
Group 3 - Intermediate: (DNA, RNA and proteins) + metabolites
Group 4 - Broad: (DNA, RNA, proteins and metabolites) + traditional knowledge, ecological interactions, etc.
Group 1 has a narrow scope, and therefore close proximity to the genetic resource, and is limited to nucleotide sequence information associated with transcription. Group 2 has an intermediate scope and proximity to the genetic resource and extends to protein sequences, thus comprising information associated with transcription and translation. Two interpretations of the scope of this group are possible, as discussed below. Group 3 has a wider intermediate scope and extends to metabolites and biochemical pathways, thus comprising information associated with transcription, translation and biosynthesis. Group 4 has the broadest scope, or weakest proximity to the underlying genetic resource, and extends to behavioural data, information on ecological relationships and traditional knowledge. It thus comprises information associated with transcription, translation and biosynthesis, as well as downstream subsidiary information concerning interactions with other genetic resources and the environment and the utilization of the genetic resource, among other subsidiary information.
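Because each group contains the one before it, the four proposed groups can be modelled as nested sets. The following Python sketch is only an illustration of that nesting. The labels are shorthand taken from the summary list, Group 4 is open-ended ("etc.") so only its named examples are included, and the helper function is a hypothetical name introduced here.

```python
# Each group is the previous group plus additional subject matter.
GROUP_1 = frozenset({"DNA", "RNA"})
GROUP_2 = GROUP_1 | {"proteins"}
GROUP_3 = GROUP_2 | {"metabolites"}
# Group 4 is open-ended in the text; only the named examples are listed.
GROUP_4 = GROUP_3 | {"traditional knowledge", "ecological interactions"}

GROUPS = {1: GROUP_1, 2: GROUP_2, 3: GROUP_3, 4: GROUP_4}


def narrowest_group(information_type):
    """Return the number of the narrowest group covering an information type.

    Returns None if the type is not listed in any group of this sketch.
    """
    for number in sorted(GROUPS):
        if information_type in GROUPS[number]:
            return number
    return None
```

For example, `narrowest_group("proteins")` returns 2 and `narrowest_group("metabolites")` returns 3, while everything in Group 1 is also covered by Groups 2 to 4.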
The proximity of information to the underlying genetic resource determines whether it is possible to accurately identify or infer the source from which it is derived. For example, in the case of DNA it is possible to identify the genetic source. In the case of RNA and protein sequences it is possible to infer the genetic sequence of the source; however, whereas this can be inferred with a high degree of precision/confidence for RNA, the redundancy of the genetic code makes this less precise for proteins (because multiple codons are available to encode most amino acids, more than one DNA option will be inferred from a protein sequence; see Sections 3.2 and 3.3.1). Inference becomes even more challenging with biosynthetic information, and inferring the underlying genetic code is not possible from some subsidiary information. Accordingly, proximity has significant implications for traceability to a particular genetic resource and for identifying the source of information, including whether it has been generated through the utilization of a genetic resource or independently.
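The loss of precision when inferring DNA from a protein sequence can be quantified. The Python sketch below counts how many distinct coding sequences could encode a given protein, using the number of synonymous codons per amino acid in the standard genetic code. It ignores stop codons, organism-specific code variants and codon usage bias, and the function name is illustrative.

```python
# Number of codons per amino acid in the standard genetic code
# (one-letter amino acid symbols; the 61 sense codons in total).
CODONS_PER_AMINO_ACID = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_possible_coding_sequences(protein):
    """Number of distinct DNA sequences that translate to this protein."""
    total = 1
    for amino_acid in protein:
        total *= CODONS_PER_AMINO_ACID[amino_acid]
    return total
```

A four-residue peptide such as `"MKLS"` already has 1 × 2 × 6 × 6 = 72 possible coding sequences, and the count grows multiplicatively with protein length.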
Using these proposed groups, we can evaluate the broad list of subject matter potentially comprising DSI as proposed in 2018 by the Ad Hoc Technical Expert Group (AHTEG) on Digital Sequence Information on Genetic Resources, as identified above. We can also use these groups to evaluate a range of terms proposed to replace DSI, including[i]: In silico; Dematerialised Genetic Resources (DGR); Genetic Information (GI); Digital Sequence Data (DSD); Genetic Resource Sequence Data (GRSD); Genetic Sequences (GS); Genetic Sequence Data/Information (GSD/GSI); Nucleotide Sequence Data (NSD); and Subsidiary Information (SI).[j] These evaluations are shown in Table 4, which is a key reference for the reader to understand the different groups proposed to evaluate the concept of DSI in this study. Please note that in this table additional categories are listed where the original AHTEG report is unclear. In these cases, such as ‘associated data’ in (a), which is not defined, we have added a more detailed explanation in the row underneath. Other categories are subdivided to group similar information together.
It is evident that terminology is readily available to describe DSI with narrow subject matter limited to nucleotide sequences (as proposed in Group 1). These terms include Genetic Resource Sequence Data (GRSD); Genetic Sequences (GS); Genetic Sequence Data/Information (GSD/GSI); and Nucleotide Sequence Data (NSD). It is also evident that terminology is available to describe subject matter with broad scope extending beyond transcription, translation and biosynthesis (i.e. as proposed in Group 4). These terms include In silico; Dematerialised Genetic Resources (DGR); and Genetic Information (GI).
The terms Digital Sequence Data (DSD), Genetic Resource Sequence Data (GRSD) or Genetic Resource Sequence Data and Information (GRSDI), although previously used in certain contexts to describe Group 1 (narrow), could be used to describe subject matter of intermediate scope comprising information associated with transcription and translation (as proposed for Group 2), depending on the interpretation adopted. None of the terms proposed to date appears to adequately capture an intermediate range comprising information associated with transcription, translation and biosynthesis of a genetic resource (i.e. as proposed for Group 3). Overall, the four logical groups proposed in this study provide a nuanced alternative to the 2018 AHTEG list and so may better assist in clarifying the concept and scope of DSI; however, appropriate terminology will need to be evaluated, particularly for the intermediate groups.
[i] These terms arise from CBD forums, publications, professional bodies and learned societies in the context of the parallel discussions underway in the various UN processes also attempting to clarify the concept of DSI, howsoever called.
[j] Additionally, during our interviews, further terms were introduced which can be analysed by reference to Table 4 and the discussions above: “Biological sequence information”, “Functional sequencing information”, “Digital genetic resources and sequence information”, “Digital biological code”, “Digital sequence information on genetic material” and “Digital biological information”. These will not be discussed but could be analysed in the same manner as all the terminology discussed above.
Figure 7. Proposed subject matter groupings to facilitate discussions concerning DSI scope and terminology. Group 1 only includes data on DNA and RNA sequences, whereas Group 2 also incorporates protein sequences. Group 3 extends to metabolites, and Group 4 is the broadest category, which extends further downstream to all subsidiary information.
Table 4. Scope of the different current terminologies, showing the subject matter groupings as in Figure 7. Some of the AHTEG categories have been subdivided or supplemented with additional subcategories for clarity.
Where: DGR = dematerialised genetic resources; GI = genetic information; DSD = digital sequence data; GRSD = genetic resource sequence data; GS = genetic sequence; GSD = genetic sequence data; GSI = genetic sequence information; NSD = nucleotide sequence data; and SI = subsidiary information.
Columns, left to right: Narrow/Defined (Group 1): DSD, GRSD, GS, GSD, GSI, NSD, SI; Intermediate (Groups 2 & 3): 2a, 2b, 3; Broad/Inclusive (Group 4): In silico, DGR, GI. Marks for each row are listed in column order; empty cells are not shown.
AHTEG Category | Component | Marks
a1 | Nucleic acid sequence reads | + + + + + + + + + + +
a2 | Associated data to nucleic acid reads (technical aspects of sequencing experiments: the sequencing libraries, preparation techniques and data files). | + + + + + + + + + +
b1 | Information on the sequence assembly, including structural annotation and genetic mapping. (This information may describe whole genomes, individual genes or fragments thereof, barcodes, organelle genomes or single nucleotide polymorphisms). | + + + + + + + + +
b2 | Non-coding nucleic acid sequences | + ? ? + + + + + + +
b3 | Functional annotation of genes | ? + + + ? +
c1 | Information on gene expression | + + + + ? +
c2 | Epigenetic heritable elements (e.g. methylation patterns). | + + + + ? +
d1 | Amino-acid sequence of proteins produced by gene expression. | + + + + + ? +
d2 | Molecular structures of proteins. | + + + + ? +
d3 | Data on other macromolecules and cellular metabolites. (Molecular structures). | + + + ? +
e | Information on ecological relationships, and abiotic factors of the environment. | + + ? +
f | Function, such as behavioural data (this would include environmental influences). | + + ? +
g | Structure, including morphological data and phenotype (this would include environmental influences). | + + ? +
h | Information related to taxonomy. | + + ? +
i | Modalities of use. | + + ? +
 | Additional undefined elements. | + + ? +
5.3.1 Broad scope of subject matter: information associated with biological processing and subsidiary information
Scope
As proposed above, Group 4 has the broadest scope, or weakest proximity to the underlying genetic resource, and extends to behavioural data, information on ecological relationships and traditional knowledge. It thus comprises information associated with transcription, translation and biosynthesis, as well as downstream subsidiary information concerning interactions with other genetic resources and the environment and the utilization of the genetic resource, among other subsidiary information.
Evaluation of Existing Terms
In silico. The BBNJ process is using the term ‘in silico’ storage and utilisation of data or information as a placeholder until a better term can be found. The term is also used by some CBD Parties. In biology and chemistry this term is used to mean ‘performed on computer or via computer simulation’, with reference to silicon, the material from which computer chips are manufactured. It may refer to any data or information held or processed on a computer, all of which fall within AHTEG categories (a)-(i).
Dematerialised Genetic Resources (DGR).[75] This terminology refers to the informational aspects of genetic resources. It includes the acquisition, digitalisation, storage and dissemination of DNA sequences from genetic resources. The separation between the provider of the genetic resource and the eventual user, as well as the digital nature of the data, prompts the use of the word ‘dematerialised’. This information can then be ‘re-materialised’ through gene synthesis and incorporation in living modified organisms (genetically modified organisms). The term may cover only the DNA and RNA sequences of these genetic resources, but the word ‘dematerialised’ may include all types of information relating to the genetic resources from categories (a)-(i) in the AHTEG study.
Genetic Information (GI).[76] A collective term used to refer to information derived from genetic resources, plant materials and viruses. It was a catch-all term used in discussions around this information under the CBD, the Nagoya Protocol, the ITPGRFA and the PIP Framework, and encompasses AHTEG categories (a)-(i).
5.3.2 Intermediate scope: information associated with biological processing involving transcription, translation and biosynthesis
Scope
As proposed above, Group 3 has an intermediate scope and proximity to the genetic resource and extends to protein sequences and metabolites, thus comprising information associated with transcription, translation and biosynthesis. Figure 8 shows how the proposed intermediate groupings relate to the scope of the existing terminology (top panel) and also how the scope of proposed Group 3 could be interpreted (bottom panel).
Evaluation of Existing Terms
None of the terms proposed to date appears to adequately capture an intermediate range comprising information associated with transcription, translation and biosynthesis of a genetic resource, as proposed for Group 3.
Figure 8. Top Panel. A graphical representation of the main terminologies proposed to replace ‘DSI’, showing the extent of processing carried out on data to convert it to information plotted against the flow from genetic resource onwards to DNA, RNA, proteins and metabolites. The potential coverage of the two proposed intermediate groupings is indicated, showing that it includes DNA, RNA, protein sequences, metabolites and a defined range of associated data and information selected from AHTEG categories (a)-(d) (for instance functional annotations of genes, gene expression information, epigenetic data, and molecular structures of proteins). Bottom Panel. The different ways that the intermediate subject matter grouping could be interpreted. Group 2a includes DNA/RNA sequence data, including non-coding sequences, and information on the sequence assembly, including structural annotation and genetic mapping, as well as protein sequence data. Group 2b is the same as Group 2a but additionally includes functional annotation of genes, gene expression information, epigenetic data, and molecular structures of proteins. Group 3 is the same as Group 2b but adds data on other macromolecules and metabolites, including their molecular structures.
5.3.3 Intermediate scope: data/information associated with biological processes involving transcription and translation
Scope
As proposed above, Group 2 has an intermediate scope and proximity to the genetic resource and extends to protein sequences, thus comprising information associated with transcription and translation. Two interpretations of the scope of this group are possible: either the subject matter is strictly limited to nucleotide and protein sequence data (Group 2a), or it includes information associated with transcription and translation more broadly, for instance functional annotations of genes, gene expression information, epigenetic data and molecular structures of proteins (Group 2b).
Evaluation of Existing Terms
Only one term, Digital Sequence Data (DSD), as described below, appears readily available to describe subject matter of intermediate scope comprising information associated with transcription and translation (as proposed for Group 2). However, the term is understood to be limited to raw sequence data, including raw protein sequence data, and so would only be suitable for the narrow interpretation of scope considered for this group (Group 2a).
Another term, Genetic Resource Sequence Data (GRSD), which was intended by its proponent, the International Chamber of Commerce, to be limited to nucleotide sequences (see Section 5.3.4), could be re-interpreted more broadly to describe DSI subject matter which includes protein sequences. This is because the ‘Genetic Resource’ prefix gives the impression that the term covers sequence data related to a genetic resource more broadly. Although this term would also be suitable only for the narrow interpretation of scope considered for this group (Group 2a), the broader interpretation (Group 2b) could be accommodated through use of the term Genetic Resource Sequence Data and Information (GRSDI), which comprises both data and information related to proteins.[k]
5.3.4 Narrow scope: limited to nucleic acid sequence data associated with transcription
Scope
As proposed above, Group 1 has a narrow scope, and therefore close proximity to the genetic resource, and is limited to nucleotide sequence information associated with transcription.
Terms
Digital Sequence Data (DSD). This includes DNA, RNA and protein sequences. However, since the word ‘data’ is used, it refers only to raw sequence data derived directly from genome and protein sequencing. Data that has been processed, for example by automatic DNA annotation through comparison with other DNA sequences in a database, or by automated conversion of raw DNA data into protein sequences, will be out of scope of ‘DSI’ as it could now be considered ‘information’ and no longer data (AHTEG category (a) only).
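The line drawn here between raw data and processed information can be made concrete with a short Python sketch. It applies one automated processing step, translation with the standard genetic code, to a raw DNA string and files the result separately from the raw sequence. The record layout and field names are hypothetical and are not taken from any database or from the DSD proposal.

```python
# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(dna):
    """Translate DNA in frame 1, stopping at the first stop codon.

    Trailing bases that do not form a complete codon are ignored.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def build_record(raw_dna):
    """Separate the raw sequence ('data') from derived output ('information')."""
    return {
        "data": {"dna_sequence": raw_dna},
        "information": {"predicted_protein": translate(raw_dna)},
    }


record = build_record("ATGGCCAAGTAA")
```

On the reading of DSD described above, only the `data` part of such a record would be in scope, even though the `information` part is produced from it without human intervention.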
[k] We understand this term may have in fact been considered in the BBNJ process, which has ultimately adopted ‘in silico’ as a placeholder term.
Genetic Resource Sequence Data (GRSD).[77] The International Chamber of Commerce defines this as: “the description of the order of nucleotides (DNA or RNA), as found in nature, in the genome or encoded by the genome of a given genetic resource. The ‘genome’ includes nuclear and extra-nuclear DNA, and coding (gene) and non-coding DNA sequences. It does not include other molecules resulting from natural metabolic processes associated with or requiring the genetic resource. GRSD cannot, and does not, include information connected with or resulting from the analysis or further application of GRSD, e.g. sequence assembly, sequence annotation, genetic maps, metabolic maps, three-dimensional structure information or physiological properties related to it. Including information resulting from human interaction on GRSD would result in yielding man-made genetic sequences, which would no longer be considered GRSD.” This definition therefore very clearly covers only AHTEG category (a), as it explicitly excludes metabolites and by omission excludes protein sequences and metadata associated with the genetic resource. However, data on protein sequences can usually be predicted by automated analyses of the DNA sequences, although there are exceptions (see Section 3.2). Arguments around the use of the word ‘data’ were given above for DSD.
This definition is very clear in that it expressly includes all possible forms of DNA discovered to date, in particular non-coding DNA.[78] Non-coding DNA might be excluded by CBD Article 2, which defines ‘genetic material’ as meaning “any material of plant, animal, microbial or other origin containing functional units of heredity”, as no function has yet been ascribed to some types of non-coding DNA.
The use of the word ‘genetic’ may be important here, as it ascribes a function to the data but does not specify the molecular mechanism by which heredity should occur. This therefore potentially allows for the inclusion of modified DNA and RNA (see Sections 3.3 & 5.5), as long as these can transfer genetic information in a hereditary manner.
Genetic Sequences (GS). The PIP Framework defines these as “the order of nucleotides found in a molecule of DNA or RNA. They contain the genetic information that determines the biological characteristics of an organism or a virus” (Art 4.2).[79] This could refer to the actual DNA or RNA from the genetic resource or to the sequence data/information. This definition makes clear the extent of what is included, namely only DNA/RNA (AHTEG categories (a) and (b)), and excludes proteins, metabolites and metadata associated with the genetic resource.
Genetic Sequence Data/Information (GSD/GSI). Like GRSD, this refers only to genetic
30
data/information, but additional clarity will be needed to indicate that this is derived from a genetic
31
resource. This includes only DNA and RNA sequences and not protein sequences or information on
32
metabolites, thus encompassing only AHTEG categories a. and b.
33
Nucleotide Sequence Data (NSD) and Subsidiary Information (SI). NSD is more specific than
34
GSD/GRSD and includes only DNA and RNA sequence data, and expressly refers to the chemical
35
structure of the component nucleotides. It refers only to the AHTEG categories a.-b., and an
36
accessory term, ‘subsidiary information’ (SI) is introduced to cover metadata associated with the
37
genetic resource, data on proteins and metabolites, thus encompassing AHTEG categories c.-i. These
38
are the terms used by the International Nucleotide Sequence Database Collaboration (INSDC, see DSI
39
study 2/3 for additional discussion). The institutes that run the INSDC also run additional databases
40
that contain information on protein sequences derived from gene predictions and translations of
41
DNA sequences.
42
38
A description of the relationship between NSD and SI is given in a recent submission [80]: “NSD include non-coding & coding sequences, regulatory sequences, conserved sequences, genes that encode specific traits, DNA without known function and ‘junk DNA’. Larger data elements would include the entire genome of an organism [or, indeed, of a clade (pangenome) or environmental sample (metagenome)]. NSD are aggregated from naturally occurring genetic resources generated as a part of research or downloaded from INSDC and other databases. Analyses of NSD are interpreted in research to develop understanding of biological diversity at genetic, species and ecosystem levels.”
Because the term NSD is so specific, it excludes DNA and RNA in which the nucleotides have been modified so that they can no longer be regarded as nucleotides, despite their potential to carry the genetic code and be duplicated (see Sections 3.3 & 5.5). A second observation is that NSD ascribes no function to the nucleotides, in contrast to the previous three definitions, where the term ‘genetic’ implies one. By not referring to function, non-coding DNA is also brought within scope of ‘DSI’, as is clear from the definition of NSD above.
5.4 Digital Sequence Information
The 2018 Laird and Wynberg study [1] summarises objections to this terminology and explains why it is not appropriate to describe the elements of ‘genetic information’ that might be included under the CBD. We build on this by considering each of the constituent elements of ‘DSI’ in turn, and in the process identify important issues which need to be considered in order to clarify the concept of DSI. We use the Oxford English Dictionary (OED) definitions for ‘Digital’, ‘Sequence’ and ‘Information’ and provide an analysis concerning the suitability/desirability of each term in any terminology proposed to replace DSI. Where relevant, we also assess implications regarding the type of information that may be associated with the concept of DSI, arising from the use of the constitutive term.
5.4.1 Digital (OED): “Of signals, information, or data: represented by a series of discrete values (commonly the numbers 0 and 1), typically for electronic storage or processing.”
The word ‘digital’ refers only to the way in which data is held, implying it is in computer memory or data storage. To counter this, it is stated that this data can also be held in other forms such as on paper. DNA sequences printed on paper are machine readable but are not ‘digital’ in this sense, and would therefore be out of scope of ‘DSI’ despite conveying the same information.
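The point that the same information can be held in different forms can be illustrated with a minimal Python sketch that stores a short nucleotide sequence as a series of 0s and 1s and recovers the letters again. The two-bit code (A=00, C=01, G=10, T=11) and the function names are assumptions chosen for illustration; they are not a scheme used by any database discussed in this study.

```python
# Hypothetical two-bit code for the four DNA bases (an assumption for illustration).
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode(sequence: str) -> str:
    """Return the sequence as a string of 0s and 1s, two bits per base."""
    return "".join(BASE_TO_BITS[base] for base in sequence)


def decode(bits: str) -> str:
    """Recover the letter sequence from its two-bit representation."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


if __name__ == "__main__":
    digital = encode("GATTACA")
    print(digital)          # the 'digital' form of the sequence
    print(decode(digital))  # the same information written as letters
```

The round trip loses nothing, which is the sense in which the printed and the digital form convey the same information.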
5.4.2 Sequence (OED): “The fact of following after or succeeding; the following of one thing after another in succession; an instance of this.” A subsidiary definition is given for biochemistry: “The order of the constituent nucleotides in a nucleic acid molecule or of the amino-acids in a polypeptide or protein molecule.”
Anything stored on a computer is in the form of a sequence such as ‘001100100 …’ and would be captured by any of the terms ‘digital’, ‘sequence’ and ‘digital sequence’. The term ‘sequence’ is applied to DNA, RNA and proteins, whose subunits (nucleotides for DNA/RNA, amino acids for proteins) can be described by sequences of letters or groups of letters (e.g. the amino acid chain MARWAELCEL can also be given as Met-Ala-Arg-Trp-Ala-Glu-Leu-Cys-Glu-Leu). Whilst this information is useful, it gives no indication of the gene function or expression level of these sequences. For DNA, the sequence alone does not indicate gene expression, its effect on phenotype, and many other important characteristics (‘Broad’ subject matter grouping, Group 4, AHTEG categories c, e, f, g, h).
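The equivalence of the one-letter and three-letter notations mentioned above can be shown with a short Python sketch. The mapping covers the 20 common amino acids; the function name is an assumption made for this example.

```python
# One-letter to three-letter codes for the 20 common amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(protein: str) -> str:
    """Rewrite a one-letter protein sequence in hyphenated three-letter form."""
    return "-".join(ONE_TO_THREE[residue] for residue in protein)


if __name__ == "__main__":
    print(to_three_letter("MARWAELCEL"))
```

Both notations carry exactly the same sequence; neither says anything about function or expression level.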
Length or function of Sequence. The length of the sequence and its function may govern whether a DNA sequence is unique to a particular genetic resource or origin. Table 4 in Study 2/3 shows that sequences below 30 nucleotides may not be unique, meaning that a search for a sequence of fewer than 30 nucleotides may yield multiple results from different organisms found in different countries. Parties need to consider a minimum sequence length, taking into consideration the data presented in Table 4 in Study 2/3. In addition, it needs to be considered whether non-coding elements such as promoters, which are functional but do not encode proteins, should be regarded as being within the scope of ‘DSI’, as should elements such as BioBricks, which serve a variety of functions and may be natural, modified or synthetic DNA sequences (Section 3.7).
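The effect of query length on uniqueness can be sketched with a toy search in Python. The three records below are invented for illustration only; a real search would be run with a tool such as BLAST against the INSDC databases, and the lengths used here are far shorter than the 30-nucleotide figure discussed above.

```python
# Invented toy records (hypothetical organisms and sequences, for illustration only).
RECORDS = {
    "organism_A": "ATGGCTCGTTGGGCTGAACTG",
    "organism_B": "TTGGCTCGTAAGGCTGAACCC",
    "organism_C": "ATGGCTCGTTGGACTGAACTG",
}


def organisms_containing(query: str) -> list[str]:
    """Return the names of all records whose sequence contains the query exactly."""
    return sorted(name for name, seq in RECORDS.items() if query in seq)


if __name__ == "__main__":
    # A short query matches several records, so it cannot be traced to one source.
    print(organisms_containing("GGCTCGT"))
    # A longer query from the same region matches only one record.
    print(organisms_containing("GGCTCGTTGGGCTGAAC"))
```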
Environmental and metagenomic DNA. Acquisition of environmental and metagenomic DNA is now common in many research areas (Section 3.5). In the context of the CBD, the genetic resource underpinning such sequences is the combined DNA in the sample and not the individual organisms from which it arises. Whereas it is therefore possible to connect the DNA sequences to the genetic resource, it will be very difficult to connect them to the originating organism, thus raising problems of traceability for such materials. Most environmental and metagenomic DNA sequences will be partial or incomplete, but they are still vital to the understanding of community structures and many other applications.
Microarray data. It is not clear whether microarray data (Section 3.8.1) would be regarded as ‘sequence’ data. The microarray readout is a quantity of light (fluorescence) and is conceivably not directly sequence data. If gene expression (AHTEG category c.) is included in any term used to replace ‘DSI’ then this brings microarray data within scope of ‘DSI’. Microarray data would therefore be included in Intermediate Groups 2b and 3.
Three-dimensional structural information. The word ‘sequence’ approached in this way would also exclude three-dimensional structural information on DNA, RNA and proteins, which is essential to understand their biological function and interaction with DNA, RNA, proteins and metabolites. This distinction was used in the case between D’Arcy and Myriad Genetics Inc., heard at the High Court of Australia, where the chemical composition of a DNA sequence was regarded as different from the information that this genetic sequence contained. Structural information on proteins (atom coordinates) is contained within standardised text files known as ‘pdb’ files (Section 3.8.2). Three-dimensional structural information is included in Intermediate groups 2b and 3.
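To show what ‘atom coordinates in a standardised text file’ means in practice, the following Python sketch builds one simplified ATOM record and reads the x, y and z coordinates back from their fixed columns (31-38, 39-46 and 47-54 in the PDB format). The helper names and the example values are assumptions for illustration, and the record omits fields such as occupancy and element that real files contain.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a simplified ATOM record using the fixed-width PDB column layout."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


def parse_coordinates(line: str) -> tuple[float, float, float]:
    """Read x, y, z from columns 31-38, 39-46 and 47-54 of an ATOM record."""
    return float(line[30:38]), float(line[38:46]), float(line[46:54])


if __name__ == "__main__":
    # Hypothetical nitrogen atom of a methionine residue (values invented).
    record = format_atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614)
    print(record)
    print(parse_coordinates(record))
```

The record is text, but it describes positions in space rather than an order of subunits, which is why it sits outside a narrow reading of ‘sequence’.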
Macromolecules. The use of the word ‘macromolecule’ in AHTEG category d. could also cause confusion as this includes all DNA, RNA, proteins and polysaccharides, amongst others. Polysaccharides are chains of sugar molecules that are frequently encountered in biology and can be regarded as macromolecules that can be represented as sequences. Examples include starch and glycogen, which act as energy stores, and cellulose, which is a structural material. They can form very long linear or branched chains of the same sugar molecule, such as starch, which can contain more than a thousand molecules of glucose joined in a uniform linear way. A more complex example is the recognition by the immune system of complex sequences of sugars in potential pathogens. All of these complex sequences of sugars are the outcome of an organism’s metabolism, the interaction of many proteins working together to generate polysaccharides, and these could therefore be defined as ‘derivatives’ [l]. Using the word ‘sequence’, without clearly defining sequences of which type of subunit, might therefore bring polysaccharides within scope of ‘DSI’. Data on macromolecules including their molecular structures is included in Intermediate group 3.
Alternative representations of metabolites. The use of the word ‘sequence’ would appear to exclude small molecule metabolites, which fall under ‘derivatives’ under the Nagoya Protocol (AHTEG category d). However, molecular structures can be represented and stored as ‘sequences’ in the form of SMILES (Simplified Molecular Input Line Entry System, Figure 9), which includes information on molecular connectivity without specifying two or three-dimensional coordinates of the atoms in the molecule. In mathematical terminology it is a ‘molecular graph’ expressed as a ‘sequence’. Each SMILES string corresponds to only one molecular graph (molecular structure); the reverse holds only for canonical SMILES, since a single structure can generally be written as several equivalent SMILES strings. If it is accepted that metabolites can be described as ‘sequences’ in this way, this could bring small molecule metabolites within scope if the word ‘sequence’ is used in the eventual definition. All molecules can be described in this way, including atom-level descriptions of DNA, RNA, proteins, polysaccharides and metabolites, meaning that any definition including the word ‘sequence’ would include these. In the current proposal for intermediate groups, data on cellular metabolites, including molecular structures, is included in Group 3.
A key issue in clarifying the concept of DSI is to consider which types of ‘sequence’ should be included in any replacement terminology for ‘DSI’. If the definition of ‘sequence’ includes only DNA, RNA and proteins and not sequential representations of small molecules (e.g. as SMILES strings), then it becomes possible to describe DNA, RNA and proteins as SMILES strings, which under this interpretation would not be regarded as sequences. Using this approach, the same information is conveyed without using the normal sequence representation of DNA and RNA (using the 4-letter nucleotide code) or proteins (using the 20-letter amino acid code).
[l] Nagoya Protocol Article 2c: “ ‘Derivative’ means a naturally occurring biochemical compound resulting from the genetic expression or metabolism of biological or genetic resources, even if it does not contain functional units of heredity”
Figure 9. The relationship between nucleotide sequence, chemical structure and SMILES string of the same DNA sequence.
5.4.3 Information (OED): “That which is obtained by the processing of data.”
Data (OED): “Related items of (chiefly numerical) information considered collectively, typically obtained by scientific work and used for reference, analysis, or calculation.”
Broader definitions of ‘information’ include the imparting of information in general, as alluded to in Section 5.2. Genes carry information which is interpreted through the action of transcription and translation and is under the influence of environmental factors to give rise to the phenotype of an organism (see Figure 3). Mutations and selection pressures lead to the evolution of these genes over time, thus altering this information in a continuous process.
The use of the word ‘information’ in keeping with the OED definition above implies that the raw data (e.g. the raw sequences of nucleotides in DNA/RNA or amino acids in proteins) has been processed in some way or has had some value added by processing. The question is whether automated steps, such as automated annotation or the translation of codons in a DNA sequence into a protein sequence, are enough to convert data into information, or whether human intervention or curation is essential (see Section 3.4).
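The kind of automated step in question can be made concrete with a minimal Python sketch that translates a DNA coding sequence into a protein sequence. It assumes the standard genetic code and a sequence that is already in the correct reading frame; it ignores the exceptions noted in Section 3.2 (for example organisms that use variant codes), and the function and variable names are illustrative.

```python
BASES = "TCAG"
# Amino acids of the standard genetic code, listed in codon order with the
# bases taken in the order T, C, A, G at each position ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA coding sequence, stopping at the first stop codon."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # One possible DNA encoding of the peptide used as an example in Section 5.4.2.
    print(translate("ATGGCTCGTTGGGCTGAACTGTGTGAACTGTAA"))
```

No human judgement is involved in this step, which is why it is a useful test case for where ‘data’ ends and ‘information’ begins.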
The error rate for the different DNA sequencing methodologies discussed in section 3.4 and Table 2 must be considered here as different levels of data processing are necessary to convert raw ‘reads’ into an accurate DNA sequence, which may be construed as converting data into information. The distinction between the terms ‘data’ and ‘information’ has been heavily discussed in submissions to the CBD DSI process and there appears to be a consensus that the difference between data and information is the level of processing that has been executed. In the context of ‘DSI’ it will be helpful to develop clarity on what automatic and semi-automatic processing might be included and what might be excluded from the scope of subject matter comprising DSI. For example, sequence alignment could be included as it is a necessary and semi-automated element of developing sequence data for further analysis. This is included in most of the proposed terminology introduced in Section 5.3 except for ICC’s original definition of genetic resource sequence data which includes only raw sequence data but excludes aligned sequences.
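As an illustration of processing raw reads into a single sequence, the following Python sketch takes reads that are assumed to be already aligned and of equal length, and reports the most common base at each position. This majority-vote consensus is a deliberate simplification chosen for the example; real pipelines also use base quality scores and handle gaps, and the reads shown are invented.

```python
from collections import Counter


def consensus(aligned_reads: list[str]) -> str:
    """Return the most common base in each column of equal-length aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_reads)
    )


if __name__ == "__main__":
    # Three invented reads of the same region; the third carries one read error.
    reads = ["ACGTAC", "ACGTAC", "ACCTAC"]
    print(consensus(reads))
```

The output is no longer any single raw read, which is the sense in which even routine processing may be construed as turning data into information.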
5.5 Modifications to DNA, RNA and protein sequences and their subunits
Modifications to DNA, RNA and protein sequences and their subunits (nucleotides, amino acids) were discussed in section 3.3. These must be considered in clarifying the concept of DSI, as such modifications influence the possible scope of subject matter constituting DSI under the terminology proposed to replace DSI discussed above. The key issue is whether only unmodified DNA, RNA and proteins are within scope of any terminology proposed to replace ‘DSI’ and, if not, what extent of modification is permitted if modifications are to fall within the new ‘DSI’ terminology.
DNA Modifications. Designer unnatural DNA sequences can generate wholly new proteins, and as these do not trace back to any genetic resource they do not fall within scope of any definition relating to ‘DSI’. Nucleotides in which the base, sugar or phosphate has been modified may no longer be considered nucleotides. Therefore, any terminology to replace ‘DSI’ that includes ‘nucleotide’ would leave these novel, synthetic analogues out of scope, even though in principle they may have the same function as DNA derived from a genetic resource. Alternatively, the nucleotide structure can be retained while the number of bases used is increased, but as these additional bases are not natural and do not derive from a genetic resource they will fall outside the scope of any terminology proposed to replace ‘DSI’. Accordingly, in order to clarify the concept of DSI it needs to be considered whether these modifications should all be regarded as ‘unnatural’ and whether those that do not arise from a genetic resource should be considered as ‘DSI’ for policy discussions or not, even if they have the same function as DNA derived from a genetic resource.
If only DNA and RNA sequences are included in the eventual definition of DSI (as per Group 1), then epigenetic methylation of DNA may be excluded. It is important to consider that methylation does not affect the nucleotide sequence and the pattern is not easily predicted, but it does influence gene expression.
Protein modifications. Modifications to proteins, including phosphorylation, and compounds containing amino acid subunits, such as RiPPs and NRPS products, are one step further along the flow from genetic resource to DNA, RNA onwards and do not fit neatly under the terms ‘proteins’ or ‘metabolites’. To clarify the concept of DSI it first needs to be considered whether proteins are included under any terminology replacing ‘DSI’, and if so what level of modification is considered allowable.
6. Conclusions and implications for future discussions concerning DSI
6.1 Subject matter groupings
Considering the flow of data/information from, and its proximity to, an underlying genetic resource provides a basis to group information that may comprise DSI and this gives rise to four logical groupings. To recap, Group 1 has a narrow scope or proximity to the genetic resource and is limited to nucleotide sequence data associated with transcription. Group 2 has an intermediate scope or proximity to the genetic resource and extends to protein sequences, thus comprising information associated with transcription and translation. Two interpretations for the scope of this group are possible: either subject matter is strictly limited to nucleotide and protein sequence data or it includes information associated with transcription and translation more broadly, for instance, functional annotations of genes, gene expression information, epigenetic data, and molecular structures of proteins. Group 3 has a wider intermediate scope or proximity to the genetic resource and extends to metabolites and biochemical pathways, thus comprising information associated with transcription, translation and biosynthesis. Group 4 has the broadest scope or weakest proximity to the underlying genetic resource and extends to behavioural data, information on ecological relationships and traditional knowledge, thus comprising information associated with transcription, translation and biosynthesis, as well as downstream subsidiary information. These groupings could be used in future discussions to evaluate DSI subject matter and terminology (e.g. instead of the AHTEG list).
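The cumulative nature of the four groups can be sketched as a small Python data structure in which each group adds subject matter to the one before it. The labels are paraphrased from the recap above, Group 2 is taken in its strict interpretation (sequence data only), and the names `ADDED_AT_GROUP` and `in_scope` are assumptions made for this example.

```python
# Subject matter added at each level, paraphrased from the recap in the text.
# Group 2 is shown in its strict reading (nucleotide and protein sequences only).
ADDED_AT_GROUP = {
    1: {"DNA sequence", "RNA sequence"},
    2: {"protein sequence"},
    3: {"metabolite", "biochemical pathway"},
    4: {"behavioural data", "ecological relationship", "traditional knowledge"},
}


def in_scope(item: str, group: int) -> bool:
    """Return True if the item falls within the cumulative scope of the group."""
    return any(item in ADDED_AT_GROUP[level] for level in range(1, group + 1))


if __name__ == "__main__":
    for group in (1, 2, 3, 4):
        print(group, in_scope("protein sequence", group), in_scope("metabolite", group))
```

Choosing a group therefore fixes the boundary for every type of item at once, rather than requiring a separate decision for each.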
The proximity of information to the underlying genetic resource determines whether it is possible to accurately identify or infer the source from which it is derived. This has implications for the traceability of information to a particular genetic resource and also in identifying the source of information, including whether it has been generated through the utilization of a genetic resource or independently. If traceability of DSI is important, a narrow scope of DSI subject matter appears more desirable given the technical difficulties in identifying or inferring origin, whereas if traceability is not important a broader scope of subject matter may be able to be accommodated.
6.2 Priority issues to clarify the concept of DSI
Throughout this study we have identified a number of priority issues that should be addressed in order to clarify the concept of DSI. Irrespective of whether the logical groups proposed in this study are adopted, these issues should be used to help guide deliberations concerning the scope and concept of DSI. These issues can be summarised as follows and for each we propose a logical approach that may assist in resolving the issue:
1.) How far along the flow from genetic resource onwards to DNA, RNA, protein sequences and metabolites ‘DSI’ can be considered to extend. Specifically: whether macromolecules (e.g. proteins, polysaccharides) are included under ‘DSI’ and whether small molecules (metabolites) are included under ‘DSI’. This can be resolved by adopting one of the four groups proposed to clarify the scope of DSI subject matter, in which case all macromolecules and metabolites would be excluded under Group 1 or 2, whereas they would be included under Groups 3 or 4.
2.) The distinction between data and information and how this is stored and processed, including the extent to which data has been processed before it can be considered information. This can be resolved by adopting one of the four groups proposed to clarify the scope of DSI subject matter, as these have clear subject matter boundaries and so an approach, criteria or definition for distinguishing between data and information is not necessary.
3.) Types of sequences that are included under any terminology proposed to replace ‘DSI’. Specific questions are:
a. What length of sequence can still be considered as a ‘sequence’: sequences below 30 nucleotides may not be unique and so this may provide a logical threshold below which information should be excluded from DSI subject matter.
b. Whether non-coding DNA should be included under ‘DSI’: genetic elements which do not encode proteins (such as promoters) may have a natural functional role in transcription, translation or biosynthesis and on this basis may be considered an inherent part of the underlying genetic resource, such that it would be illogical to distinguish between coding and non-coding sequences.
c. Whether epigenetic heritable factors should be included under ‘DSI’: using the same rationale as in b. above, epigenetic heritable factors may have a natural functional role in transcription, translation or biosynthesis and on this basis may be considered an inherent part of the underlying genetic resource, such that it would be illogical to exclude them from DSI subject matter.
d. Whether modified DNA, RNA (and proteins) should be included under ‘DSI’: using the same rationale as in b. and c. above, naturally modified DNA, RNA or proteins may nevertheless have a natural functional role in transcription, translation or biosynthesis and on this basis may be considered an inherent part of the underlying genetic resource, such that it would be illogical to exclude them from DSI subject matter, at least to the same extent that DSI subject matter includes DNA, RNA and/or proteins. Conversely, synthetically modified DNA, RNA or proteins cannot be said to have a natural functional role and so on this basis could be considered not to be an inherent part of the underlying genetic resource.
6.3 Subject matter groupings and life-science sectors
Section 4 discussed how different sectors rely on the use of DSI, including by comparing and contrasting their reliance on ‘omics’ technologies (Table 3). Applying the subject matter groupings proposed in Section 5 we are able to consider the implications for each sector if DSI subject matter is construed in a narrow, intermediate or broad manner, as depicted in Table 5 and described in greater detail below. Unsurprisingly, given our earlier observations showing a heavy reliance on ‘omics’ technologies and on trends and technologies enabled by DSI, Table 5 confirms that all sectors rely heavily on DNA and RNA sequence data (narrow group) and on functional annotations of genes and protein sequence data obtained via proteomic techniques (intermediate group 2a/b), while molecular structures of proteins and metabolomic data (intermediate group 3) are particularly important in industrial biotechnology, synthetic biology, healthcare and drug discovery. Irrespective of whether a narrow, intermediate or broad approach is used in defining the scope of DSI, all sectors would be within scope, as all use information, and applications/technologies which rely on such information, within each grouping. Of course, the broader the scope of DSI subject matter adopted, the greater the number of technologies, techniques and overall activities in each sector that would rely on information that falls within the scope of DSI.
Table 5. Applying the proposed DSI subject matter groupings to the different life-sciences sectors.

| DSI Subject Matter Grouping | Taxonomy & Conservation | Agriculture & Food Security | Industrial & Synthetic Biology | Healthcare & Pharmaceuticals |
| --- | --- | --- | --- | --- |
| Narrow (Group 1) (DNA/RNA) | Most critical to this field of use is DNA/RNA sequence data, and ability to compare these to the sequence databases. | Reference genomes are needed to carry out any marker assisted breeding or genetic modification. Metagenomes critical to understand health of soil microbiome. | Availability of multiple related gene sequences for comparison is essential for this field. | DNA and RNA sequences essential to this field. |
| Intermediate (Group 2) (DNA/RNA/Proteins) | | Gene annotations needed for marker assisted breeding. Gene expression information relevant. Epigenetic heritable elements are important. Protein sequences less important than DNA/RNA sequences. Proteomic data used to assess effect of breeding/genetic modification. | Gene annotations needed to discover related enzymes. Information on protein sequences is essential for the engineering of enzymes. Molecular structure of proteins is needed for targeting modifications. | Gene annotations needed to discover related proteins. Information on gene expression is needed to understand essential metabolic processes in pathogens. Protein sequences are important to understand how small molecule metabolites are biosynthesized. Molecular structures of proteins are necessary to understand and engineer metabolite biosynthesis. |
| Intermediate (Group 3) (DNA/RNA/Proteins/Metabolites) | Metabolome of organism can be used to assist taxonomy. | Metabolomic data important to understand nutritional value of crops. | Information on molecular structure of metabolites needed for many products. | Information on molecular structure of metabolites needed for many products. |
| Broad (Group 4) (DNA/RNA/Proteins/Metabolites and subsidiary information) | For taxonomic description this includes morphological data. Species richness and assemblage may change in response to abiotic and environmental influences. | Phenotype, ecological relationships, environmental and abiotic factors are relevant. | | Ecological relationship and abiotic environmental factors are important to understand pathogen evolution and spread. |
Acknowledgments: The following are thanked for their helpful discussions and review of drafts: Abbe Brown, Lydia Slobodian, Sarah Laird, Chris Lyal, Charles Lawson and Amber Scholz. The staff at the CBD Secretariat also deserve thanks: Valerie Normand, Beatriz Gomez, Austein McLoughlin, Kristina Taboulchanas, Worku Yifru and David Cooper.
Conflict of interest statement: Marcel Jaspars is a co-founder of, has shares in, and is a consultant to GyreOx Ltd, a company that uses marine genetic resources from areas within national jurisdiction to develop potential drug molecules.
7. References
1. S. A. Laird, R. P. Wynberg (2018) Fact-Finding and Scoping Study on Digital Sequence Information on Genetic Resources in the Context of the Convention on Biological Diversity and the Nagoya Protocol (CBD/DSI/AHTEG/2018/1/3); available at https://www.cbd.int/doc/c/079f/2dc5/2d20217d1cdacac787524d8e/dsi-ahteg-2018-01-03-en.pdf.
2. E. Schrödinger (1944) What is life? The physical aspect of the living cell. Cambridge University Press, UK.
3. O. T. Avery, C. M. MacLeod, M. McCarty (1944) Studies on the chemical nature of the substance inducing transformation of pneumococcal types. J. Exp. Med. 79: 137-158.
4. E. Vischer, E. Chargaff (1948) The separation and quantitative estimation of purines and pyrimidines in minute amounts. J. Biol. Chem. 176: 703-714.
5. A. J. F. Griffiths, S. R. Wessler, S. B. Carroll, J. Doebley (2012) Introduction to genetic analysis (third edition). W. H. Freeman and Company, New York.
6. L. Pray (2008) Discovery of DNA structure and function: Watson and Crick. Nature Education 1: 100.
7. A. Rich, D. Davies (1956) A new two-stranded helical structure: polyadenylic acid and polyuridylic acid. J. Am. Chem. Soc. 78: 3548-3549.
8. F. H. Crick (1958) On protein synthesis. In F. K. Sanders (ed.). Symposia of the Society of Experimental Biology. Number XII: The biological replication of macromolecules. Cambridge University Press, pp. 138-163.
9. R. S. Gardner, A. J. Wahba, C. Basilio, R. S. Miller, P. Lengyel, J. F. Speyer (1962) Synthetic polynucleotides and the amino acid code, VII. Proc. Natl. Acad. Sci. USA 48: 2087-2094.
10. A. J. Wahba, R. S. Gardner, C. Basilio, R. S. Miller, J. F. Speyer, P. Lengyel (1963) Synthetic polynucleotides and the amino acid code, VIII. Proc. Natl. Acad. Sci. USA 49: 116-122.
11. S. Horowitz, M. A. Gorovsky (1985) An unusual genetic code in nuclear genes of Tetrahymena. Proc. Natl. Acad. Sci. USA 82: 2452-2455.
12. S. T. Sharfstein (2018) Non-protein biologic therapeutics. Curr. Opin. Biotechnol. 53: 65-75.
13. C. Böhler, P. E. Nielsen, L. E. Orgel (1995) Template switching between PNA and RNA oligonucleotides. Nature 376: 578-581.
14. S. Hoshika, N. A. Leal, M.-J. Kim, M.-S. Kim, N. B. Karalkar, H. Kim, et al. (2019) Hachimoji DNA and RNA: A genetic system with eight building blocks. Science 363: 884-887.
15. F. Sanger, S. Nicklen, A. R. Coulson (1977) DNA sequencing with chain-terminating inhibitors.
1
Proc. Natl. Acad. Sci. USA 74: 5463-5467.
2
16. L. J. Kahl, D. Endy (2013) A survey of enabling technologies in synthetic biology. J. Biol. Eng. 7:
3
13.
4
17. F. Sanger, A. R. Coulson, G. F. Hong, D. F. Hill, G. B. Petersen (1982) Nucleotide sequence of
5
bacteriophage λ DNA. J. Mol. Biol. 162: 729-773.
6
18. D. Sim, I. Sudbury, N. E. Ilott, A. Heger, C. P. Ponting (2014) Sequencing depth and coverage:
7
key considerations in genomic analyses. Nat. Rev. Genet. 15: 121-132.
8
19. Mardis, E. R. (2017) “DNA sequencing technologies: 2006-2016”. Nature protocols, 12, 213-218.
9
20. R. D. Fleischmann, M. D. Adams, O. White, R. A. Clayton, E. F. Kirkness, et al. (1995) Whole-
10
genome random sequencing and assembly of Haemophilus influenza Rd. Science 269: 496-498.
11
21. M. D. Adams, S. E. Celniker, R. A. Holt, C. A. Evans, J. D. Gocayne, et al. (2000) The Genome
12
Sequence of Drosophila melanogaster. Science 287: 2185-2195.
13
22. R. H. Waterston, K. Lindblad-Toh, E. Birney, J. Rogers, J. F. Abril, et al. (2002) Initial sequencing
14
and comparative analysis of the mouse genome. Nature 420: 520-562.
15
23. J. Yu, S. Hu, J. Wang, G. K. Wong, S. Li, et al. (2002) A Draft Sequence of the Rice Genome
16
(Oryza sativa L. ssp. indica). Science 296: 79-92.
17
24. S. A. Goff, D. Ricke, T. Lan, G. Presting, R. Wang, M. Dunn, et al. (2002) A Draft Sequence of the
18
Rice Genome (Oryza sativa L. ssp. japonica). Science 296: 92-100.
19
25. H. A. Lewin, G. E. Robinson, W. J. Kress, W. J. Baker, J. Coddington et al. (2018) Earth
20
BioGenome Project: Sequencing life for the future of life. Proc. Natl. Acad. Sci. USA 115: 4325-
21
4333.
22
26. T. Woyke, G. Xie, A. Copeland, J. M. González, C. Han, et al. (2009) Assembling the marine
23
metagenome, one cell at a time. PLoS One 4: e5299.
24
27. S. Mariani, C. Baillie, G. Colosimo, A. Riesgo (2019) Sponges as natural environmental DNA
25
samplers. Curr. Biol. 29: R401-R402.
26
28. J. Eberwine, J. Y. Sul, T. Bartfai, J. Kim (2014) The promise of single-cell sequencing. Nat.
27
Methods 11: 25-27.
28
29. K. Chen, L. Pachter (2005) Bioinformatics for whole-genome shotgun sequencing of microbial
29
communities. PLoS Comput. Biol. 1: 106-112.
30
30. P. Horvath, R. Barrangou (2010) CRISPR/Cas, the Immune System of Bacteria and Archaea. Science 327: 167-170.
31. S. Mukherjee (2016) The Gene: An Intimate History. Vintage, London.
32. A. Cheng, T. K. Lu (2012) Synthetic biology: an emerging engineering discipline. Annu. Rev. Biomed. Eng. 14: 155-178.
33. H. Neumann (2012) Rewriting translation - genetic code expansion and its applications. FEBS Lett. 586: 2057-2064.
34. M. J. Lajoie, A. J. Rovner, D. B. Goodman, H. Aerni, A. D. Haimovich, et al. (2013) Genomically recoded organisms expand biological functions. Science 342: 357-360.
35. J. C. Anderson, N. Wu, S. W. Santoro, V. Lakshman, D. S. King, et al. (2004) An expanded genetic code with a functional quadruplet codon. Proc. Natl. Acad. Sci. USA 101: 7566-7571.
36. C. B. Anfinsen (1972) Studies on the principles that govern the folding of protein chains. Nobel Lecture; available at: https://www.nobelprize.org/uploads/2018/06/anfinsen-lecture.pdf
37. 13th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction; available at: http://predictioncenter.org/casp13/
38. P. Crews, J. Rodriguez, M. Jaspars (2010) Organic Structure Analysis. Oxford University Press, New York.
39. A. Noreña-P, A. G. Muñoz, J. Mosquera-Rendón, K. Botero, M. A. Cristancho (2018) Colombia, an unknown genetic diversity in the era of Big Data. BMC Genomics 19: 859.
40. G. M. Connette, P. Oswald, M. K. Thura, K. J. LaJeunesse Connette, M. E. Grindley, et al. (2017) Rapid forest clearing in a Myanmar proposed national park threatens two newly discovered species of geckos (Gekkonidae: Cyrtodactylus). PLoS ONE 12: e0174432.
41. T. E. Berry, B. J. Saunders, M. L. Coghlan, M. Stat, S. Jarman, et al. (2019) Marine environmental DNA biomonitoring reveals seasonal patterns in biodiversity and identifies ecosystem responses to anomalous climatic events. PLoS Genet. 15: e1007943.
42. S. P. Iglésias, L. Toulhoat, D. Y. Sellos (2010) Taxonomic confusion and market mislabelling of threatened skates: important consequences for their conservation status. Aquatic Conserv: Mar. Freshw. Ecosyst. 20: 319-333.
43. S. J. Helyar, A. D. Lloyd, M. de Bruyn, J. Leake, N. Bennett, G. R. Carvalho (2014) Fish Product Mislabelling: Failings of Traceability in the Production Chain and Implications for Illegal, Unreported and Unregulated (IUU) Fishing. PLoS ONE 9: e98691.
44. https://agfundernews.com/agrifood-tech-in-2018_17bn_funding_breakout-year.html
45. A. Barone, L. Frusciante (2007) Molecular marker-assisted selection for resistance to pathogens in tomato. In: “Marker-assisted selection: current status and future perspectives in crops, livestock, forestry and fish” [E. P. Guimarães, J. Ruane, B. D. Scherf, A. Sonnino, J. D. Dargie (eds.)], Food and Agriculture Organization of the United Nations (Rome).
46. J. I. Weller (2007) Marker-assisted selection in dairy cattle. In: “Marker-assisted selection: current status and future perspectives in crops, livestock, forestry and fish” [E. P. Guimarães, J. Ruane, B. D. Scherf, A. Sonnino, J. D. Dargie (eds.)], Food and Agriculture Organization of the United Nations (Rome).
47. P. Gallusci, Z. Dai, M. Génard, A. Gauffreteau, N. Leblanc-Fournier, C. Richard-Molard, D. Vile, S. Brunel-Muguet (2017) Epigenetics for Plant Improvement: Current Knowledge and Modeling Avenues. Trends Plant Sci. 22: 610-623.
48. N. M. Springer, R. J. Schmitz (2017) Exploiting induced and natural epigenetic variation for crop improvement. Nat. Rev. Genet. 18: 563-575.
49. M. Eldakak, S. M. Milad, A. I. Nawar, J. S. Rohila (2013) Proteomics: a biotechnology tool for crop improvement. Front. Plant Sci. 4: 35.
50. N. Amiour, S. Imbaud, G. Clément, N. Agier, M. Zivy, B. Valot, et al. (2012) The use of metabolomics integrated with transcriptomic and proteomic studies for identifying key steps involved in the control of nitrogen metabolism in crops such as maize. J. Exp. Bot. 63: 5017-5033.
51. A. Kamthan, A. Chaudhuri, M. Kamthan, A. Datta (2015) Small RNAs in plants: recent development and application for crop improvement. Front. Plant Sci. 6: 208.
52. T. Wang, H. Zhang, H. Zhu (2019) CRISPR technology is revolutionizing the improvement of tomato and other fruit crops. Horticult. Res. 6: 77.
53. Z. H. Lemmon, N. T. Reem, J. Dalrymple, S. Soyk, K. E. Swartwood, D. Rodriguez-Leal, J. Van Eck, Z. B. Lippman (2018) Rapid improvement of domestication traits in an orphan crop by genome editing. Nat. Plants 4: 766-770.
54. A. Zsögön, T. Čermák, E. R. Naves, M. M. Notini, K. H. Edel, S. Weinl, et al. (2018) De novo domestication of wild tomato using genome editing. Nat. Biotechnol. 36: 1211-1216.
55. J. K. Jansson, K. S. Hofmockel (2018) The soil microbiome - from metagenomics to metaphenomics. Curr. Opin. Microbiol. 43: 162-168.
56. An explanatory guide to the Nagoya Protocol on access and benefit-sharing; available at: https://www.iucn.org/content/explanatory-guide-nagoya-protocol-access-and-benefit-sharing
57. https://www.alliedmarketresearch.com/synthetic-biology-market
58. P. H. Nielsen (2005) Life cycle assessment supports cold-wash enzymes. Int. J. Appl. Sci. 10: 131.
59. J. Bjerre, O. Simonsen, J. Vind (2013) Household and personal care today. Vol. 8, p. 37.
60. T. H. Richardson, X. Tan, G. Frey, W. Callen, M. Cabell, et al. (2002) A novel, high performance enzyme for starch liquefaction - Discovery and optimization of a low pH, thermostable α-amylase. J. Biol. Chem. 277: 26501-26507.
61. R. Sadre, P. Kuo, J. Chen, Y. Yang, A. Banerjee, C. Benning, B. Hamberger (2019) Cytosolic lipid droplets as engineered organelles for production and accumulation of terpenoid biomaterials in leaves. Nat. Commun. 10: 853.
62. X. Luo, M. A. Reiter, L. d’Espaux, J. Wong, C. M. Denby, A. Lechner, et al. (2019) Complete biosynthesis of cannabinoids and their unnatural analogues in yeast. Nature 567: 123-126.
63. V. Chubukov, A. Mukhopadhyay, C. J. Petzold, J. D. Keasling, H. G. Martin (2016) Synthetic and systems biology for microbial production of commodity chemicals. NPJ Syst. Biol. Appl. 2: 16009.
64. A. Cravens, J. Payne, C. D. Smolke (2019) Synthetic biology strategies for microbial biosynthesis of plant natural products. Nat. Commun. 10: 2142.
65. C. J. Paddon, J. D. Keasling (2014) Semi-synthetic artemisinin: a model for the use of synthetic biology in pharmaceutical development. Nat. Rev. Microbiol. 12: 355-367.
66. The Global Use of Medicine in 2019 and Outlook to 2023; available at: https://www.iqvia.com/insights/the-iqvia-institute/reports/the-global-use-of-medicine-in-2019-and-outlook-to-2023
67. https://www.genengnews.com/a-lists/top-15-best-selling-drugs-of-2018/
68. https://www.webmd.com/drug-medication/news/20150508/most-prescribed-top-selling-drugs
69. D. J. Newman, G. M. Cragg (2016) Natural Products as Sources of New Drugs from 1981 to 2014. J. Nat. Prod. 79: 629-661.
70. https://www.bloomberg.com/press-releases/2019-05-22/infectious-disease-diagnostics-market-high-prevalence-of-infectious-diseases-driving-the-industry-growth
71. G. Poste (2001) Molecular diagnostics: a powerful new component of the healthcare value chain. Expert Rev. Mol. Diagn. 1: 1-5.
72. P. Qin, M. Park, K. J. Alfson, M. Tamhankar, R. Carrion, et al. (2019) Rapid and Fully Microfluidic Ebola Virus Detection with CRISPR-Cas13a. ACS Sens. 4: 1048-1054.
73. https://www.ncbi.nlm.nih.gov/pathogens/
74. Recommendation adopted by the Subsidiary Body on Scientific, Technical and Technological Advice. Twenty-second meeting, Montreal, Canada, 2-7 July 2018; available at: https://www.cbd.int/doc/recommendations/sbstta-22/sbstta-22-rec-01-en.pdf
75. A. Traoré. The dematerialization of plant genetic resources: A peasant’s perspective; available at: https://www.righttofoodandnutrition.org/files/2._eng_the_dematerialization_of_plant_genetic_resources.pdf
76. C. Lawson, F. Humphries, M. Rourke (2019) The future of information under the CBD, Nagoya Protocol, Plant Treaty, and PIP Framework. J. World Intellect. Prop. 22: 1-17.
77. International Chamber of Commerce submission to the CBD (2016) Digital Sequence Information; available at: https://iccwbo.org/publication/digital-sequence-information/
78. M. A. Bagley, A. K. Rai (2014) The Nagoya Protocol and synthetic biology research: A look at the potential impacts. Washington, DC.
79. The World Health Organization (2011) Pandemic Influenza Preparedness Framework for the sharing of influenza viruses and access to vaccines and other benefits; available at: https://www.who.int/influenza/resources/pip_framework/en/
80. Submission of views and information on benefit sharing arrangements from commercial and non-commercial use of digital sequence information on genetic resources; available at: https://www.cbd.int/abs/DSI-views/2019/NHM-RBGK-RBGE-DSI.pdf