Elsevier Full-Text Article

The sequence of sequencers: The history of sequencing DNA

Determining the order of nucleic acid residues in biological samples is an integral component of a wide variety of research applications. Over the last fifty years large numbers of researchers have applied themselves to the production of techniques and technologies to facilitate this feat, sequencing DNA and RNA molecules. This time-scale has witnessed tremendous changes, moving from sequencing short oligonucleotides to millions of bases, from struggling towards the deduction of the coding sequence of a single gene to rapid and widely available whole genome sequencing. This article traverses those years, iterating through the different generations of sequencing technology, highlighting some of the key discoveries, researchers, and sequences along the way.

  • We review the drastic changes to DNA sequencing technology over the last 50 years.
  • First-generation methods enabled sequencing of clonal DNA populations.
  • The second-generation massively increased throughput by parallelizing many reactions.
  • Third-generation methods allow direct sequencing of single DNA molecules.

1. Introduction

“... [A] knowledge of sequences could contribute much to our understanding of living matter.” Frederick Sanger [1]

The order of nucleic acids in polynucleotide chains ultimately contains the information for the hereditary and biochemical properties of terrestrial life. Therefore the ability to measure or infer such sequences is imperative to biological research. This review deals with how researchers throughout the years have addressed the problem of how to sequence DNA, and the characteristics that define each generation of methodologies for doing so.

2. First-generation DNA sequencing

Watson and Crick famously solved the three-dimensional structure of DNA in 1953, working from crystallographic data produced by Rosalind Franklin and Maurice Wilkins [2], [3], which contributed to a conceptual framework for both DNA replication and encoding proteins in nucleic acids. However, the ability to ‘read’ or sequence DNA did not follow for some time. Strategies developed to infer the sequence of protein chains did not seem to readily apply to nucleic acid investigations: DNA molecules were much longer and made of fewer units that were more similar to one another, making it harder to distinguish between them [4]. New tactics needed to be developed.

Initial efforts focused on sequencing the most readily available populations of relatively pure RNA species, such as microbial ribosomal or transfer RNA, or the genomes of single-stranded RNA bacteriophages. Not only could these be readily bulk-produced in culture, but they are also not complicated by a complementary strand, and are often considerably shorter than eukaryotic DNA molecules. Furthermore, RNase enzymes able to cut RNA chains at specific sites were already known and available. Despite these advantages, progress remained slow, as the techniques available to researchers – borrowed from analytical chemistry – were only able to measure nucleotide composition, and not order [5]. However, by combining these techniques with selective ribonuclease treatments to produce fully and partially degraded RNA fragments [6] (and incorporating the observation that RNA contained a different nucleotide base [7]), in 1965 Robert Holley and colleagues were able to produce the first whole nucleic acid sequence, that of alanine tRNA from Saccharomyces cerevisiae [8]. In parallel, Fred Sanger and colleagues developed a related technique based on the detection of radiolabelled partial-digestion fragments after two-dimensional fractionation [9], which allowed researchers to steadily add to the growing pool of ribosomal and transfer RNA sequences [10], [11], [12], [13], [14]. It was also by using this 2-D fractionation method that Walter Fiers' laboratory was able to produce the first complete protein-coding gene sequence in 1972, that of the coat protein of bacteriophage MS2 [15], followed four years later by its complete genome [16].

It was around this time that various researchers began to adapt their methods in order to sequence DNA, aided by the recent purification of bacteriophages with DNA genomes, providing an ideal source for testing new protocols. Making use of the observation that Enterobacteria phage λ possessed 5′ overhanging ‘cohesive’ ends, Ray Wu and Dale Kaiser used DNA polymerase to fill the ends in with radioactive nucleotides, supplying each nucleotide one at a time and measuring incorporation to deduce sequence [17], [18]. It was not long before this principle was generalized through the use of specific oligonucleotides to prime the DNA polymerase. Incorporation of radioactive nucleotides could then be used to infer the order of nucleotides anywhere, not just at the termini of bacteriophage genomes [19], [20], [21]. However, the actual determination of bases was still restricted to short stretches of DNA, and still typically involved a considerable amount of analytical chemistry and fractionation procedures.

The next practical change to make a large impact was the replacement of 2-D fractionation (which often consisted of both electrophoresis and chromatography) with a single separation by polynucleotide length via electrophoresis through polyacrylamide gels, which provided much greater resolving power. This technique was used in two influential yet complex protocols from the mid-1970s: Alan Coulson and Sanger's ‘plus and minus’ system in 1975 and Allan Maxam and Walter Gilbert's chemical cleavage technique [22], [23]. The plus and minus technique used DNA polymerase to synthesize from a primer, incorporating radiolabelled nucleotides, before performing two further polymerisation reactions: a ‘plus’ reaction, in which only a single type of nucleotide is present, so that all extensions end with that base, and a ‘minus’ reaction, in which the other three are used, which produces sequences up to the position before the next missing nucleotide. By running the products on a polyacrylamide gel and comparing between the eight lanes, one is able to infer the nucleotide at each position in the covered sequence (except for those which lie within a homopolymer, i.e. a run of the same nucleotide). It was using this technique that Sanger and colleagues sequenced the first DNA genome, that of bacteriophage ϕX174 (or ‘PhiX’, which enjoys a position in many sequencing labs today as a positive control genome) [24]. While still using polyacrylamide gels to resolve DNA fragments, the Maxam and Gilbert technique differed significantly in its approach. Instead of relying on DNA polymerase to generate fragments, radiolabelled DNA is treated with chemicals which break the chain at specific bases; after running on a polyacrylamide gel the length of cleaved fragments (and thus position of specific nucleotides) can be determined and therefore sequence inferred (see Fig. 1, right). This was the first technique to be widely adopted, and thus might be considered the real birth of ‘first-generation’ DNA sequencing.

Fig. 1. First-generation DNA sequencing technologies. Example DNA to be sequenced (a) is illustrated undergoing either Sanger (b) or Maxam–Gilbert (c) sequencing. (b): Sanger's ‘chain-termination’ sequencing. Radio- or fluorescently-labelled ddNTP nucleotides of a given type – which, once incorporated, prevent further extension – are included in DNA polymerisation reactions at low concentrations (primed off a 5′ sequence, not shown). Therefore in each of the four reactions, sequence fragments are generated with 3′ truncations as a ddNTP is randomly incorporated at a particular instance of that base (underlined 3′ terminal characters). (c): Maxam and Gilbert's ‘chemical sequencing’ method. DNA must first be labelled, typically by inclusion of radioactive ³²P in its 5′ phosphate moiety (shown here by Ⓟ). Different chemical treatments are then used to selectively remove the base from a small proportion of DNA sites. Hydrazine removes bases from pyrimidines (cytosine and thymine), while hydrazine in the presence of high salt concentrations can only remove those from cytosine. Acid can then be used to remove the bases from purines (adenine and guanine), with dimethyl sulfate being used to attack guanines (although adenine will also be affected to a much lesser extent). Piperidine is then used to cleave the phosphodiester backbone at the abasic site, yielding fragments of variable length. (d): Fragments generated from either methodology can then be visualized via electrophoresis on a high-resolution polyacrylamide gel: sequences are then inferred by reading ‘up’ the gel, as the shorter DNA fragments migrate fastest. In Sanger sequencing (left) the sequence is inferred by finding the lane in which the band is present for a given site, as the 3′ terminating labelled ddNTP corresponds to the base at that position. Maxam–Gilbert sequencing requires a small additional logical step: Ts and As can be directly inferred from a band in the pyrimidine or purine lanes respectively, while G and C are indicated by the presence of dual bands in the G and A + G lanes, or C and C + T lanes respectively.
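
The ‘additional logical step’ needed to read a Maxam–Gilbert gel can be sketched in a few lines of Python. This is only a toy illustration: the function name, the lane labels and the use of simple position indices in place of measured fragment lengths are all assumptions made for the example, and real gels are of course noisier than clean sets of bands.

```python
def maxam_gilbert_call(lanes):
    """Infer a sequence from idealized Maxam-Gilbert lanes.

    `lanes` maps the hypothetical lane labels 'G', 'A+G', 'C' and 'C+T'
    to the set of positions at which a band is seen in that lane.
    """
    positions = sorted(set().union(*lanes.values()))
    calls = []
    for p in positions:
        if p in lanes["G"]:        # band in both G and A+G lanes
            calls.append("G")
        elif p in lanes["A+G"]:    # band in the purine lane only
            calls.append("A")
        elif p in lanes["C"]:      # band in both C and C+T lanes
            calls.append("C")
        elif p in lanes["C+T"]:    # band in the pyrimidine lane only
            calls.append("T")
    return "".join(calls)


# Idealized bands for the four-base sequence GATC (positions 1 to 4).
example_lanes = {"G": {1}, "A+G": {1, 2}, "C": {4}, "C+T": {3, 4}}
print(maxam_gilbert_call(example_lanes))
```

Checking the base-specific lane first is what resolves the ambiguity: a band that appears only in a combined lane must belong to the other base of that pair.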

However, the major breakthrough that forever altered the progress of DNA sequencing technology came in 1977, with the development of Sanger's ‘chain-termination’ or dideoxy technique [25]. The chain-termination technique makes use of chemical analogues of the deoxyribonucleotides (dNTPs) that are the monomers of DNA strands. Dideoxynucleotides (ddNTPs) lack the 3′ hydroxyl group that is required for extension of DNA chains, and therefore cannot form a bond with the 5′ phosphate of the next dNTP [26]. Mixing radiolabelled ddNTPs into a DNA extension reaction at a fraction of the concentration of standard dNTPs results in DNA strands of each possible length being produced, as the dideoxy nucleotides get randomly incorporated as the strand extends, halting further progression. By performing four parallel reactions, each containing a single ddNTP base, and running the results on four lanes of a polyacrylamide gel, one is able to use autoradiography to infer what the nucleotide sequence in the original template was, as there will be a radioactive band in the corresponding lane at that position of the gel (see Fig. 1, left). While working on the same principle as other techniques (that of producing all possible incremental length sequences and labelling the ultimate nucleotide), its accuracy, robustness and ease of use led the dideoxy chain-termination method – or simply, Sanger sequencing – to become the most common technology used to sequence DNA for years to come.
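
The read-out logic of the four-lane gel can be illustrated with a short Python sketch. It is deliberately idealized and is not a model of the real chemistry: it assumes that every position terminates in at least one molecule of the appropriate reaction, ignores the primer, and uses hypothetical function names.

```python
def chain_termination_lanes(synthesized):
    """Return, for each ddNTP reaction, the lengths of terminated fragments.

    `synthesized` is the strand built by the polymerase (5' to 3').
    Idealized assumption: every position terminates in some molecule.
    """
    lanes = {base: set() for base in "ACGT"}
    for length, base in enumerate(synthesized, start=1):
        # A fragment of this length ends in the dideoxy form of `base`,
        # so its band appears in that base's lane.
        lanes[base].add(length)
    return lanes


def read_gel(lanes):
    """Read 'up' the gel: shortest fragment first, one band per length."""
    length_to_base = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(length_to_base[length] for length in sorted(length_to_base))


lanes = chain_termination_lanes("GATTACA")
print(sorted(lanes["A"]))  # fragment lengths seen in the ddATP lane
print(read_gel(lanes))     # reconstructs the synthesized strand
```

Because each fragment length occurs in exactly one lane, sorting the bands by length and noting the lane of each is enough to recover the sequence.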

A number of improvements were made to Sanger sequencing in the following years, which primarily involved the replacement of phospho- or tritium-radiolabelling with fluorometric-based detection (allowing the reaction to occur in one vessel instead of four) and improved detection through capillary-based electrophoresis. Both of these improvements contributed to the development of increasingly automated DNA sequencing machines [27], [28], [29], [30], [31], [32], [33], and subsequently the first crop of commercial DNA sequencing machines [34], which were used to sequence the genomes of increasingly complex species.

These first-generation DNA sequencing machines produce reads slightly less than one kilobase (kb) in length: in order to analyse longer fragments researchers made use of techniques such as ‘shotgun sequencing’, where overlapping DNA fragments were cloned and sequenced separately, and then assembled into one long contiguous sequence (or ‘contig’) in silico [35], [36]. The development of techniques such as polymerase chain reaction (PCR) [37], [38] and recombinant DNA technologies [39], [40] further aided the genomics revolution by providing means of generating the high concentrations of pure DNA species required for sequencing. Improvements in sequencing also occurred by less direct routes. For instance, the Klenow fragment DNA polymerase – a fragment of the Escherichia coli DNA polymerase that lacks 5′ to 3′ exonuclease activity, produced through protease digestion of the native enzyme [41] – had originally been used for sequencing due to its ability to incorporate ddNTPs efficiently. However, more sequenced genomes and tools for genetic manipulation provided the resources to find polymerases that were better at accommodating the additional chemical moieties of the increasingly modified dNTPs used for sequencing [42]. Eventually, newer dideoxy sequencers – such as the ABI PRISM range developed from Leroy Hood's research, produced by Applied Biosystems [43], which allowed simultaneous sequencing of hundreds of samples [44] – came to be used in the Human Genome Project, helping to produce the first draft of that mammoth undertaking years ahead of schedule [45], [46].
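
The idea of assembling overlapping shotgun reads into a contig in silico can be sketched with a toy greedy overlap merger. This is an illustrative assumption rather than the algorithm used by any particular assembler: it presumes error-free reads that all come from the same strand, and the function names and the `min_len` threshold are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [
            c for k, c in enumerate(contigs) if k not in (best_i, best_j)
        ] + [merged]
    return contigs


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

A greedy strategy like this is easily misled by repeated sequence, which is one reason why read length matters so much for assembly.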

3. Second-generation DNA sequencing

Concurrent with the development of large-scale dideoxy sequencing efforts, another technique appeared that set the stage for the first wave in the next generation of DNA sequencers. This method markedly differed from existing methods in that it did not infer nucleotide identity through using radio- or fluorescently-labelled dNTPs or oligonucleotides before visualising with electrophoresis. Instead researchers utilized a recently discovered luminescent method for measuring pyrophosphate synthesis: this consisted of a two-enzyme process in which ATP sulfurylase is used to convert pyrophosphate into ATP, which is then used as the substrate for luciferase, thus producing light proportional to the amount of pyrophosphate [47]. This approach was used to infer sequence by measuring pyrophosphate production as each nucleotide is washed through the system in turn over the template DNA affixed to a solid phase [48]. Note that despite the differences, both Sanger's dideoxy and this pyrosequencing method are ‘sequence-by-synthesis’ (SBS) techniques, as they both require the direct action of DNA polymerase to produce the observable output (in contrast to the Maxam–Gilbert technique). This pyrosequencing technique, pioneered by Pål Nyrén and colleagues, possessed a number of features that were considered beneficial: it could be performed using natural nucleotides (instead of the heavily-modified dNTPs used in the chain-termination protocols), and observed in real time (instead of requiring lengthy electrophoreses) [49], [50], [51]. Later improvements included attaching the DNA to paramagnetic beads, and enzymatically degrading unincorporated dNTPs to remove the need for lengthy washing steps. The major difficulty posed by this technique is finding out how many of the same nucleotide there are in a row at a given position: the intensity of light released corresponds to the length of the homopolymer, but noise produced a non-linear readout above four or five identical nucleotides [51]. Pyrosequencing was later licensed to 454 Life Sciences, a biotechnology company founded by Jonathan Rothberg, where it evolved into the first major successful commercial ‘next-generation sequencing’ (NGS) technology.
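
The relationship between nucleotide flows, light intensity and homopolymer length can be sketched as follows. This is a simplified illustration under stated assumptions: the signal is taken to be exactly proportional to the number of bases incorporated, the flow order `TACG` is an arbitrary choice for the example, and the function names are hypothetical.

```python
def flowgram(synthesized, flow_order="TACG", max_flows=100):
    """Idealized light signal for each nucleotide flow.

    `synthesized` is the strand being built. Each flow extends the strand
    through every consecutive matching base at once, so the signal equals
    the homopolymer length (zero if the flowed base does not match).
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(synthesized) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
        flow += 1
    return signals


def call_bases(signals):
    """Convert (base, intensity) pairs back to sequence by rounding."""
    return "".join(base * round(intensity) for base, intensity in signals)


signals = flowgram("TTAGGGC")
print(signals)
print(call_bases(signals))
```

Rounding works well for short runs, but a noisy intensity of, say, 5.4 versus 5.6 yields a different homopolymer length, which mirrors the difficulty described above for runs longer than four or five identical nucleotides.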

The sequencing machines produced by 454 (later purchased by Roche) were a paradigm shift in that they allowed the mass parallelisation of sequencing reactions, greatly increasing the amount of DNA that can be sequenced in any one run [52]. Libraries of DNA molecules are first attached to beads via adapter sequences, which then undergo a water-in-oil emulsion PCR (emPCR) [53] to coat each bead in a clonal DNA population, where ideally on average one DNA molecule ends up on one bead, which amplifies in its own droplet in the emulsion (see Fig. 2a and c). These DNA-coated beads are then washed over a picoliter reaction plate that fits one bead per well; pyrosequencing then occurs as smaller bead-linked enzymes and dNTPs are washed over the plate, and pyrophosphate release is measured using a charge-coupled device (CCD) sensor beneath the wells. This setup was capable of producing reads around 400–500 base pairs (bp) long, for the million or so wells that would be expected to contain suitably clonally-coated beads [52]. This parallelisation increased the yield of sequencing efforts by orders of magnitude, for instance allowing researchers to completely sequence a single human's genome – that belonging to DNA structure pioneer, James Watson – far quicker and cheaper than a similar effort by DNA-sequencing entrepreneur Craig Venter's team using Sanger sequencing the preceding year [54], [55]. The first high-throughput sequencing (HTS) machine widely available to consumers was the original 454 machine, called the GS 20, which was later superseded by the 454 GS FLX, which offered a greater number of reads (by having more wells in the ‘picotiter’ plate) as well as better quality data [56]. This principle of performing huge numbers of parallel sequencing reactions on a micrometer scale – often made possible as a result of improvements in microfabrication and high-resolution imaging – is what came to define the second-generation of DNA sequencing [57].

Fig. 2. Second-generation DNA sequencing parallelized amplification. (a): DNA molecules being clonally amplified in an emulsion PCR (emPCR). Adapter ligation and PCR produces DNA libraries with appropriate 5′ and 3′ ends, which can then be made single stranded and immobilized onto individual suitably oligonucleotide-tagged microbeads. Bead-DNA conjugates can then be emulsified using aqueous amplification reagents in oil, ideally producing emulsion droplets containing only one bead (illustrated in the two leftmost droplets, with different molecules indicated in different colours). Clonal amplification then occurs during the emPCR as each template DNA is physically separate from all others, with daughter molecules remaining bound to the microbeads. This is the conceptual basis underlying sequencing in 454, Ion Torrent and polony sequencing protocols. (b): Bridge amplification to produce clusters of clonal DNA populations in a planar solid-phase PCR reaction, as occurs in Solexa/Illumina sequencing. Single-stranded DNA with terminating sequences complementary to the two lawn-oligos will anneal when washed over the flow-cell, and during isothermal PCR will replicate in a confined area, bending over to prime at neighbouring sites, producing a local cluster of identical molecules. (c) and (d) demonstrate how these two different forms of clonally-amplified sequences can then be read in a highly parallelized manner: emPCR-produced microbeads can be washed over a picotiter plate, containing wells large enough to fit only one bead (c). DNA polymerase can then be added to the wells, and each nucleotide can be washed over in turn, and dNTP incorporation monitored (e.g. via pyrophosphate or hydrogen ion release). Flow-cell bound clusters produced via bridge amplification (d) can be visualized by detecting fluorescent reversible-terminator nucleotides at the ends of a proceeding extension reaction, requiring cycle-by-cycle measurements and removal of terminators.

A number of parallel sequencing techniques sprang up following the success of 454. The most important among them is arguably the Solexa method of sequencing, which was later acquired by Illumina [56]. Instead of parallelising by performing bead-based emPCR, adapter-bracketed DNA molecules are passed over a lawn of complementary oligonucleotides bound to a flow-cell; a subsequent solid phase PCR produces neighbouring clusters of clonal populations from each of the individual original flow-cell binding DNA strands [58], [59]. This process has been dubbed ‘bridge amplification’, due to replicating DNA strands having to arch over to prime the next round of polymerisation off neighbouring surface-bound oligonucleotides (see Fig. 2b and d) [56]. Sequencing itself is achieved in an SBS manner using fluorescent ‘reversible-terminator’ dNTPs, which cannot immediately bind further nucleotides as the fluorophore occupies the 3′ hydroxyl position; this must be cleaved away before polymerisation can continue, which allows the sequencing to occur in a synchronous manner [60]. These modified dNTPs and DNA polymerase are washed over the primed, single-stranded flow-cell bound clusters in cycles. At each cycle, the identity of the incorporating nucleotide can be monitored with a CCD by exciting the fluorophores with appropriate lasers, before enzymatic removal of the blocking fluorescent moieties and continuation to the next position. While the first Genome Analyzer (GA) machines were initially only capable of producing very short reads (up to 35 bp long) they had an advantage in that they could produce paired-end (PE) data, in which the sequence at both ends of each DNA cluster is recorded. This is achieved by first obtaining one SBS read from the single-stranded flow-cell bound DNA, before performing a single round of solid-phase DNA extension from remaining flow-cell bound oligonucleotides and removing the already-sequenced strand. Having thus reversed the orientation of the DNA strands relative to the flow-cell, a second read is then obtained from the opposite end of the molecules to the first. As the input molecules are of an approximate known length, having PE data provides a greater amount of information. This improves the accuracy when mapping reads to reference sequences, especially across repetitive sequences, and aids in detection of spliced exons and rearranged DNA or fused genes. The standard Genome Analyzer version (GAIIx) was later followed by the HiSeq, a machine capable of even greater read length and depth, and then the MiSeq, which was a lower-throughput (but lower cost) machine with faster turnaround and longer read lengths [61], [62].
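
What a pair of reads looks like relative to the original fragment can be shown with a minimal Python sketch. It is an assumption-laden simplification (error-free reads, adapters ignored, hypothetical function names) intended only to show that the second read comes from the opposite strand and the opposite end.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_len):
    """Return (read 1, read 2) for an idealized paired-end run.

    Read 1 is taken from the start of the fragment; read 2 is taken from
    the start of the opposite strand, i.e. the far end of the fragment.
    """
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment)[:read_len]
    return read1, read2


read1, read2 = paired_end_reads("ATGCCGTTAGCA", read_len=4)
print(read1, read2)
```

Because the approximate fragment length is known, a mapper can require the two reads to align on opposite strands at roughly that distance apart, which is the extra information paired-end data provides across repeats.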

A number of other sequencing companies, each hosting their own novel methodologies, have also appeared (and disappeared) and had variable impacts upon both what experiments are feasible and the market at large. In the early years of second-generation sequencing perhaps the third major option (alongside 454 and Solexa/Illumina sequencing) [63] was the sequencing by oligonucleotide ligation and detection (SOLiD) system from Applied Biosystems (which became Life Technologies following a merger with Invitrogen) [64]. As its name suggests, SOLiD sequenced not by synthesis (i.e. catalysed with a polymerase), but by ligation, using a DNA ligase, building on principles established previously with the open-source ‘polony’ sequencing developed in George Church's group [65]. While the SOLiD platform is not able to produce the read length and depth of Illumina machines [66], making assembly more challenging, it has remained competitive on a cost per base basis [67]. Another notable technology based on sequence-by-ligation was Complete Genomics' ‘DNA nanoballs’ technique, where sequences are obtained similarly from probe-ligation but the clonal DNA population generation is novel: instead of a bead or bridge amplification, rolling circle amplification is used to generate long DNA chains consisting of repeating units of the template sequence bordered by adapters, which then self assemble into nanoballs, which are affixed to a slide to be sequenced [68]. The last remarkable second-generation sequencing platform is that developed by Jonathan Rothberg after leaving 454. Ion Torrent (another Life Technologies product) is the first so-called ‘post-light sequencing’ technology, as it uses neither fluorescence nor luminescence [69]. In a manner analogous to 454 sequencing, beads bearing clonal populations of DNA fragments (produced via an emPCR) are washed over a picowell plate, followed by each nucleotide in turn; however nucleotide incorporation is measured not by pyrophosphate release, but the difference in pH caused by the release of protons (H⁺ ions) during polymerisation, made possible using the complementary metal-oxide-semiconductor (CMOS) technology used in the manufacture of microprocessor chips [69]. This technology allows for very rapid sequencing during the actual detection phase [67], although as with 454 (and all other pyrosequencing technologies) it is less able to readily interpret homopolymer sequences due to the loss of signal as multiple matching dNTPs incorporate [70].

The oft-described ‘genomics revolution’, driven in large part by these remarkable changes in nucleotide sequencing technology, has drastically altered the cost and ease associated with DNA sequencing. The capabilities of DNA sequencers have grown at a rate even faster than that seen in the computing revolution described by Moore's law: the complexity of microchips (measured by number of transistors per unit cost) doubles approximately every two years, while sequencing capabilities between 2004 and 2010 doubled every five months [71]. The various offshoot technologies are diverse in their chemistries, capabilities and specifications, providing researchers with a diverse toolbox with which to design experiments. However, in recent years the Illumina sequencing platform has been the most successful, to the point of near monopoly [72], and thus can probably be considered to have made the greatest contribution to the second-generation of DNA sequencers.
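
The difference between these two doubling times compounds quickly, as a back-of-the-envelope calculation shows. The sketch below assumes clean exponential growth over the whole six-year period, which is an idealization of the figures quoted above; the function name is hypothetical.

```python
def fold_increase(months_elapsed, doubling_time_months):
    """Fold change after a period of steady doubling."""
    return 2 ** (months_elapsed / doubling_time_months)


months = 6 * 12  # 2004 to 2010

# Doubling every two years, as described by Moore's law.
print(fold_increase(months, 24))

# Doubling every five months, as quoted for sequencing capability.
print(fold_increase(months, 5))
```

Under these assumptions a two-year doubling time gives an eight-fold increase over six years, whereas a five-month doubling time gives an increase of roughly four orders of magnitude.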

4. Third-generation DNA sequencing

There is considerable discussion about what defines the different generations of DNA sequencing technology, particularly regarding the division from second to third [73], [74], [75], [76]. Arguments are made that single molecule sequencing (SMS), real-time sequencing, and simple divergence from previous technologies should be the defining characteristics of the third-generation. It is also feasible that a particular technology might straddle the boundary. Here we consider third generation technologies to be those capable of sequencing single molecules, negating the requirement for DNA amplification shared by all previous technologies.

The first SMS technology was developed in the lab of Stephen Quake [77], [78], later commercialized by Helicos BioSciences, and worked broadly in the same manner that Illumina does, but without any bridge amplification; DNA templates become attached to a planar surface, and then proprietary fluorescent reversible terminator dNTPs (so-called ‘virtual terminators’ [79]) are washed over one base at a time and imaged, before cleavage and cycling the next base over. While relatively slow and expensive (and producing relatively short reads), this was the first technology to allow sequencing of non-amplified DNA, thus avoiding all associated biases and errors [73], [75]. As Helicos filed for bankruptcy early in 2012 [80], other companies took up the third-generation baton.

At the time of writing, the most widely used third-generation technology is probably the single molecule real time (SMRT) platform from Pacific Biosciences [81], available on the PacBio range of machines. During SMRT runs DNA polymerisation occurs in arrays of microfabricated nanostructures called zero-mode waveguides (ZMWs), which are essentially tiny holes in a metallic film covering a chip. These ZMWs exploit the properties of light passing through apertures of a diameter smaller than its wavelength, which causes it to decay exponentially, exclusively illuminating the very bottom of the wells. This allows visualisation of single fluorophore molecules close to the bottom of the ZMW, due to the zone of laser excitation being so small, even over the background of neighbouring molecules in solution [82]. Deposition of single DNA polymerase molecules inside the ZMWs places them inside the illuminated region (Fig. 3a): by washing over the DNA library of interest and fluorescent dNTPs, the extension of DNA chains by single nucleotides can be monitored in real time, as the fluorescent nucleotides being incorporated – and only those nucleotides – will provide detectable fluorescence, after which the dye is cleaved away, ending the signal for that position [83]. This process can sequence single molecules in a very short amount of time. The PacBio range possesses a number of other advantageous features that are not widely shared among other commercially available machines. As sequencing occurs at the rate of the polymerase it produces kinetic data, allowing for detection of modified bases [84]. PacBio machines are also capable of producing incredibly long reads, up to and exceeding 10 kb in length, which are useful for de novo genome assemblies [73], [81].

Fig. 3. Third-generation DNA sequencing nucleotide detection. (a): Nucleotide detection in a zero-mode waveguide (ZMW), as featured in PacBio sequencers. DNA polymerase molecules are attached to the bottom of each ZMW (*), and target DNA and fluorescent nucleotides are added. As the diameter is narrower than the excitation light's wavelength, illumination rapidly decays travelling up the ZMW: nucleotides being incorporated during polymerisation at the base of the ZMW provide real-time bursts of fluorescent signal, without undue interference from other labelled dNTPs in solution. (b): Nanopore DNA sequencing as employed in ONT's MinION sequencer. Double stranded DNA gets denatured by a processive enzyme (†) which ratchets one of the strands through a biological nanopore (‡) embedded in a synthetic membrane, across which a voltage is applied. As the ssDNA passes through the nanopore the different bases prevent ionic flow in a distinctive manner, allowing the sequence of the molecule to be inferred by monitoring the current at each channel.

Perhaps the most anticipated area for third-generation DNA sequencing development is the promise of nanopore sequencing, itself an offshoot of a larger field of using nanopores for the detection and quantification of all manner of biological and chemical molecules [85] . The potential for nanopore sequencing was first established even before second-generation sequencing had emerged, when researchers demonstrated that single-stranded RNA or DNA could be driven across a lipid bilayer through large α -hemolysin ion channels by electrophoresis. Moreover, passage through the channel blocks ion flow, decreasing the current for a length of time proportional to the length of the nucleic acid [86] . There is also the potential to use non-biological, solid-state technology to generate suitable nanopores, which might also provide the ability to sequence double stranded DNA molecules [87] , [88] . Oxford Nanopore Technologies (ONT), the first company offering nanopore sequencers, has generated a great deal of excitement over their nanopore platforms GridION and MinION ( Fig. 3 b) [89] , [90] , the latter of which is a small, mobile phone sized USB device, which was first released to end users in an early access trial in 2014 [91] . Despite the admittedly poor quality profiles currently observed, it is hoped that such sequencers represent a genuinely disruptive technology in the DNA sequencing field, producing incredibly long read (non-amplified) sequence data far cheaper and faster than was previously possible [92] , [90] , [85] . Already MinIONs have been used on their own to generate bacterial genome reference sequences [93] , [94] and targeted amplicons [95] , [96] , or used to generate a scaffold to map Illumina reads to [97] , [98] , [96] , combining the ultra long read length of the nanopore technology and the high read depth and accuracy afforded by the short read sequencing. 
The fast run times and compact nature of the MinION machine also present the opportunity to decentralize sequencing, in a move away from the core services that are common today. They can even be deployed in the field, as proved by Joshua Quick and Nicholas Loman earlier this year when they sequenced Ebola viruses in Guinea two days after sample collection [99] . Nanopore sequencers could therefore revolutionize not just the composition of the data that can be produced, but where and when it can be produced, and by whom.

5. Conclusions

It is hard to overstate the importance of DNA sequencing to biological research; at the most fundamental level it is how we measure one of the major properties by which terrestrial life forms can be defined and differentiated from each other. Therefore over the last half century many researchers from around the globe have invested a great deal of time and resources in developing and improving the technologies that underpin DNA sequencing. At the genesis of this field, working primarily from accessible RNA targets, researchers would spend years laboriously producing sequences that might number from a dozen to a hundred nucleotides in length. Over the years, innovations in sequencing protocols, molecular biology and automation increased the technological capabilities of sequencing while decreasing the cost, allowing the reading of DNA hundreds of basepairs in length, massively parallelized to produce gigabases of data in one run. Researchers moved from the lab to the computer, from poring over gels to running code. Genomes were decoded, papers published, companies started – and often later dissolved – with repositories of DNA sequence data growing all the while. Therefore DNA sequencing – in many respects a relatively recent and forward-focussed research discipline – has a rich history. An understanding of this history can provide appreciation of current methodologies and provide new insights for future ones, as lessons learnt in the previous generation inform the progress of the next.

Soft Computing for Security Applications pp 723–732

Classification of DNA Sequence Using Machine Learning

  • Satya Sandeep Kanumalli 17 ,
  • S. Swathi 17 ,
  • K. Sukanya 17 ,
  • V. Yamini 17 &
  • N. Nagalakshmi 17  
  • Conference paper
  • First Online: 30 September 2022

Part of the Advances in Intelligent Systems and Computing book series (AISC,volume 1428)

In medical informatics research, gene sequences are widely used as the basis for classification, and biochemistry is one of the application areas of machine learning (ML). Bioinformatics is an interdisciplinary science that uses computing and communication science to understand biological data. One of its most difficult tasks is to distinguish between normal genes and disease-causing genes. Classifying gene sequences into existing categories is used in genomic research to discover the functions of novel proteins, so it is critical to identify and categorise such genes. We employ ML classification methods to distinguish between infected and normal genes. AdaBoost achieves a high degree of precision; relative to the bagging algorithm and the Random Forest algorithm, AdaBoost takes the weight of each classifier fully into account. An AdaBoost-based learning approach generates a sequence of weak classifiers and is used to find the most ‘informative’ or ‘discriminating’ features. The identification cascade structure can also help to limit false-positive results. This study provides an overview of the mechanics of gene sequence classification using ML techniques, including a brief introduction to bioinformatics and important challenges in DNA sequencing with ML.
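As a rough illustration of the weighting idea described in this abstract, the sketch below trains AdaBoost with one-feature decision stumps on 2-mer counts of toy DNA strings. The feature choice, the toy sequences and labels, and all function names are assumptions made for this example; none of them come from the paper, whose actual features and data are not given here.

```python
import math
from itertools import product

def kmer_counts(seq, k=2):
    """Count overlapping k-mers over the alphabet ACGT, in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        sub = seq[i:i + k]
        if sub in counts:
            counts[sub] += 1
    return [counts[km] for km in kmers]

def stump_predict(x, feature, threshold, polarity):
    """Weak classifier: vote `polarity` if the feature exceeds the threshold."""
    return polarity if x[feature] > threshold else -polarity

def fit_adaboost(X, y, rounds=5):
    """Return a list of (alpha, feature, threshold, polarity) weak classifiers."""
    n = len(X)
    weights = [1.0 / n] * n              # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for f in range(len(X[0])):
            for t in sorted({x[f] for x in X}):
                for pol in (1, -1):
                    err = sum(w for x, label, w in zip(X, y, weights)
                              if stump_predict(x, f, t, pol) != label)
                    if best is None or err < best[0]:
                        best = (err, f, t, pol)
        err, f, t, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this classifier
        ensemble.append((alpha, f, t, pol))
        # Up-weight misclassified samples, down-weight correct ones.
        weights = [w * math.exp(-alpha * label * stump_predict(x, f, t, pol))
                   for x, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all weak classifiers; returns +1 or -1."""
    score = sum(a * stump_predict(x, f, t, pol) for a, f, t, pol in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: GC-rich strings labelled +1, AT-rich strings -1.
toy_sequences = ["GCGCGCGC", "GGCCGGCC", "CGCGGCGC",
                 "ATATATAT", "AATTAATT", "TATAATAT"]
toy_labels = [1, 1, 1, -1, -1, -1]
toy_features = [kmer_counts(s) for s in toy_sequences]
toy_ensemble = fit_adaboost(toy_features, toy_labels, rounds=3)
print(predict(toy_ensemble, kmer_counts("GGGGCCCC")))
```

The per-classifier weight `alpha` grows as the weighted error shrinks, which is the sense in which AdaBoost weights each classifier; a plain bagging vote would treat every member equally.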

  • Machine learning
  • DNA sequencing
  • AdaBoost algorithm
  • Bioinformatics

Author information

Authors and affiliations.

CSE Department, Vignan’s Nirula Institute of Technology and Science for Women, Guntur, India

Satya Sandeep Kanumalli, S. Swathi, K. Sukanya, V. Yamini & N. Nagalakshmi

Corresponding author

Correspondence to Satya Sandeep Kanumalli .

Editor information

Editors and affiliations.

Department of Electronics and Communication Engineering, Gnanamani College of Technology, Namakkal, Tamil Nadu, India

G. Ranganathan

Ryerson Communications Lab, Toronto, ON, Canada

Xavier Fernando

Department of Information Systems, University of Florida, Gainesville, FL, USA

Selwyn Piramuthu

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Kanumalli, S.S., Swathi, S., Sukanya, K., Yamini, V., Nagalakshmi, N. (2023). Classification of DNA Sequence Using Machine Learning. In: Ranganathan, G., Fernando, X., Piramuthu, S. (eds) Soft Computing for Security Applications. Advances in Intelligent Systems and Computing, vol 1428. Springer, Singapore. https://doi.org/10.1007/978-981-19-3590-9_57

DOI : https://doi.org/10.1007/978-981-19-3590-9_57

Published : 30 September 2022

Publisher Name : Springer, Singapore

Print ISBN : 978-981-19-3589-3

Online ISBN : 978-981-19-3590-9

eBook Packages : Intelligent Technologies and Robotics (R0)

  • Open access
  • Published: 25 July 2022

A review of deep learning applications in human genomics using next-generation sequencing data

  • Wardah S. Alharbi 1 &
  • Mamoon Rashid   ORCID: orcid.org/0000-0003-1457-477X 1  

Human Genomics volume 16, Article number: 26 (2022)

Genomics is advancing towards data-driven science. With the advent of high-throughput data-generating technologies in human genomics, we are overwhelmed by the volume of genomic data. Artificial intelligence, especially deep learning, has been instrumental in extracting knowledge and patterns from these data. In the current review, we address the development and application of deep learning methods/models in different subareas of human genomics. We assessed which areas of genomics are over- and under-charted by deep learning techniques. The deep learning algorithms underlying the genomic tools are discussed briefly in the later part of this review. Finally, we briefly discuss the latest applications of deep learning tools in genomics. In conclusion, this review is timely for biotechnology and genomic scientists, guiding them on why, when and how to use deep learning methods to analyse human genomic data.

Introduction

Understanding the genomes of diverse species, specifically, the examination of more than 3 billion base-pairs of Homo sapiens DNA, is a crucial aim of genomic studies. Genomics takes a comprehensive view that implicates all the genes within an organism, including protein-coding genes, RNA genes, cis- and trans- elements, etc. It is a data-driven science involving the high-throughput technological development of next-generation sequencing (NGS) that generates the entire DNA data of an organism. These techniques include whole genome sequencing (WGS), whole exome sequencing (WES), transcriptomic and proteomic profiling [ 1 , 2 , 3 , 4 , 5 ]. With the recent rapid accumulation of these omics data, increased attention has been paid to bioinformatics and machine learning (ML) tools with established superior performance in several genomics implementations [ 6 ]. These implementations involve finding a genotype–phenotype correlation, biomarker identification and gene function prediction, as well as mapping the biomedically active genomic regions, for example, transcriptional enhancers [ 7 , 8 , 9 , 10 ].

Machine learning (ML) is regarded as a core technology in artificial intelligence (AI): it enables algorithms to make critical predictions by learning from data rather than simply following instructions. It has broad technological applications; however, standard ML methods are too narrow to deal with complex, natural, high-dimensional raw data, such as those of genomics. Alternatively, the deep learning (DL) approach is a promising and exciting field currently employed in genomics. It is an ML derivative that extracts features automatically by applying neural networks (NN) [ 11 , 12 , 13 , 14 ]. Deep learning has been effectively applied in fields such as image recognition, audio classification, natural language processing, online web tools, chatbots and robotics. In this regard, DL is well suited as a genomic methodology for analysing large amounts of data. While it is still in its infancy, DL in genomics holds the promise of transforming areas such as clinical genetics and functional genomics [ 15 ]. DL algorithms have come to dominate computational modelling approaches and are now regularly extended to address a variety of genomics questions, ranging from understanding the effects of mutations on protein–RNA binding [ 16 ], prioritising variants and genes, diagnosing patients with rare genetic disorders [ 17 ] and predicting gene expression levels from histone modification data [ 18 ] to identifying trait-associated single-nucleotide polymorphisms (SNPs) [ 19 ].

Although the first concepts of DL theory, which originated in the 1980s, were based on the perceptron model and the neuron concept [ 20 ], it is only within the last decade that DL algorithms have become a state-of-the-art predictive technology for big data [ 21 , 22 , 23 ]. The initial efficient implementation of DL prediction models in genomics was in the 2000s (Fig.  1 ) [ 24 ]. DL models must be trained on enormous datasets and need powerful computing resources, which limited their application until the introduction of modern hardware, such as high-efficiency graphical processing units (GPUs) with equivalent structures. Now, the architectures of DL models (also known as DNNs) are implemented in diverse areas, as mentioned earlier. Classical neural networks consist of only two to three hidden layers; however, DL networks extend this up to 200 layers. Thus, the word “deep” reflects the number of layers that the information passes through. However, DL requires superior hardware and substantial parallelism to be applicable [ 25 ]. To overcome hardware limitations and demanding resource requirements, several DL packages and resources were introduced to facilitate DL model implementation (discussed in section  deep learning resources for genomics ).

figure 1

Timeline of implementing deep learning algorithms in genomics. This timeline plot demonstrates the delay in implementing DL tools in genomics; for example, both the LSTM and BLSTM algorithms were invented in 1997, while their first genomic application was implemented in 2015. Similar observations hold for the rest of the deep learning algorithms (Table 6 )

The evolution of software, hardware (GPUs) and big data in genomics has facilitated the development of deep learning-based prediction models for the prediction of functional elements in genomes. Working from NGS data, these models predict splice sites in genomic DNA, predict transcription factor binding sites (TFBSs) via classification tasks, classify the pathogenicity of missense mutations and predict drug response and synergy [ 26 , 27 , 28 , 29 , 30 , 31 ]. An example of a technological evolution that has enhanced DL implementation is cloud platforms, which provide GPU resources as a DL solution. GPUs can considerably increase training speed because neural network training parallelises well for certain model architectures, permitting fast mathematical processing through larger numbers of processing units and high memory capacities. Primary examples of cloud computing platforms include Amazon Web Services, Google Compute Engine and Microsoft Azure. However, these solutions still require users to implement the model code [ 32 ].

For all ML models, evaluation metrics are essential for understanding model performance. They are particularly important for genomic datasets, which naturally contain highly imbalanced classes that make them demanding for ML and DL models. Several remedies are commonly applied in this case, such as transfer learning [ 33 ] and the Matthews correlation coefficient (MCC) [ 34 ]. Broadly, every ML task can be divided into a regression task (e.g. predicting certain outcomes/effects of a disease) or a classification task (e.g. predicting the presence/absence of a disease), and each has its own set of performance metrics. Commonly used, though not exhaustive, metrics for regression-based methods include: mean absolute error (MAE), mean squared error (MSE), root-mean-squared error (RMSE) and the coefficient of determination (R²). In contrast, the performance metrics for classification-based methods include: accuracy, the confusion matrix, area under the curve (AUC), also reported as area under the receiver operating characteristic curve (AUROC), and the F1-score. Classification tasks are the most common in genomics research, and these metrics are used to compare the performance of different models. For example, AUC is the most widely used metric for evaluating model performance and ranges from 0 to 1. It summarises the trade-off between the true-positive rate (TPR, or sensitivity) and the false-positive rate (FPR, which equals one minus the true-negative rate (TNR), or specificity) across classification thresholds. Additionally, the F1-score is used to assess model performance on highly imbalanced datasets and is the harmonic mean of precision and recall (also ranging from 0 to 1). For both AUC and F1-score, a greater value reflects better model performance. The confusion matrix gives a complete picture of model performance by tabulating true-positive, true-negative, false-positive and false-negative counts; accuracy is then the sum of the true-positive and true-negative values divided by the total number of samples [ 35 , 36 ]. For a greater understanding of the ML evaluation metrics (purpose, calculation, etc.), recommended papers include Handelman et al. (2019) and England and Cheng (2019).
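The metrics named in this paragraph can be written out in a few lines. The following is a minimal sketch using only the Python standard library; the function names and the toy label vectors are assumptions chosen for illustration, and binary labels are assumed to be coded as 1 (positive) and 0 (negative).

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MCC from the confusion matrix."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and the coefficient of determination."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2}

# Toy imbalanced example: one positive in ten, classifier always says "negative".
imbalanced_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
print(classification_metrics(imbalanced_true, always_negative))
```

In the toy case the always-negative classifier scores high accuracy while F1 and MCC both fall to zero, which is why accuracy alone is a poor guide on imbalanced genomic classes.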

This article reviews deep learning tools/methods based on their current applications in human genomics. We began by collecting recent (i.e. published in 2015–2020) DL tools in five main genomics areas: variant calling and annotation, disease variants, gene expression and regulation, epigenomics and pharmacogenomics. Then, we briefly discussed DL genomics-based algorithms and their application strategies and data structure. Finally, we mentioned DL-based practical resources to facilitate DL adoption that would be extremely beneficial mostly to biomedical researchers and scientists working in human genomics. For further information on the field of DL applications in genomics, we recommend: [ 37 , 38 , 39 ].

Deep learning tools/software/pipelines in genomics

Multiple genomic disciplines (e.g. variant calling and annotation, disease variant prediction, gene expression and regulation, epigenomics and pharmacogenomics) take advantage of generating high-throughput data and utilising the power of deep learning algorithms for sophisticated predictions (Fig.  2 ). The modern evolution of DNA/RNA sequencing technologies and machine learning algorithms especially deep learning opens a new chapter of research capable of transforming big biological data into new knowledge or novel findings in all subareas of genomics. The following sections will discuss the latest software/tools/pipelines developed using deep learning algorithms in various genomics areas.

figure 2

Deep learning applications in genomics. This figure represents the application of deep learning tools in five major subareas of genomics. One example deep learning tool and underlying network architecture has been shown for each of the genomic subareas, and its input data type and the predictive output were mentioned briefly. Each bar plot depicts the frequency of most used deep learning algorithms underlying deep learning tools in that subarea of genomics (Tables 1 , 2 , 3 , 4 , 5 )

Variant calling and annotation

This first section discusses the applications of the latest DL algorithms in variant calling and annotation. We provided a short list of tools/algorithms for variant calling and annotation with their source code links, if available (Table 1 ), to facilitate the selection of the most suitable DL tool for a particular data type.

NGS, including whole genome or exome, sets the stage for early developments in personalised medicine, along with its known implications in Mendelian disease research. With the advent of massively parallel, high-throughput sequencing, sequencing thousands of human genomes to identify genetic variations has become a routine practice in genomics, including cancer research. Sophisticated bioinformatics and statistical frameworks are available for variant calling.

The weakness of high-throughput sequencing procedures is represented by significantly high technical and bioinformatics error rates [ 40 , 41 , 42 ]. Numerous computational problems have originated due to the enormous amounts of medium or low coverage genome sequences, short read fragments and genetic variations among individuals [ 43 ]. Such weaknesses make the NGS data dependent on bioinformatics tools for data interpretation. For instance, several variant calling tools are broadly used in clinical genomic variant analyses, such as genome analysis toolkit (GATK) [ 44 ], SAMtools [ 45 ], Freebayes [ 46 ] and Torrent Variant Caller (TVC; [ 47 ]). However, despite the availability of whole genome sequencing, some actual variants are yet to be discovered [ 48 ].

Contemporary deep learning tools have been proposed in the field of next-generation sequencing to overcome the limitations of conventional interpretation pipelines. For example, Kumaran et al . demonstrated that combining DeepVariant, a deep learning-based variant caller, with conventional variant callers (e.g. SAMtools and GATK) improved the accuracy scores of single-nucleotide variants and Indel detections [ 49 ]. Implementing deep learning algorithms in DNA sequencing data interpretation is in its infancy, as seen with the recent pioneering example, DeepVariant, developed by Google. DeepVariant relies on the graphical dissimilarities in input images to perform the classification task for genetic variant calling from NGS short reads. It treats the mapped sequencing datasets as images and converts the variant calls into image classification tasks [ 30 ]. However, this model does not provide details about the variant information, for example, the exact alternative allele and type of variant. As such, it is classified as an incomplete variant caller model [ 50 ].
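To make the idea of treating mapped reads as an image more concrete, here is a deliberately simplified sketch that turns ungapped read alignments into a small rows × columns × channels array. It is not DeepVariant's actual encoding: the five channels (one-hot base plus a mismatch flag), the function name and the toy reference and reads are all assumptions made for this example.

```python
# Hypothetical channel layout: 0-3 = one-hot base (A, C, G, T), 4 = mismatch flag.
BASE_CHANNEL = {"A": 0, "C": 1, "G": 2, "T": 3}

def pileup_tensor(reference, reads, window_start, window_len, max_reads):
    """Encode reads overlapping a reference window as a nested-list tensor.

    reads is a list of (start, sequence) pairs describing ungapped
    alignments; indels and base qualities are ignored in this sketch.
    """
    tensor = [[[0] * 5 for _ in range(window_len)] for _ in range(max_reads)]
    for row, (start, seq) in enumerate(reads[:max_reads]):
        for offset, base in enumerate(seq):
            pos = start + offset                 # reference coordinate
            col = pos - window_start             # column inside the window
            if 0 <= col < window_len and base in BASE_CHANNEL:
                tensor[row][col][BASE_CHANNEL[base]] = 1
                tensor[row][col][4] = int(base != reference[pos])
    return tensor

# Toy data: two of three reads carry a T where the reference has A (position 4).
toy_reference = "ACGTACGT"
toy_reads = [(0, "ACGTA"), (2, "GTTCG"), (3, "TTCGT")]
toy_tensor = pileup_tensor(toy_reference, toy_reads,
                           window_start=2, window_len=5, max_reads=3)

# Number of reads disagreeing with the reference in each window column.
mismatches_per_column = [sum(row[col][4] for row in toy_tensor)
                         for col in range(5)]
print(mismatches_per_column)
```

An array of this shape can be handed to any image classifier, which is the sense in which variant calling becomes an image classification task; a real caller would add further evidence per read.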

Later, several DL models for variant calling and annotation were introduced. For instance, Cai et al. (2019) introduced DeepSV, a genetic variant caller that aims to predict long genomic deletions (> 50 bp) extracted from sequencing read images but not other types of structural variants, such as long insertions or inversions. It takes BAM or VCF files as input and outputs the results in VCF format. DeepSV was evaluated against eight other deletion-calling tools and one machine learning-based tool called Concod [ 51 ]. The results reveal that although Concod has shorter training times when fewer training samples are used, DeepSV shows a higher accuracy score and lower training loss on the same dataset [ 52 ]. Another genomic variant filtering tool, GARFIELD-NGS, can be applied directly to variant caller outputs. It relies on an MLP algorithm to distinguish true from false variants in exome sequencing datasets generated on the Ion Torrent and Illumina platforms. It performs robustly on low-coverage data (up to 30X), taking a standard VCF file as input and producing another VCF file as output. Ravasio et al. (2018) observed that the GARFIELD-NGS model achieved a significant reduction in false candidate variants after applying a canonical pipeline for the variant prioritisation of disease-related data [ 53 ].

The Clairvoyante model was introduced to predict variant type (SNP or Indel), zygosity, alternative allele and Indel length. It thus overcomes the DeepVariant model’s drawback of lacking full variant details, including the precise alternative allele and variant type. The Clairvoyante model was specifically designed for long-read sequencing data generated by SMS technologies (e.g. PacBio and ONT), although it is generally applicable to short-read datasets as well [ 50 ]. Another variant caller and annotation model, Intelli-NGS, was introduced by Singh and Bhatia (2019). It is based on an artificial neural network (ANN) and uses data generated on the Ion Torrent platform to distinguish true from false variant calls effectively. Intelli-NGS takes any number of VCF files as batch inputs and processes them in order. For each VCF file, the processed data are written to an Excel sheet containing the HGVS codes of all variants [ 54 ]. All in all, several studies have confirmed the capabilities of deep learning in genetic variant calling and annotation from sequencing data.

Disease variants

Deep learning-based models for the prediction of pathogenic variants, their application and input/output formats with source codes (if available) are listed in Table 2 .

Taking into account extra data from patient relatives or relevant cohorts, medical geneticists frequently prioritise and filter the observed genetic variants after variant calling and annotation (Müller et al. [ 55 ]). Variant prioritisation is a method of determining, within genetic screening, the most likely pathogenic variant that damages gene function and underlies the disease phenotype [ 56 ]. It involves variant annotation to identify clinically insignificant variants, such as synonymous variants, deep-intronic variants and benign polymorphisms. Once these are excluded, the remaining variants, such as known variants or variants of unknown clinical significance (VUSs), become tractable [ 57 ]. Furthermore, difficulties in interpreting rare genetic variants in individuals and in understanding their impact on disorder risk limit the clinical utility of diagnostic sequencing. For example, the numerous and infrequent VUSs in rare genetic diseases represent a challenging obstacle to implementing sequencing for personalised medicine and healthy population assessment (Sundaram et al., 2018). Although statistical methods, such as GWAS, have had huge success in linking genetic variants to disorders, they still require heavy sampling to distinguish rare genetic variants and cannot deliver information about de novo variants (Fu et al., 2014). Current annotation approaches, such as PolyPhen [ 58 ], SIFT [ 59 ] and GERP [ 60 ], are beneficial for prioritising causative variants, despite some drawbacks. To address these problems, DL-based models exploit deep neural network (DNN) architectures to prioritise variants. One example is Basset, a variant annotator that relies on a CNN algorithm and is designed to predict causative SNPs using DNase I hypersensitivity sequencing data as input (Kelley, Snoek and Rinn, 2016).
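A CNN that works on DNA typically starts from a one-hot encoding of the sequence and slides learned filters along it. The sketch below shows that first step and one simple way of scoring a SNP, by comparing the strongest filter activation for the reference and alternative alleles. It illustrates the general idea only: the hand-written "TATA" filter, the scoring rule and the function names are assumptions for this example and are not taken from Basset.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def conv_scan(encoded, kernel):
    """Slide one convolutional filter along the sequence and apply ReLU."""
    width = len(kernel)
    scores = []
    for i in range(len(encoded) - width + 1):
        s = sum(encoded[i + j][c] * kernel[j][c]
                for j in range(width) for c in range(4))
        scores.append(max(0.0, s))
    return scores

def max_activation(seq, kernel):
    """Global max pooling over the filter's activations."""
    return max(conv_scan(one_hot(seq), kernel))

def variant_effect(ref_seq, position, alt_base, kernel):
    """Change in the pooled activation when one base is substituted."""
    alt_seq = ref_seq[:position] + alt_base + ref_seq[position + 1:]
    return max_activation(alt_seq, kernel) - max_activation(ref_seq, kernel)

# Hypothetical filter that simply rewards an exact match to "TATA";
# a trained network would learn its filter weights from data.
tata_filter = one_hot("TATA")
toy_ref = "GGTATAGG"
print(max_activation(toy_ref, tata_filter))
print(variant_effect(toy_ref, position=3, alt_base="C", kernel=tata_filter))
```

A negative effect indicates that the substitution weakens the motif match. Real models stack many such filters and layers, so the difference is taken between the network's final predictions instead of a single filter's activation.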

In silico prediction models cannot replace clinical and molecular validation; however, they can help to decrease waiting times for results and can prioritise variants for further functional analysis. These predictive models are mainly suitable when several poorly understood candidate variants are associated with certain phenotypes [ 27 ]. Medical genetics has been significantly transformed since the introduction of NGS technology, particularly WGS, because of its power to interpret genomic variation in both coding and non-coding regions across the entire human genome. Recently, several ML-based methods have been proposed to prioritise non-coding variants; still, the recognition of disease-associated variants in complex traits, such as cancers, is challenging. In addition, most of the positive variants associated with a given phenotype are required in order to predict general and precise novel correlations (Schubach et al., 2017). Lately, several DL approaches have been proposed to overcome these challenges. For example, the DeepWAS model relies on a CNN algorithm that predicts the regulatory impact of each variant on numerous cell-type-specific chromatin features. The key result of the DeepWAS model is the direct determination of disease-associated SNPs with a common effect on a certain chromatin trait in the related tissue. The DeepWAS model demonstrated the ability to detect disease-relevant, transcriptionally active genomic positions after combining expression and methylation quantitative-trait loci data (eQTL and meQTL, respectively) from various resources and tissues [ 19 ]. Moreover, several deep learning algorithms have been reported to discover novel genes. For this reason, deep learning approaches are particularly suited to investigating variants in genes not yet related to specific disease phenotypes [ 61 , 62 ].

Gene expression and regulation

In this section, we focused on the most efficient deep learning-based tools in the area of gene expression and regulation in the genome. We listed several models applying various deep learning algorithms and summarised the information and source codes mostly in splicing and gene expression applications, if available (Table 3 ).

Gene expression spans the steps from initial transcriptional regulation (e.g. transcription, pre-mRNA splicing and polyadenylation) to functional protein production [ 63 ]. High-throughput screening technologies that test thousands of synthetic sequences have provided rich knowledge concerning the quantitative regulation of gene expression, although with some limitations. The main limitation is that huge biological sequence regions cannot be explored using experimental or computational techniques [ 64 ]. Although recent NGS technology has provided great knowledge in the gene-regulation field, the majority of natural mRNA screening approaches still rely on chromatin accessibility, ChIP-seq and DNase-seq information, and they focus on studying promoter regions. Therefore, a robust method is required to understand the relationship between the various regions of gene regulatory structures and the expression of their networks [ 65 ]. Likewise, current RNA sequencing technology has enabled the direct sequencing of single cells, known as single-cell RNA sequencing (scRNA-seq), which permits querying biological systems at unprecedented resolution. For example, scRNA-seq data provide valuable information on cellular heterogeneity that could expand the interpretation of human diseases and biology [ 66 , 67 ]. The major applications of scRNA-seq data analysis involve detecting the type and state of cells [ 68 , 69 ]. However, the two main computational questions are how to cluster the data and how to retrieve them [ 70 ].

Deep learning has enabled essential progress in constructing predictive methods that link regulatory sequence elements to molecular phenotypes [ 71 , 72 , 73 , 74 ]. Recently, Gundogdu and colleagues (2022) demonstrated an excellent classification model based on deep neural networks (DNNs). It incorporates several types of prior biological information on functional networks between genes to learn a biologically meaningful representation of scRNA-seq data [ 70 ]. Moreover, Li et al. (2020) presented DESC, an unsupervised deep learning algorithm implemented in Python, which iteratively learns a cluster-specific representation of gene expression and the cluster assignments for scRNA-seq analysis [ 75 ]. A further deep neural network (DNN) model was designed to measure immune infiltration in colorectal and breast cancer bulk scRNA-seq data. This approach permits quantifying specific types of immune cells, such as CD8 + and CD4Tmem, as well as the general population of lymphocytes, together with stromal content and B cells [ 76 ].

Recently, Jaganathan et al. (2019) constructed SpliceAI, a deep residual neural network that predicts splice function using only pre-mRNA transcript sequence as input. The architecture contains 32 dilated convolutional layers, which identify sequence determinants spanning very large genomic distances, since splice donors and splice acceptors can be separated by tens of thousands of nucleotides [ 71 ].
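The reach of stacked dilated convolutions follows from simple arithmetic. The sketch below computes the receptive field of a stack of 1D convolutions with stride 1; the kernel widths and dilation rates are illustrative assumptions, not the published SpliceAI configuration.

```python
def receptive_field(layers):
    """Receptive field (in nucleotides) of stacked 1D convolutions, stride 1.

    `layers` is a list of (kernel_size, dilation) pairs. Each layer widens
    the field by (kernel_size - 1) * dilation positions.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Hypothetical stacks with the same depth and kernel width.
plain = [(11, 1)] * 8                      # no dilation
dilated = [(11, 1)] * 4 + [(11, 4)] * 4    # dilation of 4 in the last four layers

print(receptive_field(plain))    # 81
print(receptive_field(dilated))  # 201
```

With the same number of weights, the dilated stack sees more than twice as many nucleotides, which is why dilation is attractive when the relevant sites lie far apart.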

Many experimental datasets, such as those from ChIP-seq and DNase-seq assays, do not measure effects on gene expression directly; however, they are an ideal complement to deep neural network methods. For instance, Movva et al. (2019) introduced the MPRA-DragoNN model, a CNN-based architecture for predicting and analysing the transcriptional regulatory activity of non-coding DNA sequences measured by massively parallel reporter assays (MPRAs). The Sharpr-MPRA evaluation used approximately 16 K distinct regulatory regions in K562 and HepG2 cell lines, consisting of 295 bp cis -regulatory elements cloned upstream of either a minimal promoter or a strong promoter [ 77 ]. Another recent DL model, Xpresso, introduced by Agarwal and Shendure, is a deep convolutional neural network (CNN) that jointly models the promoter sequence and its associated mRNA stability features to predict mRNA expression levels. Interestingly, Xpresso models are simple to train for arbitrary cell types, even those lacking experimental information such as ChIP and DNase data [ 73 ]. Zhang Z. et al. (2019) developed a deep learning-based model called DARTS (deep learning augmented RNA-seq analysis of transcript splicing), which uses wide-ranging RNA-seq resources on alternative splicing. It consists of two main modules: a deep neural network (DNN) and Bayesian hypothesis testing (BHT) [ 78 ]. Further DL-based models (specifically, four different CNN architectures) were designed by Bretschneider et al. (2018) under the name competitive splice site model (COSSMO), which adapts to varying numbers of alternative splice sites and predicts their usage accurately, as assessed by genome-wide cross-validation. The frameworks consist of convolutional layers, communication layers, long short-term memory (LSTM) and residual networks, respectively, to discover relevant motifs from DNA sequences.

For every putative splice site, the model inputs are DNA and RNA sequences in 80-nucleotide windows around the alternative splice site and the opposite constitutive splice site, together with the intron length. The model outputs a predicted percent selected index (PSI) distribution over the putative splice sites. All COSSMO models outperform MaxEntScan; however, performance varied widely among the four frameworks, with the recurrent LSTM reaching the best accuracy, ahead of the communication networks, which do not consider splice-site ordering [ 79 ]. Overall, deep learning models offer unprecedented opportunities to learn relationships among heterogeneous datasets automatically, even in imperfect biological settings.
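To make the idea of a PSI distribution over competing splice sites concrete, the sketch below normalises one score per putative site into values that sum to 1. It assumes a softmax normalisation and uses made-up scores; it illustrates the output format only and is not the COSSMO implementation.

```python
import math


def psi_distribution(scores):
    """Turn one score per putative splice site into a distribution summing to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical scores for three competing splice sites of one event.
site_scores = [2.0, 0.5, -1.0]
psi = psi_distribution(site_scores)
print([round(p, 3) for p in psi])
```

Because the values are normalised across all candidate sites of an event, the same function works whether an event has two competing sites or twenty.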

Epigenomics

This section discusses some epigenomics challenges and summarises up-to-date deep learning models in epigenomics, their implementation, data types and source code (Table 4 ). Epigenetics refers to changes in phenotype that are not caused by changes in genotype. It is defined as the study of heritable changes in gene expression that do not involve changes to the DNA sequence [ 80 ]. Epigenomic mechanisms, including DNA methylation, histone modifications and non-coding RNAs, are considered fundamental to understanding disease development and finding new treatment targets. In clinical practice, however, epigenetics has yet to be fully exploited. Recent advances in next-generation sequencing and microarray technology for producing epigenetic data have created difficulties in developing data interpretation tools. The lack of suitable and efficient computational approaches has led current research to focus on individual epigenetic marks in isolation, although several marks interact with each other and with genotypes in vivo [ 81 ]. Several previous studies have demonstrated fundamental applications of deep learning models in epigenomics. They have achieved considerable success in predicting 3D chromatin interactions, methylation status from single-cell datasets and histone modification sites based on DNase-Seq data [ 62 , 82 , 83 , 84 ].

Liu et al. (2018) introduced a hybrid deep CNN model, Deopen, which predicts chromatin accessibility across the whole genome from learned regulatory DNA sequence codes. To systematically evaluate the ability of Deopen to capture the accessibility codes of a genome, a series of experiments was conducted from the perspective of binary classification [ 31 ]. As an example application, in the androgen-sensitive human prostate adenocarcinoma cell line LNCaP, EGR1, which was recovered by the Deopen model, is thought to play a critical role as a treatment target in gene therapy for prostate cancer [ 31 , 85 ]. Recently, Yin et al. (2019) proposed the DeepHistone framework, a CNN-based algorithm that predicts histone modifications for various site-specific markers. For precise predictions, this model combines DNA sequence data with chromatin accessibility information. It has shown the capability to discriminate functional SNPs from their adjacent genetic variants and could therefore be used to investigate the functional impact of putative disorder-related variants [ 84 ]. Hence, efficient deep learning models are necessary for genome research to elucidate the impact of epigenomic modifications on downstream outputs.
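One simple way to present DNA sequence and chromatin accessibility to a network together is to one-hot encode the bases and append the accessibility signal as an extra channel. The sketch below shows this encoding under that assumption; the function name and the example values are hypothetical, and DeepHistone itself may combine the two data types differently.

```python
BASES = "ACGT"


def encode_window(sequence, accessibility):
    """Return one row per position: four one-hot base channels plus one accessibility channel."""
    if len(sequence) != len(accessibility):
        raise ValueError("sequence and accessibility must have equal length")
    rows = []
    for base, signal in zip(sequence.upper(), accessibility):
        # Unknown bases such as N become an all-zero one-hot vector.
        one_hot = [1.0 if base == b else 0.0 for b in BASES]
        rows.append(one_hot + [float(signal)])
    return rows


# Hypothetical 5 bp window with a per-base accessibility signal.
features = encode_window("ACGTN", [0.0, 0.2, 0.9, 0.4, 0.1])
for row in features:
    print(row)
```

The resulting position-by-channel matrix is the usual input shape for a 1D convolutional layer.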

Pharmacogenomics

We list the most widely discussed deep learning models in pharmacogenomics, their common purposes, input/output formats and source code (Table 5 ). Despite great interest in deep learning approaches over the last few years, deep learning tools were, until very recently, rarely employed for pharmacogenomics problems such as predicting drug response [ 86 ]. Pharmacogenomics is the study of the association between genetic variants, from large gene clusters up to whole genomes, and the effects of different drugs [ 87 ]. A key challenge in modern therapeutics is understanding the mechanisms underlying variability in drug response. Sometimes the distribution of drug response across a population is clearly bimodal, suggesting a dominant role for a single variable, which is usually genetic. Alternatively, an understanding of the underlying pharmacokinetic or pharmacodynamic mechanisms can be used to identify candidate genes whose variants may explain differences in drug response [ 88 ]. Investigating the efficacy of drug combinations through clinical trial and error is time- and cost-intensive. Moreover, it can expose patients to unnecessarily risky therapy [ 89 , 90 ]. To identify alternative drug synergy strategies without harming patients, high-throughput screening (HTS) is used, in which several concentrations of a pair of drugs are applied to a cancer cell line [ 91 ]. Existing HTS synergy datasets have made it possible to use accurate computational models to explore the enormous space of possible synergies. Such reliable models would guide both in vitro and in vivo studies and are important steps towards personalised medicine. Examples include approaches for predicting anticancer synergy based on systems biology [ 92 ], kinetic methods [ 93 ] and in silico models of gene expression screening after single-drug and dose-response treatments [ 94 ]. Nonetheless, these approaches are limited to particular targets, pathways or cell lines and sometimes need a particular omics dataset of cell lines treated with specific compounds [ 95 ].

To investigate these pharmacogenomic associations, statistical tests such as the analysis of variance (ANOVA) are used. These can identify, for example, oncogenic alterations occurring in patients that are markers of differences in drug sensitivity in cell lines. To move beyond drug associations to actual predictions of drug response, numerous statistical and machine learning methods can be employed, from linear regression models to nonlinear ones, such as kernel methods, neural networks and SVM. A central weakness of these approaches is the very large number of input features relative to the small number of samples; in standard gene expression analysis, for example, the total number of input genes (or features) exceeds the number of samples. A recent strategy to overcome the low sample number is to use multitask models [ 96 ].
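As a worked illustration of the ANOVA step, the sketch below computes the one-way ANOVA F statistic for drug sensitivity measured in two groups of cell lines. The group labels and values are invented for the example; a real analysis would also convert F to a p-value and correct for multiple testing.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of groups of measurements."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    means = [sum(group) / len(group) for group in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual measurements around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical log-sensitivity values for cell lines with and without a mutation.
mutant = [1.0, 2.0, 3.0]
wild_type = [4.0, 5.0, 6.0]
f_stat = one_way_anova_f([mutant, wild_type])
print(f_stat)  # 13.5
```

A large F indicates that the difference between the group means is large relative to the spread within each group.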

Deep learning methods are reportedly well suited to treatment response prediction tasks based on cell-line omics datasets [ 95 , 97 ]. One example is DrugCell, a visible neural network (VNN) that provides an interpretable model of the structure and function of human cancer cells in response to therapy. It couples the inner workings of the model to the hierarchical structure of human cell biology, permitting the prediction of the response to any drug in any cancer and the intelligent design of effective treatment combinations. DrugCell was developed to capture both elements of therapy response in an explainable model with two branches: a VNN that embeds the cell genotype and an artificial neural network (ANN) that embeds the drug structure. The inputs to the VNN branch comprise text files describing the hierarchical relationships between molecular subsystems in human cells, which contain 2086 biological process terms from the Gene Ontology (GO) database. The inputs to the ANN branch, a conventional ANN, are text files containing the Morgan fingerprint of the drug, a canonical vector representation of its chemical structure. The outputs of these two branches are combined in a single layer of neurons that produces the response of a given genotype to a given therapy. Evaluating the prediction accuracy for each drug separately revealed a sub-population of drugs predicted with significant accuracy. This performance is competitive with the state-of-the-art regression methods applied in previous models to predict drug response. Additionally, when DrugCell was compared with a parallel neural network model trained only on drug structure and labelled tissue, DrugCell significantly outperformed the tissue-based model. This means that DrugCell has learned information from somatic mutations beyond what the tissue-only method captures [ 26 ].

Another recent model, DeepBL, is a deep learning architecture built on the Small VGGNet structure (a type of CNN) and the TensorFlow library. This approach takes protein sequences as input and detects beta-lactamases (BLs) and their classes, which confer resistance to beta-lactam antibiotics. It is based on large, well-annotated RefSeq datasets covering > 39 K BLs extracted from the NCBI database. DeepBL outperformed conventional machine learning algorithms, including SVM, RF, NB and LR, when evaluated on an independent test set comprising more than 10 K sequences [ 98 ]. Until very recently, deep learning applications in pharmacogenomics remained underexplored.

Deep learning algorithms/techniques used in genomics

The success of the recent models described in the section on deep learning tools/software/pipelines in genomics suggests that deep learning is a powerful technique in genomic research. Here, we focus on deep learning algorithms recently applied in genomic applications: convolutional neural networks (CNNs), feedforward neural networks (FNNs), natural language processing (NLP), recurrent neural networks (RNNs), long short-term memory networks (LSTMs), bidirectional long short-term memory networks (BLSTMs) and gated recurrent units (GRUs) (Table 6 ; Fig. 1 ).

Deep learning is a contemporary and rapidly expanding subarea of machine learning. It aims to model abstractions from large-scale data by employing multi-layered DNNs, thereby making sense of data such as images, sounds and text. Generally, deep learning has two features: first, it is built from multiple layers of nonlinear processing units, and second, feature extraction at each layer is either supervised or unsupervised [ 99 ]. The first deep learning architectures were built on artificial neural networks (ANNs) in the 1980s [ 100 ], but the real power of deep learning emerged in 2006 [ 101 , 102 ]. Since then, deep learning has been applied in various fields, including genomics, bioinformatics, drug discovery, automated speech recognition, image recognition and natural language processing [ 6 , 13 , 103 ].

Artificial neural networks (ANNs) were inspired by the neurons of the human brain and their networks [ 104 ]. They consist of groups of fully connected nodes, or neurons, that mimic the way stimuli propagate across synapses through the neural networks of the brain. This architecture is used for feature extraction, classification and dimensionality reduction, or as a sub-component of a deeper framework such as a CNN [ 105 ].

As mentioned earlier, multi-omics studies generate huge volumes of data, essentially because of progress in genomics and improvements in biotechnology. Typical examples include high-throughput technologies, which measure thousands of gene expression levels or non-coding transcripts, such as miRNAs. Moreover, genotyping platforms, NGS techniques and the associated GWAS, together with quantitative gene expression assays such as RNA-Seq, uncover numerous genetic variants along with further genomic alterations in various populations [ 11 ]. However, some DL models rely purely on DNA sequence data and therefore seemingly lack the power to make cell-line-specific predictions, because the DNA sequence is identical across cell lines. To overcome this deficiency, several hybrid deep learning models have been proposed, and in certain studies they showed clear improvements by combining DNA sequence data with information from biological experiments [ 84 ].

Feedforward neural networks (FNNs) are a type of artificial neural network in which information flows in one direction only, from the input layer through the hidden layers to the output layer, without forming loops, unlike RNNs [ 106 ]. In genomics, FNNs have been used to infer the expression of target genes from the expression of landmark genes in the D-GEX model [ 12 ]. In addition, active enhancers and promoters have been predicted across the human genome with the DECRES model [ 107 ], and anticancer drug synergy has been predicted with the DeepSynergy model [ 95 ].
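The one-directional flow of an FNN can be sketched in a few lines. The toy network below maps three input values to one output through a single hidden layer; the layer sizes, weights and the landmark/target gene framing are illustrative assumptions and do not reproduce D-GEX or any other published model.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: weights[j][i] links input i to output j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, layers):
    """Pass inputs through each (weights, biases) layer in order, never looping back."""
    activation = inputs
    for index, (weights, biases) in enumerate(layers):
        activation = dense(activation, weights, biases)
        if index < len(layers) - 1:  # ReLU on hidden layers, linear output layer
            activation = relu(activation)
    return activation


# Hypothetical toy network: 3 landmark genes -> 2 hidden units -> 1 target gene.
layers = [
    ([[0.5, -1.0, 0.0], [1.0, 1.0, 1.0]], [0.0, -1.0]),
    ([[2.0, 0.5]], [0.1]),
]
prediction = forward([1.0, 0.5, 2.0], layers)
print(prediction)
```

Training would adjust the weights and biases by backpropagation; only the forward pass is shown here.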

Convolutional neural networks (CNNs), also called ConvNets, are deep learning algorithms with a deep feedforward architecture consisting of various building blocks, such as convolution layers, pooling layers and fully connected layers [ 97 , 108 ]. In the fully connected layers, each node in one layer is connected to every node in the next layer. The convolution units in each CNN layer receive their input from units in the preceding layer, and together the layers generate a prediction. The key principle of such a deep architecture is that extensive processing and connectivity allow nonlinear associations between inputs and outputs to be inferred [ 109 , 110 ]. CNNs have most commonly been applied to image analysis and were initially conceived as a fully automated image interpreter for classifying handwritten characters [ 105 ].
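For sequence data, the convolution and pooling building blocks amount to sliding a small filter along a one-hot encoded sequence and keeping the strongest responses. The sketch below uses a hand-written filter for the motif TAT as an assumption for illustration; in a trained CNN the filter weights are learned from data.

```python
BASES = "ACGT"


def one_hot(sequence):
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def conv1d(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence and return one score per window."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        scores.append(sum(w * x
                          for k_row, x_row in zip(kernel, window)
                          for w, x in zip(k_row, x_row)))
    return scores


def max_pool(scores, size):
    """Non-overlapping max pooling over consecutive windows."""
    return [max(scores[i:i + size]) for i in range(0, len(scores), size)]


# Hypothetical filter that responds to the motif "TAT".
kernel = one_hot("TAT")
scores = conv1d(one_hot("GGTATACC"), kernel)
pooled = max_pool(scores, 3)
print(scores)  # [1.0, 0.0, 3.0, 0.0, 2.0, 0.0]
print(pooled)  # [3.0, 2.0]
```

The score peaks at the window that matches the motif exactly, and pooling keeps that peak while discarding its precise position.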

In genomics, CNNs are considered the dominant algorithm for exploiting genomic information (Fig. 2 ). The first CNN implementation, DeepBind, was proposed by [ 111 ] for predicting protein binding and showed greater predictive power than conventional models (Table 6 ). Further examples of CNNs used as a single algorithm in gene expression and regulation include the DeepExpression model, which has been effectively used to predict gene expression from promoter sequences and enhancer–promoter interactions [ 112 ]. The SpliceAI model was introduced to identify splice function from pre-mRNA sequence [ 71 ]. Further, the SPOT-RNA model was developed for predicting RNA secondary structure [ 16 ]. In DNA sequencing, CNNs have also been used to call genetic variants, as in the Clairvoyante, Intelli-NGS and DeepSV models [ 52 , 54 , 113 ]. In epigenomics, the DeepTACT model was used for predicting 3D chromatin interactions [ 82 ], and the Basenji model was employed for predicting cell-type-specific epigenetic and transcriptional profiles in large mammalian genomes [ 114 ]. For disease variants, the ExPecto model was used to predict tissue-specific transcriptional effects of mutations/functions [ 32 ], and the DeepWAS model was used to identify disease- or trait-associated SNPs [ 19 ]. Finally, in pharmacogenomics applications, CNNs were used to create the DrugCell model for drug response and synergy predictions [ 26 ]. Additionally, the DeepDrug3D model was developed for characterising and classifying 3D protein binding pockets [ 115 ].

CNN algorithms have also been combined with other algorithms to build efficient approaches. In epigenomics, CNN was combined with GRU to predict methylation states from single-cell data [ 83 ], while in gene expression and regulation, [ 74 ] combined CNN with MLP in the DECRES model to predict active enhancers and promoters across the human genome. In addition, [ 116 ] combined CNN and RNN algorithms in a DNA sequencing application to create the DAVI model, which identifies variants from NGS reads.

Recurrent neural networks (RNNs) are ANNs with recurrent layers whose feedback connections allow the state to be updated from both past and current inputs. They are distinguished by the cyclic connections between units within the recurrent layer and are designed for sequential data [ 117 , 118 ]. RNNs have regularly been used for tasks that involve learning from sequential data, such as language translation and speech recognition. However, they have not been widely applied to DNA sequence data, a type of data in which the order of the bases is crucial [ 119 ]. Maraziotis et al. [ 24 ] pioneered the use of RNNs in genomics, applying a recurrent neuro-fuzzy protocol to microarray data to infer complex causal relationships between genes by predicting time series of gene expression (Table 6 ).
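The feedback connection that defines an RNN can be shown with a single hidden unit whose new state depends on both the current input and its own previous state. The weights below are arbitrary values chosen for illustration.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, bias):
    """h_t = tanh(w_x * x_t + w_h * h_(t-1) + bias) for a single hidden unit."""
    return math.tanh(w_x * x + w_h * h_prev + bias)


def run_rnn(inputs, w_x=0.8, w_h=0.5, bias=0.0):
    """Return the hidden state after each time step, starting from h = 0."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, bias)  # feedback: h depends on its own past
        states.append(h)
    return states


# A single pulse at the first position, followed by silence.
states = run_rnn([1.0, 0.0, 0.0])
print([round(s, 4) for s in states])
```

The state stays non-zero after the input has returned to zero, which is the memory of earlier positions; its steady decay also hints at why plain RNNs struggle with long-range dependencies.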

In genomics, most RNNs are applied in combination with other algorithms, such as CNNs. For example, the DAVI model combines CNN and RNN algorithms to identify variants from NGS reads [ 116 ]. The FactorNet model, which is also based on both CNN and RNN algorithms, was designed to predict cell-type-specific transcription factor binding sites (TFBSs) [ 120 ]. CNN algorithms excel at capturing local DNA sequence patterns, whereas RNN derivatives, such as LSTM, are well suited to capturing long-distance dependencies within sequence data [ 119 ].

Long short-term memory networks (LSTMs) are recurrent networks whose cells use “gates” to handle tasks with long-term dependencies [ 118 ]. They are designed to avoid the long-term dependency problem of standard RNNs by learning such dependencies explicitly. The core LSTM unit consists of a cell, an input gate, an output gate and a forget gate. The cell stores values over arbitrary time intervals, whereas the gates control the flow of information into and out of the cell [ 121 ]. An early implementation of LSTM algorithms in genomics was the SPEID model, which combined LSTM and CNN for EPI prediction (Table 6 ; [ 18 ]). Park et al. [ 122 ] developed DeepMiRGene, which combines RNN and LSTM models, to predict miRNA precursors.
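The interplay of the cell and its gates can be written out for a single-unit LSTM. The sketch below follows the standard LSTM update equations; the parameter layout and the example values are assumptions chosen so that the forget gate is almost fully open and the input gate almost closed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM.

    params maps each of 'i', 'f', 'o' and 'g' to a (w_x, w_h, bias) triple.
    """
    def pre(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("i"))    # input gate: how much new information to write
    f = sigmoid(pre("f"))    # forget gate: how much of the old cell to keep
    o = sigmoid(pre("o"))    # output gate: how much of the cell to expose
    g = math.tanh(pre("g"))  # candidate value for the cell
    c = f * c_prev + i * g   # updated cell state
    h = o * math.tanh(c)     # updated hidden state
    return h, c


# Hypothetical parameters: keep the stored value, ignore the new input.
remember = {
    "i": (0.0, 0.0, -20.0),
    "f": (0.0, 0.0, 20.0),
    "o": (0.0, 0.0, 20.0),
    "g": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=remember)
print(round(c, 6), round(h, 6))
```

With these settings the cell carries its stored value of 0.7 through the step almost unchanged, which is the mechanism that lets LSTMs bridge long gaps in a sequence.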

In bidirectional long short-term memory networks (BLSTMs), two RNNs with two hidden layers (a forward and a backward layer) are trained in parallel in both time directions, so that context from both directions can be used, which standard RNNs cannot accomplish [ 118 ]. Quang et al. [ 123 ] presented the DanQ model, the first application in genomics, which combines CNN and BLSTM architectures to predict DNA function directly from sequence data (Table 6 ). Later, [ 124 ] presented DeepCLIP, which also uses CNN and BLSTM, to predict the effect of mutations on protein–RNA binding.
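The benefit of reading a sequence in both directions can be seen with a minimal example. For brevity, the sketch below uses a plain single-unit recurrent update in place of a full LSTM cell, which is a simplification; the weights are arbitrary.

```python
import math


def run_direction(inputs, w_x=1.0, w_h=0.5):
    """Run a single-unit recurrent layer over the inputs and return all states."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional(inputs):
    """Pair each position's forward state with its backward state."""
    forward = run_direction(inputs)
    backward = run_direction(inputs[::-1])[::-1]  # re-align to original positions
    return list(zip(forward, backward))


# The only signal sits at the last position.
pairs = bidirectional([0.0, 0.0, 1.0])
for position, (fwd, bwd) in enumerate(pairs):
    print(position, round(fwd, 4), round(bwd, 4))
```

At the first position the forward state is still zero, while the backward state already reflects the signal further along the sequence, so a downstream layer receives context from both sides.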

The gated recurrent unit (GRU) is a variant of the LSTM algorithm whose cell has only two gates: the update gate and the reset gate [ 118 ]. The update gate controls how much of the previous state is carried forward, while the reset gate controls how much of the previous state is used to compute the new candidate state [ 125 ]. GRUs were first applied in gene expression and regulation by [ 126 ], who presented the BiRen model, an architecture consisting of RNNs, CNNs and GRUs, to predict enhancers (Table 6 ). Subsequently, the DeepCpG model combined CNN and GRU frameworks to predict methylation states from single-cell data [ 83 ].
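The roles of the update gate and the reset gate can be written out for a single-unit GRU. The sketch below follows the standard GRU equations, with the convention that an update gate close to 1 keeps the previous state (some libraries use the opposite convention); the parameter layout and values are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x, h_prev, params):
    """One step of a single-unit GRU.

    params maps 'z' (update), 'r' (reset) and 'n' (candidate) to (w_x, w_h, bias).
    """
    wz_x, wz_h, bz = params["z"]
    wr_x, wr_h, br = params["r"]
    wn_x, wn_h, bn = params["n"]
    z = sigmoid(wz_x * x + wz_h * h_prev + bz)          # update gate
    r = sigmoid(wr_x * x + wr_h * h_prev + br)          # reset gate
    n = math.tanh(wn_x * x + wn_h * (r * h_prev) + bn)  # candidate state
    return z * h_prev + (1.0 - z) * n                   # blend old state and candidate


# Hypothetical parameter sets: one keeps the old state, one overwrites it.
keep = {"z": (0.0, 0.0, 20.0), "r": (0.0, 0.0, 0.0), "n": (1.0, 0.0, 0.0)}
overwrite = {"z": (0.0, 0.0, -20.0), "r": (0.0, 0.0, 0.0), "n": (1.0, 0.0, 0.0)}

print(round(gru_step(1.0, 0.3, keep), 6))
print(round(gru_step(1.0, 0.3, overwrite), 6))
```

Compared with the LSTM, the GRU has no separate cell state and one gate fewer, so it has fewer parameters per unit.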

Natural language processing (NLP) examines the use of computers to understand human languages in order to perform useful tasks [ 127 ]. In NLP, the “distributed representations” technique is used in several state-of-the-art DL models [ 128 ]. For example, the word2vec model is a successful NLP method that uses a distributed representation, known as “neural embedding” because the embedding function is usually expressed as a neural network with many parameters. Word embedding aims to learn a linear mapping that represents each word as a distinct vector in continuous space, which makes the representation amenable to backpropagation-based methods in neural networks [ 129 ]. For gene expression and regulation, Du et al. (2019) developed the Gene2vec model, a distributed representation of genes. It exploits the natural context of genes, namely their expression and co-expression patterns in GEO data. The embedded genes are used as a layer of a multilayer neural network, which predicts gene-to-gene interactions with an AUC score of 0.72. This is an interesting result because the only initial model input is the names of two genes. Thus, the distributed representation of genes carries rich information about gene function [ 130 ]. Another NLP application in the same field was presented by Zeng et al. (2018), who combined NLP with GBRT and introduced the EP2vec model to predict EPIs.
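A distributed representation turns each name into a vector, so that relatedness can be measured geometrically. The sketch below looks up three hypothetical genes in a made-up embedding table and compares them by cosine similarity; real embeddings would be learned from co-expression data and have many more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings; the gene names and values are invented.
embeddings = {
    "GENE_A": [0.9, 0.1, 0.0],
    "GENE_B": [0.8, 0.2, 0.1],
    "GENE_C": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity(gene_1, gene_2):
    """Similarity of two genes given only their names."""
    return cosine(embeddings[gene_1], embeddings[gene_2])


print(round(similarity("GENE_A", "GENE_B"), 3))
print(round(similarity("GENE_A", "GENE_C"), 3))
```

The only input is a pair of names; all information about the genes comes from the embedding table, which is what makes such representations reusable across tasks.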

Graph neural networks (GNNs) have emerged as an important deep learning method for handling the biological network datasets that are increasingly available in genomics [ 131 ]. The GNN was proposed by Gori et al. (2005) as a novel neural network model for graph-structured data [ 132 ]. Among the many applications of GNNs in analysing multi-omics data, the most salient are disease gene prediction, drug discovery, drug interaction networks, protein–protein interaction networks and biomedical imaging. GNNs are capable of modelling both molecular structure data [ 133 ] and biological network data [ 134 ].
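Many GNN variants share a common core: each node repeatedly updates its features from those of its neighbours. The sketch below performs one such step with mean aggregation and no learned weights, which is a deliberate simplification; the three-node interaction network is hypothetical.

```python
def message_passing_step(features, edges):
    """Replace each node's feature with the mean over itself and its neighbours."""
    neighbours = {node: [node] for node in features}  # include a self-loop
    for a, b in edges:                                # undirected edges
        neighbours[a].append(b)
        neighbours[b].append(a)
    return {node: sum(features[n] for n in members) / len(members)
            for node, members in neighbours.items()}


# Hypothetical chain of interacting proteins: P1 - P2 - P3.
features = {"P1": 1.0, "P2": 0.0, "P3": 0.0}
edges = [("P1", "P2"), ("P2", "P3")]

updated = message_passing_step(features, edges)
twice = message_passing_step(updated, edges)
print(updated)
print(twice)
```

After one step the signal from P1 has reached its direct neighbour P2; after two steps it has reached P3, showing how stacking layers widens the neighbourhood each node can see.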

Deep learning resources for genomics

We collected the most efficient user-friendly genomic resources developed based on deep learning architectures (Table 7 ). The adoption of various deep learning solutions and models is still limited, despite the enormous success of these tools in genomics and bioinformatics. One reason for this is the lack of deep learning-based published protocols to adapt to new, heterogeneous datasets requiring significant data engineering [ 135 ]. In genomics, high-throughput data (e.g. WGS, WES, RNA-seq, ChIP-seq, etc.) are utilised to train neural networks and have become typical for disease predictions or understanding regulatory genomics. Similarly, developing new DL models and testing current models on new datasets face great challenges due to the lack of inclusive, generalisable, practical deep learning libraries for biology [ 136 ]. In this respect, software frameworks and genomic packages are necessary to allow rapid progress in adopting a novel research question or hypothesis, combining original data or investigating using different neural network structures [ 135 ]. In order to facilitate the DL model implementation in genomics, the following software packages or libraries could become critical for genomic scientists and biomedical researchers.

Janggu is a deep learning Python library based on deep CNNs for genomic applications. It aims to facilitate data acquisition and model evaluation by supporting flexible prototyping of neural network models. The Janggu library provides three use cases: predicting transcription factor binding, using and improving published deep learning architectures and predicting normalised CAGE-tag counts of promoters. The library offers easy access to, and pre-processing of, data in standard file formats (e.g. FASTA, BAM, BigWig, BED and narrowPeak) and conversion to BigWig files [ 135 ].

Selene is a deep learning library based on PyTorch for training models on biological sequence data and developing model architectures. Selene supports the prediction of genetic variant effects and visualises the variant scores as a Manhattan plot. It also automatically generates training, validation and test splits from the given input dataset. Further, Selene automatically trains the model and can evaluate it on a test set, producing figures that display the performance of the model [ 137 ].

ExPecto is a variant prioritisation model that predicts gene expression levels from a broad (~ 40 kb) promoter-proximal sequence region. It relies on a CNN to convert the input sequences into epigenomic features. ExPecto can predict the effects of rare or previously unobserved variants because its architecture does not use any variant information during training. ExPecto processes VCF files and outputs CSV files [ 138 ].

Pysster is a Python library based on CNNs for training and classification on biological sequence data. Pysster provides automatic hyperparameter optimisation and options for visualising motifs along with their position and class enrichment information [ 139 ].

Kipoi (Greek for “gardens”; pronounced “kípi”) is a genomic repository for sharing and reusing trained genome-related models. Kipoi provides more than 2 K distinctly trained models from 22 different studies covering significant predictive genomic tasks. The prediction includes chromatin accessibility determination, transcription factor binding and alternative splicing from DNA sequences [ 136 ].

Implementing these deep learning libraries/packages for genomics requires access to computing power and familiarity with web-based resources (Table 7 ). Several major cloud-computing platforms offer on-demand GPU access in a user-friendly manner, including Google CloudML, IBM cloud, Vertex AI and Amazon EC2 [ 140 , 141 , 142 ]. These cloud-based machines require the user to configure and install an appropriate environment for general GPU programming. For users who wish to avoid such semi-manual setup, plug-and-play (PnP) platforms with GPU access, such as Google Colaboratory (Colab), are available. Google Colab is considered the simplest alternative; it is a Python-based notebook environment and provides free use of a K80 GPU for up to 12 continuous hours [ 143 , 144 ]. Links to the resources (packages/libraries and web platforms) for the application of deep learning in genomics are provided in Table 7 .

This manuscript catalogues different deep learning tools/software developed in different subareas of genomics to fulfil the predictive tasks of various genomic analyses. We discussed, in detail, the data types in different genomics assays so that readers could have primary knowledge of the basic requirements to develop deep learning-based prediction models using human genomics datasets. In the later part of the manuscript, different deep learning architectures were briefly introduced to genomic scientists in order to help them decide the deep learning network architecture for their specific data types and/or problems. We also briefly discussed the late application of the deep learning technique in genomics and its underlying causes and solutions. Towards the end of the manuscript, various computational resources, software packages or libraries and web-based computational platforms are provided to act as pointers for researchers to create their very first deep learning model utilising genomic datasets. In conclusion, this timely review holds the potential to assist genomic scientists in adopting state-of-the-art deep learning techniques for the exploration of genomic NGS datasets and analyses. This will certainly be beneficial for biomedicine and human genomics researchers.

Availability of data and materials

Not applicable.

Abbreviations

NGS: Next-generation sequencing
WGS: Whole genome sequencing
WES: Whole exome sequencing
SMS: Single-molecule sequencing
RNA-seq: RNA sequencing
ChIP-seq: Chromatin immunoprecipitation sequencing
PacBio: Pacific Biosciences
ONT: Oxford Nanopore Technology
MPRAs: Massively parallel reporter assays
GWAS: Genome-wide association study
PSI: Percent selected index
HGVS: Human Genome Variation Society
IMSGC: International Multiple Sclerosis Genetics Consortium
VUS: Variant of uncertain significance
CADD: Combined annotation dependent depletion
GATK: Genomic Analysis ToolKit
BAM: Binary alignment map
VCF: Variant call format
FASTA: Text-based format for either nucleotide sequences or amino acids
BED: Browser extensible data
CSV: Comma-separated values
CAGE: Cap analysis of gene expression
GEO: Gene expression omnibus
EPI: Enhancer–promoter interaction
TFBS: Transcription factor binding sites
DL: Deep learning
ML: Machine learning
DNN: Deep neural network
MLP: Multilayer perceptron
CNN: Convolutional neural networks
RNN: Recurrent neural network
LSTM: Long short-term memory network
BLSTM: Bidirectional long short-term memory network
ANN: Artificial neural network
FNN: Feedforward neural networks
NLP: Natural language processing
GRU: Gated recurrent unit
VGGNet: Visual geometry group networks
GBRT: Gradient boosted regression trees
LR: Linear regression
RF: Random forest
NB: Naive Bayes
DBN: Deep belief networks
SVR: Support vector regression
AUC: Area under the curve
AUPRC: Area under the precision–recall curve
AUROC: Area under the receiver operating characteristic

Auffray C, Imbeaud S, Roux-Rouquié M, Hood L. From functional genomics to systems biology: concepts and practices. C R Biol. 2003;326(10–11):879–92.

Article   CAS   PubMed   Google Scholar  

Goldfeder RL, Priest JR, Zook JM, Grove ME, Waggott D, Wheeler MT, et al. Medical implications of technical accuracy in genome sequencing. Genome Med. 2016;8(1):24.

Article   PubMed   PubMed Central   CAS   Google Scholar  

Goodwin S, McPherson JD, McCombie WR. Coming of age: Ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17(6):333–51.

Yue T, Wang H. Deep Learning for Genomics: A Concise Overview. 2018

Honoré B, Østergaard M, Vorum H. Functional genomics studied by proteomics. BioEssays. 2004;26(8):901–15.

Article   PubMed   CAS   Google Scholar  

Talukder A, Barham C, Li X, Hu H. Interpretation of deep learning in genomics and epigenomics. Brief Bioinform. 2020;2:447.

Google Scholar  

Fulco CP, Munschauer M, Anyoha R, Munson G, Grossman SR, Perez EM, et al. Systematic mapping of functional enhancer–promoter connections with CRISPR interference. Science (80-). 2016;354(6313):769–73.

Article   CAS   Google Scholar  

Kulasingam V, Pavlou MP, Diamandis EP. Integrating high-throughput technologies in the quest for effective biomarkers for ovarian cancer. Nat Rev Cancer. 2010;10(5):371–8.

Nariai N, Kolaczyk ED, Kasif S. Probabilistic protein function prediction from heterogeneous genome-wide data. PLoS One. 2007;2(3):e337.

Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of integrating data to uncover genotype–phenotype interactions. Nat Rev Genet. 2015;16(2):85–97.

Koumakis L. Deep learning models in genomics; are we there yet? Comput Struct Biotechnol J. 2020;18:1466–73.

Article   CAS   PubMed   PubMed Central   Google Scholar  

Cao C, Liu F, Tan H, Song D, Shu W, Li W, et al. Deep learning and its applications in biomedicine. Genom Proteom Bioinform. 2018;16(1):17–32.

Article   Google Scholar  

Telenti A, Lippert C, Chang PC, DePristo M. Deep learning of genomic variation and regulatory network data. Hum Mol Genet. 2018;27(R1):R63-71.

Kopp W, Monti R, Tamburrini A, Ohler U, Akalin A. Deep learning for genomics using Janggu. Nat Commun. 2020;11(1):3488.

Deep learning for genomics. Nat Genet. 2019;51(1):1–1.

Singh J, Hanson J, Paliwal K, Zhou Y. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat Commun. 2019;10(1):5407.

Hsieh T-C, Mensah MA, Pantel JT, Aguilar D, Bar O, Bayat A, et al. PEDIA: prioritization of exome data by image analysis. Genet Med. 2019;21(12):2807–14.

Article   PubMed   PubMed Central   Google Scholar  

Singh R, Lanchantin J, Robins G, Qi Y. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics. 2016;32(17):i639–48.

Arloth J, Eraslan G, Andlauer TFM, Martins J, Iurato S, Kühnel B, et al. DeepWAS: multivariate genotype-phenotype associations by directly integrating regulatory information using deep learning. PLOS Comput Biol. 2020;16(2):e1007616.

Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev. 1958;65(6):386–408.

Sarker IH. Machine learning: algorithms, real-world applications and research directions. SN Comput Sci. 2021;2(3):160.

Wang C, Tan XP, Tor SB, Lim CS. Machine learning in additive manufacturing: state-of-the-art and perspectives. Addit Manuf. 2020;36:101538.

Muzio G, O’Bray L, Borgwardt K. Biological network analysis with deep learning. Brief Bioinform. 2021;22(2):1515–30.

Article   PubMed   Google Scholar  

Maraziotis I, Dragomir A, Bezerianos A. Gene networks inference from expression data using a recurrent neuro-fuzzy approach. In: 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE; 2005. p. 4834–7.

LeCun Y. 1.1 Deep learning hardware: past, present, and future. In: 2019 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE; 2019. p. 12–9.

Kuenzi BM, Park J, Fong SH, Sanchez KS, Lee J, Kreisberg JF, et al. Predicting drug response and synergy using a deep learning model of human cancer cells. Cancer Cell. 2020;38(5):672-684.e6.

Sundaram L, Gao H, Padigepati SR, McRae JF, Li Y, Kosmicki JA, et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet. 2018;50(8):1161–70.

Lanchantin J, Singh R, Wang B, Qi Y. Deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. World Sci. 2017;3:254–65.

Albaradei S, Magana-Mora A, Thafar M, Uludag M, Bajic VB, Gojobori T, et al. Splice2Deep: an ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA. Gene X. 2020;5:100035.

CAS   PubMed   PubMed Central   Google Scholar  

Poplin R, Chang PC, Alexander D, Schwartz S, Colthurst T, Ku A, et al. A universal snp and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018;36(10):983.

Liu Q, Xia F, Yin Q, Jiang R. Chromatin accessibility prediction via a hybrid deep convolutional neural network. Bioinformatics. 2018;2:1147.

Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet. 2019;51(1):12–8.

Al-Stouhi S, Reddy CK. Transfer learning for class imbalance problems with inadequate data. Knowl Inf Syst. 2016;48(1):201–28.

Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020;21(1):6.

Handelman GS, Kok HK, Chandra RV, Razavi AH, Huang S, Brooks M, et al. Peering into the black box of artificial intelligence: evaluation metrics of machine learning methods. Am J Roentgenol. 2019;212(1):38–43.

England JR, Cheng PM. Artificial intelligence for medical image analysis: a guide for authors and reviewers. Am J Roentgenol. 2019;212(3):513–9.

Eraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet. 2019;20(7):389–403.

Pérez-Enciso M, Zingaretti LM. A guide for using deep learning for complex trait genomic prediction. Genes (Basel). 2019;10(7):12258.

Abnizova I, Boekhorst RT, Orlov YL. Computational errors and biases in short read next generation sequencing. J Proteom Bioinform. 2017;10(1):400089.

Ma X, Shao Y, Tian L, Flasch DA, Mulder HL, Edmonson MN, et al. Analysis of error profiles in deep next-generation sequencing data. Genome Biol. 2019;20(1):50.

Pfeiffer F, Gröber C, Blank M, Händler K, Beyer M, Schultze JL, et al. Systematic evaluation of error rates and causes in short samples in next-generation sequencing. Sci Rep. 2018;8(1):10950.

Horner DS, Pavesi G, Castrignano T, De Meo PD, Liuni S, Sammeth M, et al. Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief Bioinform. 2010;11(2):181–97.

McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.

Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. Science. 2012;7:4458.

Hwang S, Kim E, Lee I, Marcotte EM. Systematic comparison of variant calling pipelines using gold standard personal exome variants. Sci Rep. 2015;5(1):17875.

Kotlarz K, Mielczarek M, Suchocki T, Czech B, Guldbrandtsen B, Szyda J. The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines. J Appl Genet. 2020;61(4):607–16.

Kumaran M, Subramanian U, Devarajan B. Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data. BMC Bioinform. 2019;20(1):342.

Luo R, Sedlazeck FJ, Lam T, Schatz MC, Kong H, Genome H. Clairvoyante: a multi-task convolutional deep neural network for variant calling in single molecule sequencing. Science. 2018;3:7745.

Cai L, Chu C, Zhang X, Wu Y, Gao J. Concod: an effective integration framework of consensus-based calling deletions from next-generation sequencing data. Int J Data Min Bioinform. 2017;17(2):153.

Cai L, Wu Y, Gao J. DeepSV: accurate calling of genomic deletions from high-throughput sequencing data using deep convolutional neural network. BMC Bioinform. 2019;20(1):665.

Ravasio V, Ritelli M, Legati A, Giacopuzzi E. GARFIELD-NGS: genomic vARiants FIltering by dEep learning moDels in NGS. Bioinformatics. 2018;34(17):3038–40.

Singh A, Bhatia P. Intelli-NGS: intelligent NGS, a deep neural network-based artificial intelligence to delineate good and bad variant calls from IonTorrent sequencer data. bioRxiv. 2019;12:879403.

Müller H, Jimenez-Heredia R, Krolo A, Hirschmugl T, Dmytrus J, Boztug K, et al. VCF.Filter: interactive prioritization of disease-linked genetic variants from sequencing data. Nucleic Acids Res. 2017;45(W1):W567-72.

Eilbeck K, Quinlan A, Yandell M. Settling the score: variant prioritization and Mendelian disease. Nat Rev Genet. 2017;18(10):599–612.

Roy S, Coldren C, Karunamurthy A, Kip NS, Klee EW, Lincoln SE, et al. Standards and guidelines for validating next-generation sequencing bioinformatics pipelines. J Mol Diagn. 2018;20(1):4–27.

Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, et al. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7(4):248–9.

Ng PC. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31(13):3812–4.

Cooper GM. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15(7):901–13.

Boudellioua I, Kulmanov M, Schofield PN, Gkoutos GV, Hoehndorf R. DeepPVP: phenotype-based prioritization of causative variants using deep learning. BMC Bioinform. 2019;20(1):65.

Hoffman GE, Bendl J, Girdhar K, Schadt EE, Roussos P. Functional interpretation of genetic variants using deep learning predicts impact on chromatin accessibility and histone modification. Nucleic Acids Res. 2019;3:5589.

Tupler R, Perini G, Green MR. Expressing the human genome. Nature. 2001;409(6822):832–3.

Zrimec J, Börlin CS, Buric F, Muhammad AS, Chen R, Siewers V, et al. Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure. Nat Commun. 2020;11(1):6141.

Angerer P, Simon L, Tritschler S, Wolf FA, Fischer D, Theis FJ. Single cells make big data: new challenges and opportunities in transcriptomics. Curr Opin Syst Biol. 2017;4:85–91.

Falco MM, Peña-Chilet M, Loucera C, Hidalgo MR, Dopazo J. Mechanistic models of signaling pathways deconvolute the glioblastoma single-cell functional landscape. NAR Cancer. 2020;2(2):5589.

Poulin J-F, Tasic B, Hjerling-Leffler J, Trimarchi JM, Awatramani R. Disentangling neural cell diversity using single-cell transcriptomics. Nat Neurosci. 2016;19(9):1131–41.

Darmanis S, Sloan SA, Zhang Y, Enge M, Caneda C, Shuer LM, et al. A survey of human brain transcriptome diversity at the single cell level. Proc Natl Acad Sci. 2015;112(23):7285–90.

Gundogdu P, Loucera C, Alamo-Alvarez I, Dopazo J, Nepomuceno I. Integrating pathway knowledge with deep neural networks to reduce the dimensionality in single-cell RNA-seq data. BioData Min. 2022;15(1):1.

Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176(3):535-548.e24.

Bogard N, Linder J, Rosenberg AB, Seelig G. A deep neural network for predicting and engineering alternative polyadenylation. Cell. 2019;71:9886.

Agarwal V, Shendure J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 2020;31(7):107663.

Li Y, Shi W, Wasserman WW. Genome-wide prediction of cis-regulatory regions using supervised deep learning methods. BMC Bioinform. 2018;19(1):202.

Li X, Wang K, Lyu Y, Pan H, Zhang J, Stambolian D, et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat Commun. 2020;11(1):2338.

Torroja C, Sanchez-Cabo F. Digitaldlsorter: deep-learning on scRNA-seq to deconvolute gene expression data. Front Genet. 2019;10:77458.

Movva R, Greenside P, Marinov GK, Nair S, Shrikumar A, Kundaje A. Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PLoS One. 2019;71:466689.

Zhang Z, Pan Z, Ying Y, Xie Z, Adhikari S, Phillips J, et al. Deep-learning augmented RNA-seq analysis of transcript splicing. Nat Methods. 2019;16(4):307–10.

Bretschneider H, Gandhi S, Deshwar AG, Zuberi K, Frey BJ. COSSMO: predicting competitive alternative splice site selection using deep learning. In: Bioinformatics. 2018.

Lo Bosco G, Rizzo R, Fiannaca A, La Rosa M, Urso A. A deep learning model for epigenomic studies. In: 2016 12th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS). IEEE; 2016. p. 688–92.

Cazaly E, Saad J, Wang W, Heckman C, Ollikainen M, Tang J. Making sense of the epigenome using data integration approaches. Front Pharmacol. 2019;19:10.

Li W, Wong WH, Jiang R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res. 2019;47(10):e60–e60.

Angermueller C, Lee HJ, Reik W, Stegle O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol. 2017;18(1):67.

Yin Q, Wu M, Liu Q, Lv H, Jiang R. DeepHistone: a deep learning approach to predicting histone modifications. BMC Genomics. 2019;20(2):193.

Baron V, Adamson ED, Calogero A, Ragona G, Mercola D. The transcription factor Egr1 is a direct regulator of multiple tumor suppressors including TGFβ1, PTEN, p53, and fibronectin. Cancer Gene Ther. 2006;13(2):115–24.

Baptista D, Ferreira PG, Rocha M. Deep learning for drug response prediction in cancer. Brief Bioinform. 2021;22(1):360–79.

Lesko LJ, Woodcock J. Translation of pharmacogenomics and pharmacogenetics: a regulatory perspective. Nat Rev Drug Discov. 2004;3(9):763–9.

Roden DM. Pharmacogenomics: challenges and opportunities. Ann Intern Med. 2006;145(10):749.

Pang K, Wan Y-W, Choi WT, Donehower LA, Sun J, Pant D, et al. Combinatorial therapy discovery using mixed integer linear programming. Bioinformatics. 2014;30(10):1456–63.

Day D, Siu LL. Approaches to modernize the combination drug development paradigm. Genome Med. 2016;8(1):115.

White RE. High-throughput screening in drug metabolism and pharmacokinetic support of drug discovery. Annu Rev Pharmacol Toxicol. 2000;40(1):133–57.

Feala JD, Cortes J, Duxbury PM, Piermarocchi C, McCulloch AD, Paternostro G. Systems approaches and algorithms for discovery of combinatorial therapies. Wiley Interdiscip Rev Syst Biol Med. 2010;2(2):181–93.

Sun X, Bao J, You Z, Chen X, Cui J. Modeling of signaling crosstalk-mediated drug resistance and its implications on drug combination. Oncotarget. 2016;7(39):63995–4006.

Goswami CP, Cheng L, Alexander P, Singal A, Li L. A new drug combinatory effect prediction algorithm on the cancer cell based on gene expression and dose-response curve. CPT Pharmacometrics Syst Pharmacol. 2015;4(2):80–90.

Preuer K, Lewis RPI, Hochreiter S, Bender A, Bulusu KC, Klambauer G. DeepSynergy: predicting anti-cancer drug synergy with deep learning. Bioinformatics. 2018;34(9):1538–46.

Kalamara A, Tobalina L, Saez-Rodriguez J. How to find the right drug for each patient? advances and challenges in pharmacogenomics. Curr Opin Syst Biol. 2018;10:53–62.

Chiu Y-C, Chen H-IH, Zhang T, Zhang S, Gorthi A, Wang L-J, et al. Predicting drug response of tumors from integrated genomic profiles by deep neural networks. BMC Med Genom. 2019;12(51):18.

Wang Y, Li F, Bharathwaj M, Rosas NC, Leier A, Akutsu T, et al. DeepBL: a deep learning-based approach for in silico discovery of beta-lactamases. Brief Bioinform. 2020;7:8859.

Yu D, Deng L. Deep learning and its applications to signal and information processing exploratory DSP. IEEE Signal Process Mag. 2011;28(1):145–54.

Fukushima K, Miyake S. Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Visual Pattern Recognition. In 1982. p. 267–85.

Hinton GE. Reducing the dimensionality of data with neural networks. Science (80-). 2006;313(5786):504–7.

Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006;18(7):1527–54.

Shi L, Wang Z. Computational strategies for scalable genomics analysis. Genes (Basel). 2019;10(12):1–8.

Nelson D, Wang J. Introduction to artificial neural systems. Neurocomputing. 1992;4(6):328–30.

Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

Zell A. Simulation Neuronaler Netze. London: Addison-Wesley; 1994. p. 73.

Zeng W, Wu M, Jiang R. Prediction of enhancer-promoter interactions via natural language processing. BMC Genom. 2018;19(S2):84.

Indolia S, Goswami AK, Mishra SP, Asopa P. Conceptual understanding of convolutional neural network-a deep learning approach. Procedia Comput Sci. 2018;132:679–88.

Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, et al. Recent advances in convolutional. Neural Netw. 2015;5:71143.

Rawat W, Wang Z. Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput. 2017;29(9):2352–449.

Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33(8):831–8.

Zeng W, Wang Y, Jiang R. Integrating distal and proximal information to predict gene expression via a densely connected convolutional neural network. Bioinformatics. 2019;6:7110.

Lysenkov V. Introducing deep learning-based methods into the variant calling analysis pipeline. Science. 2019;6:7789.

Kelley DR, Reshef YA, Bileschi M, Belanger D, Mclean CY, Snoek J. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Science. 2018;71:739–50.

Pu L, Govindaraj RG, Lemoine JM, Wu H, Brylinski M. DeepDrug3D: classification of ligand-binding pockets in proteins with a convolutional neural network. PLOS Comput Biol. 2019;15(2):e1006718.

Gupta G, Saini S. DAVI: deep learning based tool for alignment and single nucleotide variant identification. Science. 2019;2:1–27.

CAS   Google Scholar  

Marhon SA, Cameron CJF, Kremer SC. Recurrent Neural Networks. In 2013. p. 29–65.

Yu Y, Si X, Hu C, Zhang J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019;31(7):1235–70.

Trieu T, Martinez-Fundichely A, Khurana E. DeepMILO: a deep learning approach to predict the impact of non-coding sequence variants on 3D chromatin structure. Genome Biol. 2020;21(1):79.

Quang D, Xie X. FactorNet: A deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods. 2019;166:40–7.

Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.

Park S, Min S, Choi H-S, Yoon S. Deep Recurrent Neural Network-Based Identification of Precursor microRNAs. In: Guyon I, Luxburg U V, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2017.

Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016;44(11):e107–e107.

Grønning AGB, Doktor TK, Larsen SJ, Petersen USS, Holm LL, Bruun GH, et al. DeepCLIP: predicting the effect of mutations on protein–RNA binding with deep learning. Nucleic Acids Res. 2020;22:7449.

Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. Science. 2015;6:7789.

Yang B, Liu F, Ren C, Ouyang Z, Xie Z, Bo X, et al. BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone. Bioinformatics. 2017;33(13):1930–6.

Deng L, Liu Y. Deep Learning in Natural Language Processing. Singapore: Springer; 2018.

Book   Google Scholar  

Schuler GD, Epstein JA, Ohkawa H, Kans JA. [10] Entrez: Molecular biology database and retrieval system. In 1996. p. 141–62.

Mikolov T, Chen K, Corrado G, Dean J. Efficient Estimation of Word Representations in Vector Space. 2013;

Du J, Jia P, Dai Y, Tao C, Zhao Z, Zhi D. Gene2vec: distributed representation of genes based on co-expression. BMC Genom. 2019;20(1):82.

Zhang X-M, Liang L, Liu L, Tang M-J. Graph neural networks and their current applications in bioinformatics. Front Genet. 2021;12:4799.

Gori M, Monfardini G, Scarselli F. A new model for learning in graph domains. In: Proceedings 2005 IEEE International Joint Conference on Neural Networks, 2005. IEEE; p. 729–34.

Kwon Y, Yoo J, Choi Y-S, Son W-J, Lee D, Kang S. Efficient learning of non-autoregressive graph variational autoencoders for molecular graph generation. J Cheminform. 2019;11(1):70.

Barabási A-L, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011;12(1):56–68.

Avsec Ž, Kreuzhuber R, Israeli J, Xu N, Cheng J, Shrikumar A, et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat Biotechnol. 2019;37(6):592–600.

Chen KM, Cofer EM, Zhou J, Troyanskaya OG. Selene: a PyTorch-based deep learning library for sequence data. Nat Methods. 2019;16(4):315–8.

Zhou J, Theesfeld CL, Yao K, Chen KM, Wong AK, Troyanskaya OG. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet. 2018;50(8):1171–9.

Budach S, Marsico A. pysster: classification of biological sequences by learning sequence and structure motifs with convolutional neural networks. Bioinformatics. 2018;34(17):3035–7.

Neloy AA, Alam S, Bindu RA, Moni NJ. Machine Learning based Health Prediction System using IBM Cloud as PaaS. In: 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI). IEEE; 2019. p. 444–50.

Ciaburro G, Ayyadevara VK, Perrier A. Hands-On Machine Learning on Google Cloud Platform: Implementing smart and efficient analytics using Cloud ML Engine. Packt Publishing; 2018. 500 p.

Peng L, Peng M, Liao B, Huang G, Li W, Xie D. The advances and challenges of deep learning application in biological big data processing. Curr Bioinform. 2018;13(4):352–9.

Carneiro T, Da Medeiros NRV, Nepomuceno T, Bian G-B, De Albuquerque VHC, Filho PPR. Performance analysis of google colaboratory as a tool for accelerating deep learning applications. IEEE Access. 2018;6:61677–85.

Bisong E. Google Colaboratory. In: Building Machine Learning and Deep Learning Models on Google Cloud Platform. Berkeley: Apress; 2019. p. 59–64.

Luo R, Sedlazeck FJ, Lam TW, Schatz MC. A multi-task convolutional deep neural network for variant calling in single molecule sequencing. Nat Commun. 2019;10(1):1–11.

Ravasio V, Ritelli M, Legati A, Giacopuzzi E. GARFIELD-NGS: genomic vARiants fIltering by dEep learning moDels in NGS. Bioinformatics. 2018;34(17):3038–40.

Singh A, Bhatia P. Intelli-NGS: Intelligent NGS, a deep neural network-based artificial intelligence to delineate good and bad variant calls from IonTorrent sequencer data. bioRxiv. 2019;2019:879403.

Gurovich Y, Hanani Y, Bar O, Nadav G, Fleischer N, Gelbman D, et al. Identifying facial phenotypes of genetic disorders using deep learning. Nat Med. 2019;25(1):60–4.

Park S, Min S, Choi H, Yoon S. deepMiRGene: deep neural network based precursor microRNA prediction. Science. 2016;71:89968.

Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26(7):990–9.

Singh S, Yang Y, Póczos B, Ma J. Predicting enhancer-promoter interaction from genomic sequence with deep neural networks. Quant Biol. 2019;7(2):122–37.

Chen Y, Li Y, Narayan R, Subramanian A, Xie X. Gene expression inference with deep learning. Bioinformatics. 2016;32(12):1832–9.

Zeng W, Wang Y, Jiang R. Integrating distal and proximal information to predict gene expression via a densely connected convolutional neural network. Bioinformatics. 2019;2:7889.

Kalkatawi M, Magana-Mora A, Jankovic B, Bajic VB. DeepGSR: an optimized deep-learning structure for the recognition of genomic signals and regions. Bioinformatics. 2019;35(7):1125–32.

Zuallaert J, Godin F, Kim M, Soete A, Saeys Y, De Neve W. SpliceRover: interpretable convolutional neural networks for improved splice site prediction. Bioinformatics. 2018;34(24):4180–8.

Paggi JM, Bejerano G. A sequence-based, deep learning model accurately predicts RNA splicing branchpoints. RNA. 2018;24(12):1647–58.

Almagro AJJ, Sønderby CK, Sønderby SK, Nielsen H, Winther O. DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics. 2017;33(21):3387–95.

Grønning AGB, Doktor TK, Larsen SJ, Petersen USS, Holm LL, Bruun GH, et al. DeepCLIP: predicting the effect of mutations on protein–RNA binding with deep learning. Nucleic Acids Res. 2020;5:9956.

Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods. 2015;12(10):931–4.

Lanchantin J, Singh R, Lin Z, Qi Y. Deep Motif: visualizing genomic sequence classifications. Science. 2016;78:1–5.

Xie L, He S, Song X, Bo X, Zhang Z. Deep learning-based transcriptome data classification for drug-target interaction prediction. BMC Genom. 2018;19(S7):667.

Kohut K, Limb S, Crawford G. The changing role of the genetic counsellor in the genomics Era. Curr Genet Med Rep. 2019;7(2):75–84.

Frank H. Guenther. Neural Networks: Biological Models and Applications. In: Smel-ser NJ, Baltes PB editors, editor. Oxford: International Encyclopedia of the Social & Behavioral Sciences; 2001. p. 10534–7.

Eskiizmililer S. An intelligent Karyotyping architecture based on Artificial Neural Networks and features obtained by automated image analysis. 1993.

Catic A, Gurbeta L, Kurtovic-Kozaric A, Mehmedbasic S, Badnjevic A. Application of neural networks for classification of patau, edwards, down, turner and klinefelter syndrome based on first trimester maternal serum screening data, ultrasonographic findings and patient demographics. BMC Med Genom. 2018;11(1):19.

Fukushima K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern. 1980;36(4):193–202.

Sakellaropoulos T, Vougas K, Narang S, Koinis F, Kotsinas A, Polyzos A, et al. A deep learning framework for predicting response to therapy in cancer. Cell Rep. 2019;29(11):3367-3373.e4.

Kalinin AA, Higgins GA, Reamaroon N, Soroushmehr S, Allyn-Feuer A, Dinov ID, et al. Deep learning in pharmacogenomics: from gene regulation to patient stratification. Pharmacogenomics. 2018;19(7):629–50.

Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–6.

Shen Z, Bao W, Huang D-S. Recurrent neural network for predicting transcription factor binding sites. Sci Rep. 2018;8(1):15270.

Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process. 1997;45(11):2673–81.

Almagro Armenteros JJ, Sønderby CK, Sønderby SK, Nielsen H, Winther O. DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics. 2017;33(21):3387–95.

Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Science. 2014;7:44598.

Download references

Acknowledgements

We duly acknowledge Dr. Mohamed Aly Hussain for his motivation and useful discussion regarding the inception of this review article. We also appreciate Dr. Lamya Alomair for her support during the development of this manuscript.

This study is not funded by any funding source.

Author information

Authors and affiliations.

Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia

Wardah S. Alharbi & Mamoon Rashid

Contributions

WA and MR conceptualised this study. WA collected the data and performed investigation. MR supervised this study. WA and MR wrote original draft. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mamoon Rashid.

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Alharbi, W.S., Rashid, M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 16, 26 (2022). https://doi.org/10.1186/s40246-022-00396-x

Received: 24 November 2021

Accepted: 12 July 2022

Published: 25 July 2022

DOI: https://doi.org/10.1186/s40246-022-00396-x

  • Human genomics
  • Deep learning applications
  • Gene expression
  • Variant calling

Human Genomics

ISSN: 1479-7364

research paper on dna sequencing

Nanopore-based DNA long-read sequencing analysis of the aged human brain

  • PMID: 38370753
  • PMCID: PMC10871260
  • DOI: 10.1101/2024.02.01.578450

Aging disrupts cellular processes such as DNA repair and epigenetic control, leading to a gradual buildup of genomic alterations that can have detrimental effects in post-mitotic cells. Genomic alterations in regions of the genome that are rich in repetitive sequences, often termed "dark loci," are difficult to resolve using traditional sequencing approaches. New long-read technologies offer promising avenues for exploration of previously inaccessible regions of the genome. Using nanopore-based long-read whole-genome sequencing of DNA extracted from 18 aged human brains, we identify previously unreported structural variants and methylation patterns within repetitive DNA, focusing on transposable elements ("jumping genes") as crucial sources of variation, particularly in dark loci. Our analyses reveal potential somatic insertion variants and provide DNA methylation frequencies for many retrotransposon families. We further demonstrate the utility of this technology for the study of these challenging genomic regions in brains affected by Alzheimer's disease and identify significant differences in DNA methylation in pathologically normal brains versus those affected by Alzheimer's disease. Highlighting the power of this approach, we discover specific polymorphic retrotransposons with altered DNA methylation patterns. These retrotransposon loci have the potential to contribute to pathology, warranting further investigation in Alzheimer's disease research. Taken together, our study provides the first long-read DNA sequencing-based analysis of retrotransposon sequences, structural variants, and DNA methylation in the aging brain affected by Alzheimer's disease neuropathology.

  • 25 March 2020

How cancer genomics is transforming diagnosis and treatment

  • Bianca Nogrady

Bianca Nogrady is a freelance science writer in Sydney, Australia.

DNA sequencing allows oncologists to characterize tumours on the basis of genetic mutations. Credit: KTSDESIGN/SPL

When cancer was first described by the ancient Greek physician Hippocrates, he identified just two forms: the non-ulcer-forming carcinos and the ulcer-forming carcinoma. In the late nineteenth century, physicians found, with the help of the microscope, that cancer had multiple cellular forms.

Nature 579, S10-S11 (2020)

doi: https://doi.org/10.1038/d41586-020-00845-4

This article is part of Nature Outlook: Cancer diagnosis, an editorially independent supplement produced with the financial support of third parties.

IMAGES

  1. (PDF) DNA Sequencing Basics and its Applications

  2. (PDF) DNA Sequencing: Methods, Strategies and Protocols in Molecular

  3. DNA Sequencing Overview

  4. molecular_dna_sequencing_analysis

  5. DNA Sequencing: Definition, Importance, Methods and More

  6. [PDF] Trends in Next-Generation Sequencing and a New Era for Whole

VIDEO

  1. DNA Sequencing

  2. DNA Sequencing

  3. DNA sequencing

  4. DNA SEQUENCING || SANGER COULSON METHOD || BIOTECHNOLOGY || STB || NMDCAT || AKU -2024 || NUMS ||

  5. DNA analysis in parentage testing

  6. Introduction to DNA and DNA Sequencing

COMMENTS

  1. The sequence of sequencers: The history of sequencing DNA

    We review the drastic changes to DNA sequencing technology over the last 50 years. First-generation methods enabled sequencing of clonal DNA populations. The second generation massively increased throughput by parallelizing many reactions. Third-generation methods allow direct sequencing of single DNA molecules. ...

  2. Human Molecular Genetics and Genomics

    The focus of genomics research has recently moved beyond analyzing DNA variation to studying patterns of gene expression in individual cells, a step that has been driven by new methods for...

  3. PDF DNA sequencing at 40: past, present and future

    DNA sequencing has been extensively and creatively repurposed, including as a 'counter' for a vast range of molecular phenomena. We predict that in the long view of history, the impact of DNA sequencing will be on a par with that of the microscope.

  4. DNA sequencing at 40: past, present and future

    Figure 1: DNA sequencing technologies. Schematic examples of first, second and third generation sequencing are shown. Second generation sequencing is also referred to as next-generation...

  5. DNA sequencing

    DNA sequencing is any chemical, enzymatic or technological procedure for determining the linear order of nucleotide bases in DNA. Sanger sequencing by replicative synthesis in the presence of...

  6. Dna sequencing

    DNA sequencing articles within Nature Reviews Genetics Featured Review Article | 18 January 2024 The expanding diagnostic toolbox for rare genetic diseases Genomic technologies have greatly...

  7. Applications of DNA Sequencing Technologies for Current Research

    DNA sequencing techniques have transformed biomedical research. Sequencing techniques with enhanced sensitivity and throughput are much required for the molecular and genomics study (Diaz-Sanchez et al., 2013; Buermans & den Dunnen, 2014; Bansal et al., 2018). Four building blocks (A, T, G, C) known as the nitrogenous base are involved in DNA synthesis, and their orderly arrangement to make a ...

  8. DNA sequencing

    Data analysis of 2nd generation sequencing results has three major components: (1) Base Calling, (2) Alignment and (3) Variant Calling. For most systems the base calling is closely linked to the sequencing system and is done using software that is provided by the supplier of the sequencing device.

  9. The complete sequence of a human genome

    The GRCh38 reference assembly contains 151 mega-base pairs (Mbp) of unknown sequence distributed throughout the genome, including pericentromeric and subtelomeric regions, recent segmental duplications, ampliconic gene arrays, and ribosomal DNA (rDNA) arrays, all of which are necessary for fundamental cellular processes. Some of the largest reference gaps include human satellite (HSat ...

  10. DNA sequencing

    Nucleic acid sequencing is the mainstay of biological research. There are several generations of DNA sequencing technologies that can be well characterized through their nature and the kind of output they provide. Dideoxy terminator sequencing developed by Sanger dominated for 30 years and was the workhorse used for the Human Genome Project. In 2005 the first 2nd generation sequencer was ...

  11. (PDF) DNA Sequencing: Methods and Applications

    DNA Sequencing: Methods and Applications October 2013 In book: Advances in Biotechnology (pp.11-21) Chapter: DNA Sequencing: Methods and Applications Authors: Satpal Singh Singh Bisht Abstract...

  12. Nanopore sequencing technology, bioinformatics and applications

    Fig. 1: Principle of nanopore sequencing. A MinION flow cell contains 512 channels with 4 nanopores in each channel, for a total of 2,048 nanopores used to sequence DNA or RNA. The wells are...

  13. Library preparation for next generation sequencing: A review of

    1. Introduction. DNA sequencing technology has evolved rapidly over the last few decades, from the discovery of the double helix DNA shape (Lander et al., 2001; Watson and Crick, 1953), to the complete sequencing of a human genome and, most recently to a continuously widening range of applications in research, agriculture and public health (Adams et al., 2009; Barzon et al., 2011; Bonnefond et ...

  14. (PDF) DNA Sequencing

    The paper gives review of current DNA sequencing algorithms and techniques as well as next-generation of DNA sequencing. Since the DNA sequencing field is changing rapidly the information...

  15. Classification of DNA Sequence Using Machine Learning

    The classification of gene sequences into existing categories is utilized in genomic research to discover the functions of novel proteins. As a result, it is critical to identify and categorize such genes. We employ ML approaches to distinguish between infected and normal genes using classification methods.

  16. Super-speedy sequencing puts genomic diagnosis in the fast lane

    Illumina was, and is, the market leader for short-read sequencing, a process that produces billions of 100-200-nucleotide 'reads' of DNA sequence, which can then be computationally ...

  17. A brief review on DNA storage, compression, and digitalization

    DNA Storage medium Compression Digital information and representation 1. Introduction Nowadays, the advance in technology of several areas of sciences such as Engineering and Biology has allowed some scientific disciplines to join efforts and produce efficient models that use a mimic of the proper nature and its characteristics.

  18. A review of deep learning applications in human genomics using next

    Multiple genomic disciplines (e.g. variant calling and annotation, disease variant prediction, gene expression and regulation, epigenomics and pharmacogenomics) take advantage of generating high-throughput data and utilising the power of deep learning algorithms for sophisticated predictions (Fig. 2). The modern evolution of DNA/RNA sequencing technologies and machine learning algorithms ...

  19. Nanopore-based DNA long-read sequencing analysis of the aged ...

    These retrotransposon loci have the potential to contribute to pathology, warranting further investigation in Alzheimer's disease research. Taken together, our study provides the first long-read DNA sequencing-based analysis of retrotransposon sequences, structural variants, and DNA methylation in the aging brain affected with Alzheimer's ...

  20. How cancer genomics is transforming diagnosis and treatment

    DNA sequencing allows oncologists to characterize tumours on the basis of genetic mutations. Credit: KTSDESIGN/SPL When cancer was first described by the ancient Greek physician Hippocrates, he...