What Is the Difference Between Single End Read and Double Ended Read

Cast your minds back a few years..

Plenty of success stories with microarrays

Why do sequencing?

Microarrays vs sequencing

Probe design issues with microarrays
- 'Dorian Gray issue' http://www.biomedcentral.com/1471-2105/5/111
- '…mappings are frozen, as a Dorian Gray-like syndrome: the apparent eternal youth of the mapping does not reflect that somewhere the 'picture of it' decays'
Sequencing data are 'future proof'
- if a new genome version comes along, just re-align the data!
- can grab published data from public repositories and re-align to your own choice of genome / transcripts and aligner
Limited number of novel findings from microarrays
- can't discover what you're not looking for!
Genome coverage
- some areas of genome are problematic to design probes for
Fusion genes, re-arrangements and complex events
- not possible with microarray technology
Maturity of analysis techniques
- on the other hand, analysis methods and workflows for microarrays are well-established
- until recently…

What did we learn from arrays?

Experimental Design; despite this fancy new technology, if we don't design the experiments properly we won't get meaningful conclusions
Quality assessment; Yes, NGS experiments can still go wrong!
Normalisation; NGS data come with their own set of biases and errors that need to be accounted for
Reproducibility; this is a good thing
Plenty of tools and workflows were established.
Don't forget about arrays; the data are all out there somewhere waiting to be discovered and explored
- e.g. repositive.io platform for browsing multiple repositories for data on a particular condition

Illumina sequencing overview*

This video gives an overview of the 'sequencing-by-synthesis' approach used by Illumina. Other companies will have different techniques, but Illumina is probably the most popular sequencing technology out there. For most of what we will discuss, it won't actually matter how your samples were sequenced.

* Other sequencing technologies are available

Illumina sequencing

Images from:- http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

A library is prepared by breaking the DNA of interest into shorter, single-strand fragments. Adapters are added to each end of the fragments. A flow-cell has already been prepared to have a "lawn" of sequences that are complementary to the adapters.

Bridge amplification forms clusters of identical fragments on the flow-cell surface. This is required as we're going to be taking images of the flow-cell and need many copies of each fragment so that we get a decent signal. However, as we'll see later on, this can introduce errors.

The construction of a sequencing read involves adding a single, terminated DNA base (each given a distinct fluorescent label) one at a time, simultaneously across the whole flow-cell, and taking an image.

So, we try adding an "A" with a red label and take an image

and then a "T" with a green label, and take another image

Analysis of the images can determine which base was successfully incorporated into each fragment.

The process of adding each base repeats for the next cycle, and so on for "N" cycles, e.g. 100 times for a 100-base sequence.

Although this procedure is highly optimised and refined, the sheer number of reactions being performed means that errors are inevitable. The signal from a particular cluster could be affected by interference from its neighbours

Or sometimes we get a bit over-excited and add too many bases

Therefore the identification of bases comes with some degree of uncertainty which we must capture. This uncertainty can be incorporated into our analysis, as we will see later.

Image processing

Sequencing produces high-resolution .TIFF images; not unlike microarray data
100 tiles per lane, 8 lanes per flow cell, 100 cycles
4 images (A,G,C,T) per tile per cycle = 320,000 images
Each .TIFF image ~ 7Mb = 2,240,000 Mb of data (2.24TB)
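
The arithmetic behind these figures can be checked at the command line. This is only a quick sketch using the numbers quoted above; the variable names are ours, and 1 TB is treated as 1,000,000 Mb, as in the text.

```shell
# Figures quoted above: tiles per lane, lanes per flow cell, cycles, images per tile per cycle
tiles=100; lanes=8; cycles=100; channels=4
mb_per_image=7   # approximate size of one .TIFF image

images=$((tiles * lanes * cycles * channels))
total_mb=$((images * mb_per_image))

echo "images:   $images"
echo "total Mb: $total_mb"
# Treating 1 TB as 1,000,000 Mb, as the text does
awk -v mb="$total_mb" 'BEGIN { printf "total TB: %.2f\n", mb / 1000000 }'
```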

Base-calling

"Bustard"

"Uses cluster intensities and noise estimate to output the sequence of bases read from each cluster, along with a confidence level for each base."
- http://openwetware.org/wiki/BioMicroCenter:IlluminaDataPipeline
You will never have to do this
- In fact, the TIFF images are deleted by the instrument on-the-fly

Raw reads

The most basic file type you will see is probably going to be fastq
- Data in public repositories (e.g. Short Read Archive, GEO) tend to be in this format
This represents all sequences created after imaging process
- No idea at this stage whether the sequences will align or not
No standard file extension. .fq, .fastq, .sequence.txt
Essentially they are text files
- Can be manipulated with standard unix tools; e.g. cat, head, grep, more, less
They can be compressed and appear as .fq.gz
Same format regardless of sequencing protocol (i.e. RNA-seq, ChIP-seq, DNA-seq etc)
Each sequence is described over four lines
For paired-end data you get two files
- they should have the same number of lines
- the sequences should be in the same order
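
To make the four-line layout concrete, here is a minimal sketch that labels each line of a record: the read name (starting with @), the bases, a separator line starting with +, and one quality character per base. The record and the function name label_fastq_lines are made up for illustration, and the sketch assumes each sequence sits on a single line.

```shell
# A single made-up record, written to a temporary file
toy=$(mktemp)
printf '%s\n' '@toy_read_1/1' 'GATTACA' '+' 'IIIIII#' > "$toy"

# Label each line by its position within the four-line record
label_fastq_lines() {
    awk '{
        pos = NR % 4
        if (pos == 1)      label = "name"
        else if (pos == 2) label = "bases"
        else if (pos == 3) label = "separator"
        else               label = "qualities"
        print label ": " $0
    }' "$1"
}

label_fastq_lines "$toy"
```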

We don't need any special software to view these, but bear in mind there can be ~250 million reads (sequences) per Hi-Seq lane, so using Word or Notepad is probably not a great idea.

An example .fastq file has been provided for you in the folder /data/test. If you are curious how this fastq file was generated, you can see the Appendix for details.

We can launch the CRUK Docker from the Desktop and navigate to the directory containing the files using the cd command

You don't need to type the full path. Start typing and use the Tab key to auto-complete

              cd /data/test

If we wanted to check the directory we are currently looking at, we can use the pwd command.

              pwd

The ls command will list all the files in this directory

              ls

Assuming the files are present, we can print the first few lines using the standard unix command head

We set the argument -n 12 to print the first 12 lines of the file sample.fq1

              head -n 12 sample.fq1

The unix command wc can count the number of lines in a file with the option -l

              wc -l sample.fq1
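
Because each read takes four lines, dividing the line count by four gives the number of reads, and for paired-end data the two files can be compared in the same way. The sketch below uses two tiny made-up files rather than the course data; the helper names count_reads and read_names are ours, and it assumes read names end in /1 and /2.

```shell
# Build two tiny, made-up paired-end files in a temporary directory
tmpdir=$(mktemp -d)
printf '%s\n' '@read1/1' 'ACGT' '+' 'IIII' '@read2/1' 'GGCA' '+' 'II#I' > "$tmpdir/toy.fq1"
printf '%s\n' '@read1/2' 'TTGA' '+' 'IIII' '@read2/2' 'CCAT' '+' 'IIII' > "$tmpdir/toy.fq2"

# Each record is four lines, so reads = lines / 4
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Read names are on lines 1, 5, 9, ...; strip the trailing /1 or /2 before comparing
read_names() {
    awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' "$1"
}

n1=$(count_reads "$tmpdir/toy.fq1")
n2=$(count_reads "$tmpdir/toy.fq2")
echo "reads in file 1: $n1, reads in file 2: $n2"

if [ "$(read_names "$tmpdir/toy.fq1")" = "$(read_names "$tmpdir/toy.fq2")" ]; then
    echo "read names are in the same order"
else
    echo "read names differ between the two files"
fi
```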

What would you type to print the first 12 lines of sample.fq2?

Fastq sequence names

The name of a sequence is unique and can encode some useful information. e.g.

                @HWUSI-EAS100R:6:73:941:1973#0/1

The name of the sequencer (HWUSI-EAS100R)
The flow cell lane (6)
Tile number within the lane (73)
x co-ordinate within the tile (941)
y co-ordinate within the tile (1973)
#0 index number for a multiplexed sample
/1; the member of a pair, /1 or /2 (paired-end or mate-pair reads only)

However, this depends on instrument setup and processing pipelines. Sometimes the tile and coordinate data is omitted to save space.
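
As a rough illustration, the fields of a name in this particular layout can be pulled apart with awk. The function name parse_read_name is hypothetical, and the sketch only handles the seven-field layout shown above; names produced by other instruments or pipelines will need different handling.

```shell
# Split a read name of the form @instrument:lane:tile:x:y#index/pair into its fields
parse_read_name() {
    echo "$1" | awk '{
        sub(/^@/, "")                  # drop the leading @
        n = split($0, f, "[#:/]")      # fields are separated by ":", "#" and "/"
        if (n != 7) { print "unexpected name layout"; exit 1 }
        print "instrument=" f[1]
        print "lane=" f[2]
        print "tile=" f[3]
        print "x=" f[4]
        print "y=" f[5]
        print "index=" f[6]
        print "pair_member=" f[7]
    }'
}

parse_read_name '@HWUSI-EAS100R:6:73:941:1973#0/1'
```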

Fastq quality scores

As we saw earlier, the process of deciding which base is present at each cycle of each fragment comes with some probability (p) that we make a mistake. The quality score expresses our confidence in a particular base-call; higher quality score, higher confidence

One such score for each base of sequencing. i.e. 100 scores for 100 bases of sequencing
These are of importance if we want to call SNVs etc.
- need to be certain that differences detected from the reference genome are legitimate, and not caused by sequencing error

The raw base-calling probabilities are converted to text characters to make it easier to store in a file

                N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############

First of all, we convert the base-calling probability (p) into a Q score using the formula

Quality scores \[ Q = -10 \log_{10} p \]
- Q = 30, p=0.001
- Q = 20, p=0.01
- Q = 10, p=0.1
These numeric quantities are encoded as ASCII characters
- ASCII codes 1 - 32 have historical uses such as start of text, carriage return, new line etc
- We need an offset of at least 33 to get to meaningful characters
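
The conversion from probability to character can be sketched with awk, assuming a Phred+33 encoding. The function name phred33_char is ours; it rounds Q to the nearest integer and prints Q followed by the character whose ASCII code is Q + 33.

```shell
# Convert an error probability p into a Phred score Q and its Phred+33 character
phred33_char() {
    awk -v p="$1" 'BEGIN {
        q = int(-10 * log(p) / log(10) + 0.5)   # Q = -10 log10(p), rounded to the nearest integer
        printf "%d %c\n", q, q + 33              # ASCII code is Q plus the offset of 33
    }'
}

phred33_char 0.001
phred33_char 0.01
phred33_char 0.1
```

For the three probabilities listed above this gives the characters ?, 5 and +, which have ASCII codes 63, 53 and 43.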

Annoyingly, different sequencing instruments have used different offsets over time. It's important to check what encoding has been used for your data

Most modern sequencing will be Phred+33
Tools should be able to detect what is in-use
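
One common heuristic for checking the encoding is to look at the range of ASCII codes on the quality lines: characters with codes below 59 cannot occur with an offset of 64, so their presence points to Phred+33. The sketch below is our own illustration of that idea, not how any particular tool does it; guess_offset is a hypothetical name, the sequence is a placeholder of 68 Ns, and the quality string is the example from earlier in this section.

```shell
# Report the lowest and highest ASCII codes on the quality lines and guess the offset
guess_offset() {
    awk '
    BEGIN {
        # awk has no built-in ord(), so build a lookup table for printable ASCII
        for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        min = 127; max = 0
    }
    NR % 4 == 0 {                      # every fourth line holds the qualities
        n = length($0)
        for (j = 1; j <= n; j++) {
            ch = substr($0, j, 1)
            if (!(ch in ord)) continue
            if (ord[ch] < min) min = ord[ch]
            if (ord[ch] > max) max = ord[ch]
        }
    }
    END {
        if (min < 59)      guess = "Phred+33"
        else if (max > 74) guess = "Phred+64"
        else               guess = "ambiguous"
        print "min=" min " max=" max " guess=" guess
    }' "$1"
}

# One made-up record carrying the example quality string
toy=$(mktemp)
qual='N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############'
bases=$(printf '%068d' 0 | tr 0 N)    # placeholder sequence of 68 Ns
printf '%s\n' '@toy_read' "$bases" '+' "$qual" > "$toy"

guess_offset "$toy"
```

A real file should be sampled over many reads before trusting the guess, since a handful of high-quality reads may not contain any low codes.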

This handy graphic from wikipedia compares the different schemes

Given a particular quality string, we have to look up the ASCII code for each character and subtract the offset to get the Q score. We can then convert to a probability using the formula:-

\[ p = 10^{-Q/10} \]

So for our particular example:

                N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############

it works out as follows:-

                Position  Character  ASCII code  Q (code - 33)  Probability
                1         N          78          45             0.00003
                2         ?          63          30             0.00100
                3         >          62          29             0.00126
                4         :          58          25             0.00316
                5         <          60          27             0.00200
                6         9          57          24             0.00398
                7         >          62          29             0.00126
                8         >          62          29             0.00126
                9         >          62          29             0.00126
                10        :          58          25             0.00316

...

                Position  Character  ASCII code  Q (code - 33)  Probability
                58        #          35          2              0.63096
                59        #          35          2              0.63096
                60        #          35          2              0.63096
                61        #          35          2              0.63096
                62        #          35          2              0.63096
                63        #          35          2              0.63096
                64        #          35          2              0.63096
                65        #          35          2              0.63096
                66        #          35          2              0.63096
                67        #          35          2              0.63096
                68        #          35          2              0.63096

For the extremely keen we can do this in R:-

                pq <- "N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############"
                code <- as.integer(charToRaw(as.character(pq)))  # ASCII code of each character
                qs <- code - 33                                   # subtract the Phred+33 offset
                qs

                ##  [1] 45 30 29 25 27 24 29 29 29 25 28 26 29 29 30 27 29 25 31 30 29 26 28
                ## [24] 28 31 31 31 29 30 28 32 32 32 27 29 28 32 31 30 21 29 19 33 28 27 29
                ## [47] 29 13 31 29 30 27 31 26 30  2  2  2  2  2  2  2  2  2  2  2  2  2

                probs <- 10^(unlist(qs)/-10)  # p = 10^(-Q/10)
                round(probs,5)

                ##  [1] 0.00003 0.00100 0.00126 0.00316 0.00200 0.00398 0.00126 0.00126
                ##  [9] 0.00126 0.00316 0.00158 0.00251 0.00126 0.00126 0.00100 0.00200
                ## [17] 0.00126 0.00316 0.00079 0.00100 0.00126 0.00251 0.00158 0.00158
                ## [25] 0.00079 0.00079 0.00079 0.00126 0.00100 0.00158 0.00063 0.00063
                ## [33] 0.00063 0.00200 0.00126 0.00158 0.00063 0.00079 0.00100 0.00794
                ## [41] 0.00126 0.01259 0.00050 0.00158 0.00200 0.00126 0.00126 0.05012
                ## [49] 0.00079 0.00126 0.00100 0.00200 0.00079 0.00251 0.00100 0.63096
                ## [57] 0.63096 0.63096 0.63096 0.63096 0.63096 0.63096 0.63096 0.63096
                ## [65] 0.63096 0.63096 0.63096 0.63096

It is possible to interrogate the fastq files in R. However, in practice we tend to use other tools such as fastqc, described in the next section.

                library(ShortRead)
                fq <- readFastq("/data/test/sample.fq1")
                fq

Fastqc Primer

(Acknowledgement to Ines De Santiago for her session at the previous summer school)

FastQC from the Babraham Institute Bioinformatics Core has emerged as the standard tool for performing quality assessment on sequencing reads

The manual for fastqc is available online and is very comprehensive; especially the parts which describe particular sections of the report. The authors also run a "QCfail" blog which discusses some sequencing QC errors they have encountered and how they were diagnosed.

A "traffic light" system is used to draw your attention to sections of the report that require further investigation. However, it is worth bearing in mind that fastqc is designed to be run on fastq files from any type of sequencing experiment and has no knowledge of the particular library preparation, or conditions that you are studying. It could be that you expect high levels of duplication or GC content. Always consider the nature of your study before taking any drastic action!

Also, fastqc will not actually do anything to your data. If you decide to trim or remove contamination from your samples, you will need to use another tool.

The sections of a `fastqc` report

Basic Statistics

Some simple statistics about the composition of your file, which can be useful to see if it has guessed the encoding correctly and identified the right number of reads. This section of the report is designed never to give a warning message.

Per-base sequence quality

This section of the report is probably the one that receives most attention. It's generally accepted that there is a degradation of quality over the duration of a sequencing run, but the extent to which the quality "drops off" should be monitored. A boxplot is produced for every base position in the read, and the central line and yellow box represent the median and inter-quartile range in the usual fashion.

Ideally, the plot should look something like the following:-

However, a warning will be triggered if the lower quartile (25% of the data) of any base is less than 10, or if the median for any base is less than 25. A failure (red cross in the traffic light system) occurs if the lower quartile for any base is less than 5, or if the median for any base is less than 20.
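
The thresholds above can be written out as a small decision rule. This is only a sketch of the rule as described in the text, not fastqc's own code; the function name classify_position is ours and it expects whole-number quality values.

```shell
# Classify one base position from its lower quartile and median quality,
# using the thresholds quoted above
classify_position() {
    lower_quartile=$1
    median=$2
    if [ "$lower_quartile" -lt 5 ] || [ "$median" -lt 20 ]; then
        echo "fail"
    elif [ "$lower_quartile" -lt 10 ] || [ "$median" -lt 25 ]; then
        echo "warn"
    else
        echo "pass"
    fi
}

classify_position 30 34   # high quality throughout
classify_position 8 26    # lower quartile below 10
classify_position 2 18    # lower quartile below 5 and median below 20
```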

Per-sequence quality scores

With this section of the report, we are checking to see if there is a population of sequences that have low quality values. A warning occurs when the most frequently observed mean quality is below 27, whereas a failure indicates that it is below 20.

Running `fastqc`

fastqc can be run from the CRUK Docker as follows;

            fastqc /data/test/sample.fq1

As a result, you should get two files; sample.fq1_fastqc.zip and sample.fq1_fastqc.html. The .zip file contains all the metrics that fastqc computes, should you wish to perform extra manipulation and visualisation beyond what fastqc offers.

Exercise

In what directory are the output files from fastqc stored?
- how can we check the contents of a particular directory?
- navigate to this directory with the file explorer and open the HTML file by double-clicking
Look at the help for fastqc
- What argument do you need to change to make the output go to /home/participant/Course_Materials/Day1/?

              fastqc -h

Appendix

For your reference, here is how the example files were created

First, we download an example bam file from the 1000 genomes project

            wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/other_exome_alignments/NA06984/exome_alignment/NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam

Then the file is downsampled to give a much reduced set of reads. The original file can be removed

            java -jar $PICARD DownsampleSam I=NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam O=random.bam P=0.1 VALIDATION_STRINGENCY=SILENT
            rm NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam

For convenience, we filter the file to keep only reads that are properly paired. Then picard is used again to extract the fastq information from the file. As the original alignments were a mix of 68 and 76 bases, we trim all the reads to 68 bases to make processing easier.

            samtools view -f 0x02 -b random.bam > paired.bam
            java -jar $PICARD SamToFastq I=paired.bam F=sample.fq1 VALIDATION_STRINGENCY=SILENT F2=sample.fq2 R1_MAX_BASES=68 R2_MAX_BASES=68

(OPTIONAL / Advanced)

If you are interested, this section covers how to trim our data

Based on these plots we may want to trim our data; fastqc will not do this for us.

Popular choices are fastx_toolkit, trimmomatic and cutadapt, all of which should be installed on your computers. As implied by the name, cutadapt is useful for removing adaptor sequences.

fastx_toolkit allows us to remove reads that do not have a high enough proportion of high-quality bases.

You can run these commands after having run the CRUK Docker shortcut

            ## Check we're in the right directory
            cd /home/participant/Course_Materials/Day1
            fastq_quality_filter -v -Q 33 -q 20 -p 75 -i /data/test/sample.fq1 -o sample_filtered.fq

The following options were used:-

            -i: input file
            -o: output file
            -v: report the number of sequences
            -Q: 33, determines the input quality ASCII offset
            -q: 20, the quality value required
            -p: 75, the percentage of bases required to have that quality value

The output should be something like…

            Quality cut-off: 20
            Minimum percentage: 75
            Input: 6290535 reads.
            Output: 5399767 reads.
            discarded 890768 (14%) low-quality reads.
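
The rule behind -q 20 -p 75 can be sketched in awk: a read is kept only if at least 75% of its bases have Q of 20 or more. This is our own illustration of the idea, not the fastx_toolkit implementation, and the tool may treat edge cases differently. The function name filter_by_quality and the three toy reads are made up, and a Phred+33 offset is assumed.

```shell
# Keep a read only if at least min_pct percent of its bases have quality >= min_q
# Usage: filter_by_quality FILE MIN_Q MIN_PCT
filter_by_quality() {
    awk -v offset=33 -v min_q="$2" -v min_pct="$3" '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    { rec[NR % 4] = $0 }               # collect the four lines of a record
    NR % 4 == 0 {
        good = 0; n = length($0)
        for (j = 1; j <= n; j++)
            if (ord[substr($0, j, 1)] - offset >= min_q) good++
        if (n > 0 && 100 * good / n >= min_pct)
            print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
    }' "$1"
}

# Three made-up reads: I is Q40 and # is Q2 under Phred+33
toy=$(mktemp)
printf '%s\n' '@a' 'ACGT' '+' 'IIII' \
              '@b' 'ACGT' '+' 'I###' \
              '@c' 'ACGT' '+' 'III#' > "$toy"

filter_by_quality "$toy" 20 75
```

Reads a and c are kept (100% and 75% of bases at Q20 or above) while read b is discarded (25%).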

fastx_toolkit can also do straightforward trimming to a particular read length

            fastx_trimmer -v -f 1 -l 60 -i /data/test/sample.fq1 -o sample.fastx.trimmed.fq1

The options specified were:-

            -f: first base to keep
            -l: last base to keep
            -i: input file
            -o: output file
            -v: report number of sequences

Trimmomatic is interesting as it allows for differently-sized reads.

To use trimmomatic we can run the following in a Terminal. Here $TRIMMOMATIC is used to refer to the particular location that the tool has been installed to.

            java -jar $TRIMMOMATIC SE -phred33 /data/test/sample.fq1 sample.trimmed.fq1 TRAILING:3 MINLEN:30

            SE: run single-end mode
            -phred33: use the offset of 33 to interpret quality scores
            TRAILING:3: cut bases from the end of the read if their quality is below 3
            MINLEN:30: drop reads shorter than 30 bases after trimming
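
The effect of the TRAILING and MINLEN steps can be illustrated with a short awk sketch. It is not Trimmomatic's own code, only the idea as described above: remove bases from the end of the read while their quality is below the threshold, then drop the read if it has become too short. The function name trim_trailing and the toy reads are made up, and Phred+33 is assumed.

```shell
# Trim low-quality bases from the end of each read, then drop reads that end up too short
# Usage: trim_trailing FILE MIN_Q MIN_LEN
trim_trailing() {
    awk -v offset=33 -v min_q="$2" -v min_len="$3" '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    { rec[NR % 4] = $0 }
    NR % 4 == 0 {
        keep = length($0)
        # step back from the end while the last remaining base is below min_q
        while (keep > 0 && ord[substr($0, keep, 1)] - offset < min_q) keep--
        if (keep >= min_len)
            print rec[1] "\n" substr(rec[2], 1, keep) "\n" rec[3] "\n" substr($0, 1, keep)
    }' "$1"
}

# Two made-up reads: I is Q40 and # is Q2 under Phred+33
toy=$(mktemp)
printf '%s\n' '@r1' 'ACGTAC' '+' 'IIII##' \
              '@r2' 'ACGTAC' '+' 'I#####' > "$toy"

# A minimum length of 3 is used here only because the toy reads are so short
trim_trailing "$toy" 3 3
```

Read r1 is trimmed to four bases and kept, whereas r2 is left with a single base and is dropped. Applied to the 68-character quality string shown earlier, the same rule would remove the 13 trailing # characters (Q = 2), leaving 55 bases.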

We can interrogate the files nosotros have just created using the ShortRead package.

How many were removed?

            length(trimmed.fq)/length(fq)

            summary(width(sread(trimmed.fq))) hist(width(sread(trimmed.fq)))

Nosotros can verify why some of the reads were removed

            old.ids <- as.grapheme(id(fq)) ids.left <- every bit.character(id(trimmed.fq))  missing.ids <- setdiff(old.ids,ids.left) myquals <- quality(fq) myquals[old.ids %in% missing.ids]

Do

Re-run fastqc and observe what event this has on the results

              fastqc sample.trimmed.fq1 fastqc sample.fastx.trimmed.fq1 fastqc sample.filtered.fq1


Source: https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2017/Day1/Session4-seqIntro.html
