What Is the Difference Between Single End Read and Double Ended Read
Cast your minds back a few years..
Plenty of success stories with microarrays
Why do sequencing?
Microarrays vs sequencing
- Probe design issues with microarrays
- 'Dorian Gray issue' http://www.biomedcentral.com/1471-2105/5/111
- '…mappings are frozen, as a Dorian Gray-like syndrome: the apparent eternal youth of the mapping does not reflect that somewhere the 'picture of it' decays'
- Sequencing data are 'future proof'
- if a new genome version comes along, just re-align the data!
- can grab published data from public repositories and re-align to your own choice of genome / transcripts and aligner
- Limited number of novel findings from microarrays
- can't discover what you're not looking for!
- Genome coverage
- some areas of genome are problematic to design probes for
- Fusion genes, re-arrangements and complex events
- not possible with microarray technology
- Maturity of analysis techniques
- on the other hand, assay methods and workflows for microarrays are well-established
- until recently…
What did we learn from arrays?
- Experimental Design; despite this fancy new technology, if we don't design the experiments properly we won't get meaningful conclusions
- Quality assessment; Yes, NGS experiments can still go wrong!
- Normalisation; NGS data come with their own set of biases and errors that need to be accounted for
- Reproducibility: this is a good thing
- Plenty of tools and workflows were established.
- Don't forget about arrays; the data are all out there somewhere waiting to be discovered and explored
- e.g. repositive.io platform for browsing multiple repositories for data on a particular condition
Illumina sequencing overview*
This video gives an overview of the 'sequencing-by-synthesis' approach used by Illumina. Other companies will have different techniques, but Illumina is probably the most popular sequencing technology out there. For most of what we will discuss, it won't actually matter how your samples were sequenced.
* Other sequencing technologies are available
Illumina sequencing
Images from:- http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf
A library is prepared by breaking the DNA of interest into shorter, single-strand fragments. Adapters are added to each end of the fragments. A flow-cell has already been prepared to have a "lawn" of sequences that are complementary to the adapters.
Bridge amplification forms clusters of identical fragments on the flow-cell surface. This is required as we're going to be taking images of the flow-cell and need many copies of each fragment so that we get a decent signal. However, as we'll see later, this can introduce errors.
The construction of a sequencing read involves adding a single, terminated, DNA base (each given a distinct fluorescent label) one at a time, simultaneously across the whole flow-cell, and taking an image.
So, we try adding an "A" with a red label and take an image
and then a "T" with a green label, and take another image
Analysis of the images can determine which base was successfully incorporated into each fragment.
The process of adding each base repeats for the next cycle, and so on for "N" cycles, e.g. 100 times for a 100-base sequence.
Although this procedure is highly optimised and refined, the sheer number of reactions being performed means that errors are inevitable. The signal from a particular cluster could be affected by interference from its neighbours
Or sometimes we get a bit over-excited and add too many bases
Therefore the identification of bases comes with some degree of uncertainty which we must capture. This uncertainty can be incorporated into our analysis, as we will see later.
Image processing
- Sequencing produces high-resolution .TIFF images; not unlike microarray data
- 100 tiles per lane, 8 lanes per flow cell, 100 cycles
- 4 images (A,G,C,T) per tile per cycle = 320,000 images
- Each .TIFF image ~ 7Mb = 2,240,000 Mb of data (2.24TB)
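The arithmetic behind these figures is easy to check in the shell. This is only a back-of-the-envelope sketch; the variable names are our own, and 7Mb per image is the approximate figure quoted in the list.

```shell
# Figures taken from the list above
tiles_per_lane=100
lanes=8
cycles=100
images_per_tile_per_cycle=4   # one image per base: A, G, C, T
mb_per_image=7                # approximate size of a single TIFF image

images=$(( tiles_per_lane * lanes * cycles * images_per_tile_per_cycle ))
total_mb=$(( images * mb_per_image ))

echo "$images images, roughly $total_mb Mb"
```

Dividing the total by 1,000,000 gives the 2.24TB mentioned for a single run.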
Base-calling
- "Bustard"
- "Uses cluster intensities and noise estimate to output the sequence of bases read from each cluster, along with a confidence level for each base."
- http://openwetware.org/wiki/BioMicroCenter:IlluminaDataPipeline
- You will never have to do this
- In fact, the TIFF images are deleted by the instrument on-the-fly
Raw reads
- The most basic file type you will see is probably going to be fastq
- Data in public-repositories (e.g. Short Read Archive, GEO) tend to be in this format
- This represents all sequences created after the imaging process
- No idea at this stage whether the sequences will align or not
- No standard file extension. .fq, .fastq, .sequence.txt
- Essentially they are text files
- Can be manipulated with standard unix tools; e.g. cat, head, grep, more, less
- They can be compressed and appear as .fq.gz
- Same format regardless of sequencing protocol (i.e. RNA-seq, ChIP-seq, DNA-seq etc)
- Each sequence is described over four lines
- For paired-end data you get two files
- they should have the same number of lines
- the sequences should be in the same order
We don't need any special software to view these, but bear in mind there can be ~ 250 Million reads (sequences) per Hi-Seq lane. So using Word or Notepad is probably not a great idea.
An example .fastq file has been provided for you in the folder /data/test. If you are curious how this fastq file was generated, you can see the Appendix for details.
We can launch the CRUK Docker from the Desktop and navigate to the directory containing the files using the cd command
- You don't need to type the full path. Start typing and use the Tab key to auto-complete
    cd /data/test
If we wanted to check the directory we are currently looking at, we can use the pwd command.
    pwd
The ls command will list all the files in this directory
    ls
Assuming the files are present, we can print the first few lines using the standard unix command head
- We set the argument -n 12 to print the first 12 lines of the file sample.fq1
    head -n 12 sample.fq1
The unix command wc can count the number of lines in a file with the option -l
    wc -l sample.fq1
What would you type to print the first 12 lines of sample.fq2?
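Because each read occupies exactly four lines, a line count can be turned into a read count, and the two files of a paired-end run can be compared. The following is a minimal sketch rather than a full validator: count_reads and same_read_count are hypothetical helper names, the tiny demo files are made up so that the example runs anywhere, and it only handles uncompressed files.

```shell
# Hypothetical helper: number of reads in an uncompressed fastq file
# (every read takes up four lines)
count_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Hypothetical helper: succeeds only if both files hold the same number of reads
same_read_count() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}

# Tiny made-up pair of files, two reads each
tmp=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > "$tmp/demo.fq1"
printf '@r1/2\nGGCA\n+\nIIII\n@r2/2\nCATG\n+\nIIII\n' > "$tmp/demo.fq2"

count_reads "$tmp/demo.fq1"
if same_read_count "$tmp/demo.fq1" "$tmp/demo.fq2"; then
    echo "pair looks consistent"
fi
```

On the course data the same functions could be pointed at sample.fq1 and sample.fq2. Note that this does not check that the reads are in the same order in both files.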
Fastq sequence names
The name of a sequence is unique and can encode some useful information. e.g.
    @HWUSI-EAS100R:6:73:941:1973#0/1
- The name of the sequencer (HWUSI-EAS100R)
- The flow cell lane (6)
- Tile number within the lane (73)
- x co-ordinate within the tile (941)
- y co-ordinate within the tile (1973)
- #0 index number for a multiplexed sample
- /1; the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
However, this depends on instrument setup and processing pipelines. Sometimes the tile and coordinate data are omitted to save space.
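To make the layout of the name concrete, the fields can be pulled apart with awk. This is a sketch that only understands names laid out exactly like the example (instrument:lane:tile:x:y#index/pair); parse_read_name is a hypothetical helper name, and names produced by other instruments or pipelines will need different handling.

```shell
# Hypothetical helper: split a read name of the form
#   @instrument:lane:tile:x:y#index/pair
# into labelled fields. Other naming schemes are not handled.
parse_read_name() {
    printf '%s\n' "$1" | awk -F'[:#/]' '{
        sub(/^@/, "", $1)   # drop the leading @ from the instrument name
        printf "instrument=%s lane=%s tile=%s x=%s y=%s index=%s pair=%s\n", $1, $2, $3, $4, $5, $6, $7
    }'
}

parse_read_name '@HWUSI-EAS100R:6:73:941:1973#0/1'
```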
Fastq quality scores
As we saw earlier, the process of deciding which base is present at each cycle of each fragment comes with some probability (p) that we make a mistake. The quality score expresses our confidence in a particular base-call; higher quality score, higher confidence
- One such score for each base of sequencing. i.e. 100 scores for 100 bases of sequencing
- These are of importance if we want to call SNVs etc.
- need to be certain that differences detected from the reference genome are legitimate, and not caused by sequencing error
The raw base-calling probabilities are converted to text characters to make it easier to store in a file
    N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############
First of all, we convert the base-calling probability (p) into a Q score using the formula
- Quality scores \[ Q = -10\log_{10}p \]
- Q = 30, p=0.001
- Q = 20, p=0.01
- Q = 10, p=0.1
- These numeric quantities are encoded as ASCII code
- ASCII codes 1 - 32 have historical uses such as start of text, carriage return, new line etc
- At least 33 to get to meaningful characters
Annoyingly, different sequencing instruments have used different offsets over time. It's important to check what encoding has been used for your data
- Most modern sequencing will be Phred+33
- Tools should be able to detect what is in use
This handy graphic from wikipedia compares the different schemes
Given a particular quality string, we have to look up the ASCII code for each character and subtract the offset to get the Q score. We can then convert to a probability using the formula:-
\[ p = 10^{-Q/10} \]
So for our particular example:
    N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############
it works out as follows:-
Position  Character  Code  Minus offset (33)  Probability
1         N          78    45                 0.00003
2         ?          63    30                 0.00100
3         >          62    29                 0.00126
4         :          58    25                 0.00316
5         <          60    27                 0.00200
6         9          57    24                 0.00398
7         >          62    29                 0.00126
8         >          62    29                 0.00126
9         >          62    29                 0.00126
10        :          58    25                 0.00316
...
58        #          35    2                  0.63096
59        #          35    2                  0.63096
60        #          35    2                  0.63096
61        #          35    2                  0.63096
62        #          35    2                  0.63096
63        #          35    2                  0.63096
64        #          35    2                  0.63096
65        #          35    2                  0.63096
66        #          35    2                  0.63096
67        #          35    2                  0.63096
68        #          35    2                  0.63096
For the extremely keen we can do this in R:-
    pq <- "N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############"
    ## ASCII code of each character in the quality string
    code <- as.integer(charToRaw(as.character(pq)))
    ## subtract the Phred+33 offset to get the Q scores
    qs <- code - 33
    qs
    ##  [1] 45 30 29 25 27 24 29 29 29 25 28 26 29 29 30 27 29 25 31 30 29 26 28
    ## [24] 28 31 31 31 29 30 28 32 32 32 27 29 28 32 31 30 21 29 19 33 28 27 29
    ## [47] 29 13 31 29 30 27 31 26 30  2  2  2  2  2  2  2  2  2  2  2  2  2
    ## convert each Q score back to an error probability
    probs <- 10^(unlist(qs)/-10)
    round(probs,5)
    ##  [1] 0.00003 0.00100 0.00126 0.00316 0.00200 0.00398 0.00126 0.00126
    ##  [9] 0.00126 0.00316 0.00158 0.00251 0.00126 0.00126 0.00100 0.00200
    ## [17] 0.00126 0.00316 0.00079 0.00100 0.00126 0.00251 0.00158 0.00158
    ## [25] 0.00079 0.00079 0.00079 0.00126 0.00100 0.00158 0.00063 0.00063
    ## [33] 0.00063 0.00200 0.00126 0.00158 0.00063 0.00079 0.00100 0.00794
    ## [41] 0.00126 0.01259 0.00050 0.00158 0.00200 0.00126 0.00126 0.05012
    ## [49] 0.00079 0.00126 0.00100 0.00200 0.00079 0.00251 0.00100 0.63096
    ## [57] 0.63096 0.63096 0.63096 0.63096 0.63096 0.63096 0.63096 0.63096
    ## [65] 0.63096 0.63096 0.63096 0.63096
It is possible to interrogate the fastq files in R. However, in practice we tend to use other tools such as fastqc described in the next section.
    library(ShortRead)
    fq <- readFastq("/data/test/sample.fq1")
    fq
Fastqc Primer
(Acknowledgement to Ines De Santiago for her session at the previous summer school)
FastQC from the Babraham Institute Bioinformatics Core has emerged as the standard tool for performing quality assessment on sequencing reads
The manual for fastqc is available online and is very comprehensive; especially the parts which describe particular sections of the report. The authors also run a "QCfail" blog which discusses some sequencing QC errors they have encountered and how they were diagnosed.
A "traffic light" system is used to draw your attention to sections of the report that require further investigation. However, it is worth bearing in mind that fastqc is designed to be run on fastq files from any type of sequencing experiment and has no knowledge of the particular library preparation, or conditions that you are studying. It could be that you expect high levels of duplication or GC content. Always consider the nature of your study before taking any drastic action!
Also, fastqc will not actually do anything to your data. If you decide to trim or remove contamination from your samples, you will need to use another tool.
The sections of a fastqc report
- Basic Statistics
Some simple statistics about the composition of your file, which can be useful to see if it has guessed the encoding correctly and identified the right number of reads. This section of the report is designed never to give a warning message
- Per-base sequence quality
This section of the report is probably the one that receives most attention. It's generally accepted that there is a degradation of quality over the duration of a sequencing run, but the extent to which the quality "drops-off" should be monitored. A boxplot is produced for every base-position in the read and the central line and yellow box represent the median and inter-quartile range in the usual fashion.
Ideally, the plot should look something like the following:-
However, a warning will be triggered if the lower quartile (25% of the data) for any base is less than 10, or if the median for any base is less than 25. A failure (red cross in the traffic light system) occurs if the lower quartile for any base is less than 5, or if the median for any base is less than 20.
- Per-sequence quality scores
With this section of the report, we are checking to see if there is a population of sequences that have low quality values. A warning occurs when the mean quality is below 27, whereas a failure indicates a mean below 20.
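The warning and failure rules quoted for the per-base sequence quality section can be written out as a small decision function. This is purely illustrative and is not part of fastqc: per_base_flag is a hypothetical helper that takes the lower quartile and median for a single base position as whole numbers and applies the thresholds stated in the text.

```shell
# Hypothetical helper (not part of fastqc): classify one base position
# from its lower quartile ($1) and median ($2) quality, as whole numbers.
per_base_flag() {
    lower_quartile=$1
    median=$2
    if [ "$lower_quartile" -lt 5 ] || [ "$median" -lt 20 ]; then
        echo fail
    elif [ "$lower_quartile" -lt 10 ] || [ "$median" -lt 25 ]; then
        echo warn
    else
        echo pass
    fi
}

per_base_flag 30 36   # good quality throughout
per_base_flag 8 30    # lower quartile has dropped below 10
per_base_flag 12 18   # median has dropped below 20
```

In a real report the overall flag for the section is driven by the worst base position, so a single bad position late in the read is enough to trigger it.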
Running fastqc
fastqc can be run from the CRUK Docker as follows;
    fastqc /data/test/sample.fq1
As a result, you should get two files; sample.fq1_fastqc.zip and sample.fq1_fastqc.html. The .zip file contains all the metrics that fastqc computes, should you wish to perform extra manipulation and visualisation beyond what fastqc offers.
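As an example of the kind of extra manipulation that is possible, the sketch below lists the sections that did not pass. It assumes that the unzipped output contains a summary.txt file with three tab-separated columns (status, section name, file name), which is what we have seen in FastQC output but is worth checking for your version. flagged_modules is a hypothetical helper name and the summary contents here are made up for illustration.

```shell
# Hypothetical helper: print every section of a FastQC summary.txt
# whose status is not PASS. Assumes tab-separated columns:
# status, section name, file name.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

# Made-up summary for illustration; a real one comes from unzipping the FastQC output
tmp=$(mktemp -d)
printf 'PASS\tBasic Statistics\tsample.fq1\n' > "$tmp/summary.txt"
printf 'WARN\tPer base sequence quality\tsample.fq1\n' >> "$tmp/summary.txt"
printf 'FAIL\tPer sequence quality scores\tsample.fq1\n' >> "$tmp/summary.txt"

flagged_modules "$tmp/summary.txt"
```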
Exercise
- In what directory are the output files from fastqc stored?
- how can we check the contents of a particular directory?
- navigate to this directory with the file explorer and open the HTML file by double-clicking
- Look at the help for fastqc
- What argument do you need to change to make the output go to /home/participant/Course_Materials/Day1/?
    fastqc -h
Appendix
For your reference, here is how the example files were created
First, we download an example bam file from the 1000 genomes project
    wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/other_exome_alignments/NA06984/exome_alignment/NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam
Then the file is downsampled to give a much reduced set of reads. The original file can be removed
    java -jar $PICARD DownsampleSam I=NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam O=random.bam P=0.1 VALIDATION_STRINGENCY=SILENT
    rm NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam
For convenience, we filter the file to keep only reads that are properly paired. Then picard is used again to extract the fastq information from the file. As the original alignments were a mix of 68 and 76 bases, we trim all the reads to 68 bases to make processing easier.
    samtools view -f 0x02 -b random.bam > paired.bam
    java -jar $PICARD SamToFastq I=paired.bam F=sample.fq1 VALIDATION_STRINGENCY=SILENT F2=sample.fq2 R1_MAX_BASES=68 R2_MAX_BASES=68
(OPTIONAL / Advanced)
If you are interested, this section covers how to trim our data
Based on these plots we may want to trim our data; fastqc will not do this for us.
Popular choices are fastx_toolkit, trimmomatic and cutadapt; all of which should be installed on your computers. As implied by the name, cutadapt is useful for removing adaptor sequences.
fastx_toolkit allows us to remove reads that do not have a high enough proportion of high-quality bases.
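The idea of filtering on the proportion of high-quality bases can be sketched in a few lines of shell. This is not how fastx_toolkit is implemented; it is a minimal illustration that assumes Phred+33 encoding and keeps a read only if at least a given percentage of its bases reach a minimum Q score. passes_quality_filter is a hypothetical helper, and the thresholds of 20 and 75 are simply example values.

```shell
# Hypothetical helper: succeed if at least $3 percent of the bases in the
# Phred+33 quality string $1 have a Q score of $2 or more.
passes_quality_filter() {
    printf '%s' "$1" | od -A n -v -t u1 | awk -v min_q="$2" -v min_pct="$3" '
        {
            for (i = 1; i <= NF; i++) {
                total++
                if ($i - 33 >= min_q) good++   # ASCII code minus offset = Q score
            }
        }
        END { exit !(total > 0 && 100 * good / total >= min_pct) }'
}

# The example quality string from earlier: 53 of its 68 bases have Q >= 20
qual='N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############'
if passes_quality_filter "$qual" 20 75; then
    echo keep
else
    echo discard
fi
```

With these thresholds the example read is kept (about 78% of its bases reach Q20); asking for 80% instead would discard it.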
You can run these commands after having run the CRUK Docker shortcut
    ## Check we're in the right directory
    cd /home/participant/Course_Materials/Day1
    fastq_quality_filter -v -Q 33 -q 20 -p 75 -i /data/test/sample.fq1 -o sample_filtered.fq
the options used were:-
-i: input file
-o: output file
-v: report the number of sequences
-Q 33: the ASCII offset of the input quality scores
-q 20: the quality value required
-p 75: the percentage of bases required to have that quality value
the output should be something like…
    Quality cut-off: 20
    Minimum percentage: 75
    Input: 6290535 reads.
    Output: 5399767 reads.
    discarded 890768 (14%) low-quality reads.
It can also do straightforward trimming to a particular read length
    fastx_trimmer -v -f 1 -l 60 -i /data/test/sample.fq1 -o sample.fastx.trimmed.fq1
the options specified were:-
-f: first base to keep
-l: last base to keep
-i: input file
-o: output file
-v: report number of sequences
Trimmomatic is interesting as it allows for differently-sized reads.
To use trimmomatic we can run the following in a Terminal. Here $TRIMMOMATIC is used to refer to the particular location that the tool has been installed to.
    java -jar $TRIMMOMATIC SE -phred33 /data/test/sample.fq1 sample.trimmed.fq1 TRAILING:3 MINLEN:30
SE: run single-end mode
-phred33: use the offset of 33 to interpret quality scores
TRAILING:3: cut bases from the end of the read if below quality 3
MINLEN:30: discard reads that are shorter than 30 bases after trimming
We can interrogate the files we have just created using the ShortRead package.
How many were removed?
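    ## Read the trimmed file (assumes R is running in the directory where trimmomatic wrote it)
    trimmed.fq <- readFastq("sample.trimmed.fq1")
    ## proportion of reads that survived trimming
    length(trimmed.fq)/length(fq)
    summary(width(sread(trimmed.fq)))
    hist(width(sread(trimmed.fq)))
We can verify why some of the reads were removed
    old.ids <- as.character(id(fq))
    ids.left <- as.character(id(trimmed.fq))
    missing.ids <- setdiff(old.ids,ids.left)
    myquals <- quality(fq)
    myquals[old.ids %in% missing.ids]
Exercise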
- Re-run fastqc and observe what effect this has on the results
    fastqc sample.trimmed.fq1
    fastqc sample.fastx.trimmed.fq1
    fastqc sample_filtered.fq
Source: https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2017/Day1/Session4-seqIntro.html