
Volume 2, Issue 9, September 2017 International Journal of Innovative Science and Research Technology

ISSN No: 2456-2165

CpG Frequency Analysis in Human Genome Using R Programming
S. Balamurugan* and Dr. S Prasanna
Department of Computer Applications, Vels University,
Pallavaram, Chennai 600 117, Tamil Nadu, India
sivabala76@gmail.com, prasanna.scs@velsuniv.ac.in

Abstract:- In the present information society, biological data are increasing enormously. After the successful completion of the Human Genome Project (HGP), the complete human genome sequence (reference genome) was made available in online resources for bioinformatics tools and services. These data sets are large and complex (Big data), but analyzing and comparing such massive amounts of genomic sequence paves the way to understanding complex diseases and finding personalized medicines. Consequently, it is important that every individual should know about the human genome. Resources for accessing the human genome are plentiful, but they can be used only by researchers who are able to supply appropriate keywords, which is not feasible for a biologist with little computer knowledge. Therefore, convenient and intuitive data interfaces are important for easily accessing, downloading, visualizing and analyzing human genomic data. With these requirements in mind, the present study proposes the development of a web application using R programming along with Bioconductor packages for CpG site frequency analysis of CpG Islands, CpG non-islands and CpG Island shores and shelves (downstream and upstream) in the human reference genome. Later, the App will be hosted for the users' further analysis.

Keywords: Human Genome; Human Genome Project; CpG Island; R and Bioconductor.

I. INTRODUCTION

Frequency analysis, that is, the occurrence of the four nitrogenous bases (A, T, G, C) on the 22 autosomes and the X and Y sex chromosomes of the human reference genome, is the first and most basic analysis. [1] The second important basic feature is dinucleotide frequency, such as GC and AT content analysis. [2] Analyzing the frequency of GC content is important because the GC content determines the stability of DNA, and the length of the coding sequence is directly proportional to GC content. [3] Similarly, CpG sites are the regions of DNA where a Cytosine is followed linearly by a Guanine. [4] Studying methylation at CpG sites is important because they act as gene markers and are involved in gene regulation. They play a key role in disease onset by silencing or expressing particular gene action. Based on CpG frequencies, CpGs are classified as CpG Island (CGI), Non CpG Island (non-CGI), CpG Island shores (CGI shores) and CpG Island shelves (CGI shelves). [5]

A. CpG Island (CGI)

CGIs are genomic regions (~1000 base pairs long) with high frequencies of CpG sites in a GC-rich sequence. The p in CpG refers to the phosphodiester bond between Cytosine and Guanine, which indicates that C and G are next to each other in a sequence. [6,7,8] In humans, 40% of CGIs are found in gene promoters, and the remaining 60%, located in the remaining portion of the sequence, have been termed orphan CGIs. [9]

B. Algorithms for CGI Identification

The criteria of the Gardiner-Garden and Frommer (1987) [10] algorithm are: length over 200 base pairs, GC content over 50%, and a ratio of observed to expected number of CpG dinucleotides over 0.60. The criteria of the Takai and Jones (2002) [11] algorithm are: length over 500 base pairs, GC content over 55%, and a ratio of observed to expected number of CpG dinucleotides over 0.65. Another algorithm, CpGcluster, detects CpG clusters through statistical significance based on the physical distance between neighboring CpG dinucleotides in a chromosome. [12] Of the three algorithms introduced above, the present study adopted the Gardiner-Garden and Frommer algorithm as its main algorithm.

C. CGI Shores and Shelves (Up and Downstream)

CGIs are interspaced by long stretches of highly methylated, CpG-poor regions that are found both within and between genes. The CGI shore is the flanking region from 0 to 2 kb on either side of a CGI, and the CGI shelf is the flanking region from 2 to 4 kb on either side of a CGI. [9]

D. Non CpG Island (Non-CGI)

A Non CpG Island is a region upstream or downstream of a CpG Island where CpG sites are absent or do not meet the conditions specified by the algorithms described above. In non-

IJISRT17SP14 www.ijisrt.com 34

CGI, a CpG site is not located in a promoter, a gene body, a CGI, a CGI shore or a CGI shelf, but in the open sea. As in CpG islands, DNA methylation is also found in non-CpG islands, where it is likewise important in epigenetic gene regulation and brain function. [13]

A number of software tools are available to predict CpG islands. CpGProD identifies mammalian promoter regions associated with CpG islands in large genomic sequences. [14] Newcpgseek is another tool, which scores each position of a CpG site in the sequence using a running sum calculated from all positions in the sequence, starting with the first and ending with the last. [15] The tools cpgplot and newcpgreport identify CpG islands in one or more nucleotide sequences. [15] CgiHunter is a tool used for CpG island annotation, and it has been proven to identify all genome regions (http://cgihunter.bioinf.mpi-inf.mpg.de). CpGPAP (CpG island Predictor Analysis Platform) is a web-based application that provides an interface for predicting CpG islands in genome sequences or in user input sequences. [16] However, these tools are mostly accessible only to bioinformaticians. To our knowledge, there is no separate web application available for visualizing CpG sites in different regions of the human genome. The present study proposes the development of a web application to provide visualization and comparison of CGI vs non-CGI, CGI vs CGI shores and CGI vs CGI shelves.

II. MATERIALS AND METHODS

The web application proposed in the present study is written entirely in the open-source R programming language. [17] The methodology adopted for developing the SAFA-HG App is described as follows.

A. Data Resources

The experimental data used for the present study were retrieved from UCSC for the CpG frequency analysis. For CpG Island analysis, the table was downloaded in *.txt format (cpgislandExtUnmasked.txt). Similarly, to analyze CpG Island shores and shelves, the same table was downloaded in *.bed file format (cpgislandExtUnmasked.bed) from the UCSC genome browser (https://genome.ucsc.edu/cgi-bin/hgTables) using the library (BSgenome.Hsapiens.UCSC.hg19).

a). R Programming Language Related Interfaces and Packages

R is an open-source programming language and software environment for statistical computing and graphics. RStudio is an integrated development environment (IDE) for R that provides an alternative interface to R. In the present study, RStudio and its dependent Bioconductor packages are used for nucleotide frequency (CpG) analysis on the human reference genome. Bioconductor is a collection of R packages for the analysis and comprehension of high-throughput genomic data. It consists of 1296 software packages, 309 experiment data packages, and 933 up-to-date annotation packages. The Bioconductor project provides packages such as Biostrings and BSgenome that help to access the full genome sequences of a given organism from data resources for sequence analysis. These packages are called Biostrings-based genome data packages and require the BSgenome package to work properly. As mentioned earlier, the full genome sequences for human as provided by UCSC (hg19) are stored in Biostrings objects. [18] Biostrings provides memory-efficient string containers, string matching algorithms, and other utilities for fast manipulation of large biological sequences or sets of sequences. RSQLite embeds the SQLite database engine, providing a DBI-compliant interface. The DBI package defines a common interface between R and database management systems (DBMS). sqldf() transparently sets up a database and imports the data frames into that database. These packages assist in managing data in the R environment.

III. RESULTS AND DISCUSSION

Since Mendel's discoveries in genetics, more than 6000 genetic disorders have been studied, but we still do not have a clear understanding of many of their roles in health and disease. [19]

Owing to the Human Genome Project and personal genomics, the amount of human genome information in data resources such as NCBI and PGP has grown exponentially over the past few years. Because of this increase, human genome data are one of the important Big data sources existing today. Research in fields such as computational biology, bioinformatics and systems biology is mostly concerned with analyzing these genomic data to improve human health. With this genomic revolution, it is possible that in future all individuals may have their own genome information as a personal medical card and genetic passport, which the physician can consult to provide personalized medicine. [20,21] It is therefore important that everyone should have knowledge of the human genome and its features. In this regard, the present study proposes the development of an interactive web application for CpG analysis in the human genome using R programming.

This application has the power and flexibility to reside on a local computer or to serve as a web-based environment, enabling easy sharing and visualization of data for biological researchers with little computer knowledge. Unlike the traditional approach (downloading the data and storing them on a local hard drive for analysis), the present study acquires the data online (real-time data), as the human genome data are too large. The results in the form of tables and plot images


(histograms) of nucleotide (CpG) frequency analysis are shown and described as follows.

A. CpG Island and Non-Island Tables

The user can view the list of CpG Island details with a minimum of 10 and a maximum of 50 entries in a sliding window screen. The user can also find details by entering a keyword in the search box. In this way, Chromosome 1 has 4332 entries covered in 280 pages, Chromosome 2 has 3464 entries covered in 231 pages, Chromosome 3 has 2681 entries covered in 179 pages, Chromosome 4 has 2473 entries covered in 165 pages, Chromosome 5 has 2637 entries covered in 176 pages, Chromosome 6 has 2647 entries covered in 177 pages, Chromosome 7 has 2841 entries covered in 190 pages, Chromosome 8 has 1982 entries covered in 133 pages, Chromosome 9 has 2308 entries covered in 154 pages, Chromosome 10 has 2095 entries covered in 140 pages, Chromosome 11 has 2295 entries covered in 153 pages, Chromosome 12 has 2429 entries covered in 162 pages, Chromosome 13 has 1429 entries covered in 96 pages, Chromosome 14 has 1540 entries covered in 103 pages, Chromosome 15 has 1456 entries covered in 98 pages, Chromosome 16 has 2176 entries covered in 146 pages, Chromosome 17 has 2505 entries covered in 167 pages, Chromosome 18 has 1089 entries covered in 73 pages, Chromosome 19 has 3275 entries covered in 219 pages, Chromosome 20 has 1288 entries covered in 86 pages, Chromosome 21 has 682 entries covered in 46 pages, Chromosome 22 has 1036 entries covered in 70 pages, Chromosome X has 1945 entries covered in 130 pages and Chromosome Y has 424 entries covered in 29 pages. End users can easily visualize the CpG sites in the CGI regions. For reference, screenshots of the first pages of CGI and non-CGI for chromosome 1 are shown in Figures 1 and 2, respectively.

B. CpG Island Plots

In the obtained histograms, the island frequencies are plotted in red and the non-island frequencies in blue. The comparative plot for chromosome 1 is shown in Figure 3. In the plot, the frequencies are not normally distributed. The graphs show density plots, which allow the whole distribution of CGI and non-CGI data to be reviewed easily. The graph was plotted after applying the Gardiner-Garden and Frommer algorithm. Users can use these outputs (both graphical and statistical) for further CGI and non-CGI analysis.

C. CpG Island and CpG Island Shores (Up and Downstream) Plot

Shores are the regions immediately flanking CpG islands (CGI); the consensus definition of a CpG shore is up to 2 kbp away from the CGI. Methylation in CGI shores (both upstream and downstream) plays a greater role in many diseases. Therefore, identifying and analyzing the CGI shores in comparison with the CpG Island is important. In the proposed web application, the present study compared CpG Island and CpG Island shores (up and downstream), and the plot for Chromosome 1 is shown in Figure 4. In the density plot, red indicates CGI frequencies, dark blue indicates CGI shores downstream and light blue indicates CGI shores upstream. As before, the frequency distribution in these plots is not normalized. CpG site information in the shore regions can be retrieved and visualized, and the significance of methylation in the context of many diseases can be studied by end users.

D. CpG Island and CpG Island Shelves (Up and Downstream) Plot

CpG shelves are defined as the 2 kb regions flanking the outside of the shores. As in the previous plots, the comparison of CGI frequencies with CGI shelves (both up and downstream) is constructed, and the plot for chromosome 1 is shown in Figure 5. In the density plot, red indicates CGI frequencies, dark blue indicates CGI shelves downstream and light blue indicates CGI shelves upstream. In these plots too, the frequency distribution is not normalized. As with CpG Island shores, users (biologists) who are interested in working on major diseases like cancer can also concentrate on the CpG Island shelf regions, both upstream and downstream.

IV. CONCLUSION

In the present study, a web application is proposed for frequency analysis of CpG sites in the human reference genome. The methodology followed is summarized in the section Materials and Methods. Through this App, visualization, download and analysis of CpG Island data can be done by end users for their further analysis. The development of this App is ongoing, and we intend to add to and improve upon the visualization and analysis features. The tool will be further developed with improved facilities to predict CpG sites responsible for diseases like cancer. Apart from CGI and its shores, shelves and non-CGI, work on CpG canyons, CpG oceans, gene bodies and gene deserts will be done. In future, the work will be continued on the above-mentioned regions and the App will be maintained.
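
The CGI criteria listed in Section I.B can be illustrated with a short sketch. It is written in Python for compactness and is not the study's R/Bioconductor code, which the paper does not reproduce. The observed/expected ratio uses the commonly cited form (number of CpG x sequence length) / (number of C x number of G); the paper does not spell this formula out, so it is an assumption here, as are all class and function names. The thresholds are applied as strict "over" comparisons, following the wording of Section I.B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CgiCriteria:
    """Thresholds for calling a sequence a CpG island (hypothetical container)."""
    min_length: int      # base pairs, exclusive lower bound
    min_gc: float        # GC fraction, exclusive lower bound
    min_obs_exp: float   # observed/expected CpG ratio, exclusive lower bound


# Values as quoted in Section I.B of the paper.
GARDINER_GARDEN_FROMMER = CgiCriteria(min_length=200, min_gc=0.50, min_obs_exp=0.60)
TAKAI_JONES = CgiCriteria(min_length=500, min_gc=0.55, min_obs_exp=0.65)


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed CpG count divided by the count expected from C and G frequencies."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = seq.count("CG")       # "CG" cannot overlap itself, so count() is exact
    expected = c * g / len(seq)
    return observed / expected


def meets_criteria(seq: str, criteria: CgiCriteria) -> bool:
    """True if the sequence satisfies all three thresholds."""
    return (
        len(seq) > criteria.min_length
        and gc_fraction(seq) > criteria.min_gc
        and cpg_obs_exp(seq) > criteria.min_obs_exp
    )
```

As an illustration, a 300 bp sequence made of repeated "CG" satisfies the Gardiner-Garden and Frommer thresholds but fails the Takai and Jones thresholds, because it is shorter than 500 bp.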

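The shore and shelf definitions in Section I.C (0 to 2 kb and 2 to 4 kb on either side of a CGI) translate directly into interval arithmetic. The following Python sketch is illustrative only and is not taken from the App. It assumes 0-based, half-open coordinates as used in BED files, treats "upstream" as the lower-coordinate side, and clips intervals at the chromosome boundaries; the function names and the optional chrom_length parameter are hypothetical.

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, half-open

FLANK = 2000  # 2 kb, the shore width and shelf width given in Section I.C


def _clip(start: int, end: int, chrom_length: Optional[int]) -> Interval:
    """Clamp an interval to [0, chrom_length]; may return an empty interval."""
    upper = end if chrom_length is None else chrom_length
    upper = max(upper, 0)
    return (min(max(start, 0), upper), min(max(end, 0), upper))


def flanking_regions(
    island_start: int,
    island_end: int,
    chrom_length: Optional[int] = None,
) -> Dict[str, Interval]:
    """Return the shore and shelf intervals on both sides of one CpG island."""
    return {
        "shelf_upstream": _clip(island_start - 2 * FLANK, island_start - FLANK, chrom_length),
        "shore_upstream": _clip(island_start - FLANK, island_start, chrom_length),
        "shore_downstream": _clip(island_end, island_end + FLANK, chrom_length),
        "shelf_downstream": _clip(island_end + FLANK, island_end + 2 * FLANK, chrom_length),
    }
```

For an island at positions 10000 to 11000, this gives shores at 8000 to 10000 and 11000 to 13000, and shelves at 6000 to 8000 and 13000 to 15000. Islands near a chromosome end get shortened or empty flanks rather than negative coordinates.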

REFERENCES

[1]. Louie E et al., Nucleotide Frequency Variation Across Human Genes, Genome Research, 2003; 13(12):2594-2601.
[2]. Beleza Yamagishi ME and Shimabukuro AI, Nucleotide Frequencies in Human Genome and Fibonacci Numbers, Bulletin of Mathematical Biology, 2008; 70(3):643.
[3]. Vinogradov AE, DNA helix: the importance of being GC-rich, Nucleic Acids Research, 2003; 31(7):1838-1844.
[4]. Sharif J et al., Divergence of CpG island promoters: a consequence or cause of evolution?, Dev Growth Differ, 2010; 52(6):545-554.
[5]. Edgar R et al., Meta-analysis of human methylomes reveals stably methylated sequences surrounding CpG islands associated with high gene expression, Epigenetics & Chromatin, 2014; 7:28, 1-12.
[6]. Bird AP, CpG rich islands and the function of DNA methylation, Nature, 1986; 321:209-213.
[7]. Larsen F et al., CpG islands as gene markers in the human genome, Genomics, 1992; 13:1095-1107.
[8]. Han L et al., CpG island density and its correlations with genomic features in mammalian genomes, Genome Biology, 2008; 9(5):R79, 1-12.
[9]. Cooper DN et al., Methylation-mediated deamination of 5-methylcytosine appears to give rise to mutations causing human inherited disease in CpNpG trinucleotides, as well as in CpG dinucleotides, Human Genomics, 2010; 4(6):406-410.
[10]. Gardiner-Garden M and Frommer M, CpG islands in vertebrate genomes, J. Mol. Biol., 1987; 196(2):261.
[11]. Takai D and Jones P, Comprehensive analysis of CpG islands in human chromosomes 21 and 22, Proc. Natl Acad. Sci., 2002; 99(6):3740-3745.
[12]. Hackenberg M et al., CpGcluster: a distance-based algorithm for CpG-island detection, BMC Bioinform, 2006; 7:446.
[13]. Jang HS et al., Review on CpG and Non-CpG Methylation in Epigenetic Gene Regulation and Brain Function, Genes, 2017; 8(148):1-20.
[14]. Ponger L and Mouchiroud D, CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences, Bioinformatics, 2002; 18(4):631-633.
[15]. Rice P et al., EMBOSS: the European Molecular Biology Open Software Suite, Trends Genet, 2000; 16(6):276-277.
[16]. Chuang LY et al., CpGPAP: CpG island predictor analysis platform, BMC Genetics, 2012; 13:13.
[17]. R Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2015.
[18]. Pagès H et al., Biostrings: String objects representing biological sequences, and matching algorithms, R package version 2.44.1, 2017.
[19]. Rehm HL et al., ClinGen--the Clinical Genome Resource, N Engl J Med, 2015; 372(23):2235-2242.
[20]. Akgun M, Privacy preserving processing of genomic data: A survey, Journal of Biomedical Informatics, 2015; 56:103-111.
[21]. Baranov VS, Genome paths: A way to personalized and predictive medicine, Acta Naturae, 2009.


LIST OF FIGURES

1. Figure 1: Table view of CGI details along with number of CpGs, Obs/Exp ratio etc. of Chromosome 1.
2. Figure 2: Table view of non-CGI details along with number of CpGs, Obs/Exp ratio etc. of Chromosome 1.
3. Figure 3: Comparative histogram plot of CGI and non-CGI of Chromosome 1.
4. Figure 4: Comparative histogram plot of CGI and CGI shores (both up and downstream) of Chromosome 1.
5. Figure 5: Comparative histogram plot of CGI and CGI shelves (both up and downstream) of Chromosome 1.

Figure 1: Table View of CGI Details Along With Number of CpGs, Obs/Exp Ratio etc. of Chromosome 1.


Figure 2: Table View of Non-CGI Details Along With Number of CpGs, Obs/Exp Ratio etc. of Chromosome 1.


Figure 3: Comparative Histogram Plot of CGI and Non-CGI of Chromosome 1.


Figure 4: Comparative Histogram Plot of CGI and CGI Shores (Both Up and Downstream) of Chromosome 1.


Figure 5: Comparative Histogram Plot of CGI and CGI Shelves (Both Up and Downstream) of Chromosome 1.
