The purpose of this website:
The main aim is to provide an easy way of introducing new sequences into
existing datasets for plant transcription factors and to explore ways of
displaying further information so that direct comparisons can be made
easily between structurally related genes to highlight similarities or
differences, thereby assisting with an understanding of gene function.
A longer term aim is to highlight TFs that
regulate secondary metabolism and provide displays to visualise
regulator and target genes together, based on experimental evidence of
their correlations in gene expression and other high throughput data types.
How to interrogate the TF families:
This process is described on the home page and is designed to be a very easy procedure,
taking away all the fiddly work of performing phylogenetic analysis from scratch.
It is important to select a TF family or subfamily that matches your query
sequence, otherwise the results returned will be meaningless. A mismatch is usually
obvious because the query sequence(s) are displaced to one end of the tree with no
relationship to any other gene in the tree. In the future, a quick pre-screening
procedure will be implemented to ensure that query sequences and families are
Query sequence(s) must be amino acid sequences in
FASTA format and should be as complete as possible over the conserved domain of
the family in question. Coverage of about 70 % or more of the domain in the query
sequence(s) should be OK, but lower coverage may 'distort' the tree. If your
sequences contain unusual deletions in the region of the domain, then
that's OK, as long as there is sufficient evolutionary 'signal' in the rest of
the sequence matching the domain; unusual insertions in the region of the domain
are omitted from the phylogenetic analysis.
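As a rough illustration of the input requirements above, the following Python sketch checks a query file before it is submitted. It is not the website's own code: the function names, the list of accepted residue letters and the 90 % nucleotide heuristic are all assumptions made for this example. Only the requirement for amino acid FASTA input and the cap of 100 sequences come from this page. Domain coverage is not checked, because that would need the family's HMM profile.

```python
# Hypothetical pre-submission check; not the website's own code.
MAX_QUERIES = 100  # cap on query sequences stated on this page
# Assumption: 20 standard residues plus common ambiguity/stop symbols.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBXZUO*")


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def looks_like_dna(seq):
    """Heuristic (assumption): nucleotide if over 90 % of letters are A, C, G, T, U or N."""
    if not seq:
        return False
    nucleotide_like = sum(seq.count(base) for base in "ACGTUN")
    return nucleotide_like / len(seq) > 0.9


def check_queries(text):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    records = parse_fasta(text)
    if not records:
        problems.append("no FASTA records found")
    if len(records) > MAX_QUERIES:
        problems.append(f"{len(records)} sequences exceeds the cap of {MAX_QUERIES}")
    for header, seq in records:
        if not seq:
            problems.append(f"{header}: empty sequence")
        elif looks_like_dna(seq):
            problems.append(f"{header}: looks like nucleotide sequence, not amino acid")
        else:
            bad = sorted(set(seq) - AMINO_ACIDS)
            if bad:
                problems.append(f"{header}: unexpected characters {''.join(bad)}")
    return problems
```

Calling `check_queries` on the text of a FASTA file returns a list of messages, so an empty list suggests the file is at least in the expected format.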
Tips: In principle there is no upper limit to the number of query
sequences you can enter in the form but, for the moment, the number of sequences
has been capped at 100 to avoid excessive CPU load. The tree display is capable
of displaying up to about 350 tree tips. The largest tree - currently with over
300 tree tips - takes less than 30 seconds to display using one query sequence
as input; 45 query sequences take about 3 minutes. The returning pages work
correctly with Internet Explorer [version 8.0] and Firefox [version 3.5] using their
default display settings. I like Firefox the best. The returning pages are optimised for a screen
resolution of 1280 by 1024 pixels and a screen size of at least 17 inches (the
bigger the better).
If you want to test out this web form, there are links to
sequence files to the right of the Interrogative Tree Form. It is also possible to
browse a tree by selecting a family from the pull-down menu then simply pressing the
Information On The Phylogenetic Tree And Expression Data Display For Each Protein
The returning display allows the tree to be searched
for a locus identifier or other gene name by using 'Find' from the browser's Edit
menu. For each gene, there is a link on the right-hand side of the tree to the organisation
providing the original data for the predicted gene or to an appropriate accession at NCBI.
All links to other web pages open in the same browser window as the results page, so don't
forget that it's also possible to open these links in a new tab in the same browser window:
right-click with the mouse and select 'Open (link) in New Tab' in the menu. At the bottom
of the tree, there is a key that describes the colours used in the web page.
The top of the page is fixed by a browser frame that contains
links to the following information and files. To download the files, right-click with
the mouse and select 'Save Target/Link As...' in the menu:
- Full alignment display for visualising subgroups - the Perl program that
creates this page adds the query sequences to an 'existing' alignment which contains
the full complement of proteins for the family from Arabidopsis and rice. Each query
sequence should be checked to ensure that its alignment to the rest of the data set
is sensible. Note - at the moment, there is no check that query sequence(s) contain
the domain in question.
Closely related proteins appear as 'neighbours'
in this alignment, so it is quite easy to identify any sequence motifs, present
outside the conserved family domain, that are shared by these proteins, and so
to observe how the family divides into subgroups. These motif regions
of relatively rare amino acid conservation are highlighted in
blue; some of them represent interspecies subgroup-specific motifs and are
highlighted in green. The green highlight is
triggered where neighbouring residues are identical amino acids
that come from species contained in at least two of these sets of species: set 1
(dicots: Arabidopsis, Lotus), set 2 (monocots: rice, Brachypodium), set 3 (a lower plant: moss).
The region of the alignment corresponding to the
protein family domain contains only the match states
of the HMM profile. In this region, where a single amino acid
occupies 90 % or more of a column, that amino acid is highlighted in bold and
slightly larger than the rest of the amino acids in the alignment. The remaining amino
acids in the column are highlighted in red.
- PHYLIP format alignment file used for tree - this file can be used for
bootstrap analysis, although the main subgroups should be well defined
with the distance matrix method used. The alignment is exactly the one shown for
the protein family domain, between the two purple/dark vertical bars in the full
- Newick format file of tree - import this file into a tree drawing
tool e.g. MEGA4 or the PHYLIP
program to obtain a tree picture and graphics file. The resulting
tree will be horizontally 'fatter' than the tree in the web
page because the tree in the web page has been horizontally
squashed by one third.
- Tree in colour, as in web page (.png file)
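The 90 % rule described above for the family domain region of the full alignment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Perl program that builds the page: the function names are hypothetical, and the treatment of gaps (counted in the column length but never reported as the dominant residue) is a guess, since this page does not say how gaps are handled.

```python
from collections import Counter

THRESHOLD = 0.90  # fraction of a column stated on this page


def dominant_residue(column, threshold=THRESHOLD):
    """Return the amino acid occupying at least `threshold` of the column, else None.

    Assumption: gap characters ('-') count towards the column length
    but are never returned as the dominant residue.
    """
    if not column:
        return None
    counts = Counter(res for res in column if res != "-")
    if not counts:
        return None
    residue, n = counts.most_common(1)[0]
    return residue if n / len(column) >= threshold else None


def classify_columns(alignment):
    """Return, for each alignment column, the dominant residue or None.

    `alignment` is a list of aligned sequences of equal length.
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("aligned sequences must have equal length")
    return [dominant_residue(col) for col in zip(*alignment)]
```

In a display built on this sketch, a column with a dominant residue would show that residue in bold and the remaining residues in red, while columns returning `None` would be left unhighlighted.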
A subgroup tree can be generated by deciding on a subgroup
identifier (shown on the tree in purple), selecting the
corresponding radio button, then hitting the vertical green
submit bar. As a general rule, subgroups can be identified on the tree as clusters of
closely related sequences, separated from other subgroups on the tree by a relatively
In order to keep track of new sequences from the literature, e.g.
sequences that are potentially from a single subgroup, you could keep a simple FASTA format file
containing sequences of interest, periodically update it with new sequences, then submit them via The
Interrogative Tree Web Form to observe instantly how all the sequences relate to the existing data.
More detail about subgroup trees to follow below.
Gene expression profiles for each gene are presented in
a table to the right of the tree for a range of tissues and environmental conditions.
Place the mouse pointer over the coloured table cells of the expression
experiments to reveal the normalised natural signal intensity value. More detail
about the expression data to follow below.
The diagram below explains briefly how the query sequences are added
to the tree:
About The Data Sets
The aim for each TF family is to present a tree that contains the full
complement of proteins from two dicot and two monocot model species. These
trees will be sufficient to define the subgroups that are very likely to be
present in a typical higher plant, especially the dicot and monocot lineages.
To aid clarity for large families, a separate tree has been derived from
protein sequences from a representative model higher plant (Arabidopsis) and an
ancestral plant species (moss) so that the subgroups in common between these
diverse plant lineages can be observed easily. For smaller families, all
species' sequences are included in a single tree.
The analysis of each family divides
the proteins into three categories (Category 1, 2 and 3). The phylogenetic trees only contain protein
sequences from Category 1. Unlike Category 2 proteins, Category 1 proteins
contain a sufficiently complete DNA binding domain. Category 3 proteins were
present on long terminal branches (>0.8) in a preliminary tree, so they have been removed
to reduce the chance of these proteins appearing in a misleading position e.g. in
an otherwise well defined subgroup. Category 2 and 3 protein sequences will
be added to the bottom of the Global Alignment Display for inspection, in
due course. The FASTA file of protein sequences in the Datasets Table contains
the proteins present in all three categories.
Also available in the Datasets Table are links to
all the protein sequences of a family, separated out by category and by species,
as well as links to the Hidden Markov Model of the aligned protein domain on which
the trees are based. Faint but detectable homology often exists between
different protein family domains. Where homology exists, the Pfam database organises
these families into clans. Links to these clans at Pfam are included in the Datasets
Table. The Datasets Table is organised based on these clans and other widely appreciated
TF domain fold types rather than, perhaps less informatively, an A-Z list.
The Trees As An 'Anchor' For Displaying Other Data Types
(i) Highlighting Proteins Encoded By Genes Residing On Regions Of The Genome That Have Been Duplicated In An Ancestor
These data are available for genes in Arabidopsis, indicated on the trees by (dark) red dots.
Hover the mouse over the locus ID printed in the table to the right of the tree
to see further information. The data originate from two independent analyses (from 2003 and
about 2008). For some genes, the two datasets have given different results but these
anomalies might be resolved by looking at the tree and the sequence alignments.
The duplication data from the analysis by Blanc, Hokamp and Wolfe (2003)
contain locus IDs that have since changed. For example, if the At1g01010 gene is now considered
to be two genes, this locus ID might have been split into gene At1g01010 and gene At1g01015, but the At1g01015
gene would not be picked out of this data set when using gene IDs from more recent versions of the TAIR
predicted peptides (> version 9). To overcome this potential snag, the locus IDs were truncated by one
digit (as indicated by an asterisk) and the data were searched with the shortened string. This means
that sometimes two or more genes will be picked up, but this is not a problem because the genes in question
will originate from the same place in the genome.
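The truncated search described above can be sketched as follows. This is a minimal Python illustration, not the code used to build the site: the function names are hypothetical, and apart from At1g01010 and At1g01015 (the example given above) the locus IDs in the usage example are made up for demonstration.

```python
def truncate_locus(locus_id):
    """Drop the final digit of a locus ID, e.g. 'At1g01015' becomes 'At1g0101'."""
    if not locus_id or not locus_id[-1].isdigit():
        raise ValueError(f"locus ID does not end in a digit: {locus_id!r}")
    return locus_id[:-1]


def search_duplication_data(current_id, old_ids):
    """Return the entries of the older data set that share the truncated prefix.

    `current_id` is a locus ID from a recent annotation; `old_ids` are the
    locus IDs in the older duplication data set. Matching ignores case.
    """
    prefix = truncate_locus(current_id).upper()
    return [old for old in old_ids if old.upper().startswith(prefix)]


# Usage example with illustrative (partly made-up) locus IDs.
old_data = ["At1g01010", "At1g01020", "At3g12340"]
hits = search_duplication_data("At1g01015", old_data)
```

Here `hits` contains the older entry At1g01010, so the newer At1g01015 gene is still linked to the duplication data for its region of the genome. A search with At1g01010 itself returns the same entry, which is the 'two or more genes' situation mentioned above.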
(ii) Gene Expression Profiles
This diagram explains how the expression intensity values are colour coded: