IT3F Documentation

The purpose of this website: The main aim is to provide an easy way of adding new sequences to existing datasets of plant transcription factors (TFs), and to explore ways of displaying further information so that structurally related genes can be compared directly. Highlighting their similarities and differences should assist with an understanding of gene function.

       A longer-term aim is to highlight TFs that regulate secondary metabolism and provide displays to visualise regulator and target genes together, based on experimental evidence of their correlations in gene expression and other high-throughput data types.

How to interrogate the TF families: This process is described on the home page and is designed to be a very easy procedure, taking away all the fiddly work of performing phylogenetic analysis from scratch. It is important to select a TF family or subfamily that matches your query sequence, otherwise meaningless results will be returned. A mismatch is usually obvious: the query sequence(s) are displaced to one end of the tree with no relationship to any other gene in the tree. In the future, a quick pre-screening procedure will be implemented to ensure that query sequences and families are correctly matched.

       Query sequence(s) must be amino acid sequences in FASTA format and should be as complete as possible over the conserved domain of the family in question. Coverage of about 70 % or more of the domain in the query sequence(s) should be OK, but lower coverage may 'distort' the tree. If your sequences contain unusual deletions in the region of the domain, then that's OK, as long as there is sufficient evolutionary 'signal' in the rest of the sequence matching the domain; unusual insertions in the region of the domain are omitted from the phylogenetic analysis.
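       A query file can be checked along the following lines before it is submitted. This is a minimal Python sketch and not part of the IT3F site: the function names, the accepted alphabet and the nucleotide heuristic are assumptions made for illustration, and the limit of 100 is the cap quoted under Tips.

```python
# Hypothetical pre-submission check; not the code used by the web form.
MAX_QUERIES = 100  # cap on query sequences quoted in the Tips paragraph

# Assumed alphabet: the 20 standard amino acids plus common ambiguity codes.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY") | set("XBZUO*")
# Heuristic only: a sequence made purely of these letters is probably DNA/RNA.
NUCLEOTIDE_LIKE = set("ACGTUN")


def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_queries(text):
    """Return a list of problems; an empty list means the input looks usable."""
    problems = []
    records = read_fasta(text)
    if not records:
        problems.append("no FASTA records found")
    if len(records) > MAX_QUERIES:
        problems.append(f"{len(records)} sequences exceed the cap of {MAX_QUERIES}")
    for header, seq in records:
        if not seq:
            problems.append(f"{header}: empty sequence")
        bad = sorted(set(seq) - ALLOWED)
        if bad:
            problems.append(f"{header}: unexpected characters {''.join(bad)}")
        elif seq and set(seq) <= NUCLEOTIDE_LIKE:
            problems.append(f"{header}: looks like a nucleotide sequence")
    return problems
```

       For example, check_queries(open("my_queries.fasta").read()) returns an empty list when every record passes (the file name is a placeholder). The sketch does not test whether a query covers the family domain; that still has to be judged from the returned alignment.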

Tips: In principle there is no upper limit to the number of query sequences you can enter in the form, but for the moment the number has been capped at 100 to avoid excessive CPU load. The tree display is capable of displaying up to about 350 tree tips. The largest tree, currently with over 300 tree tips, takes less than 30 seconds to display using one query sequence as input; 45 query sequences take about 3 minutes. The returning pages work correctly with Internet Explorer [version 8.0] and Firefox [version 3.5] using their default display settings. I like Firefox the best. The returning pages are optimised for a screen resolution of 1280 by 1024 pixels and a screen size of at least 17 inches (the bigger the better).

       If you want to test out this web form, there are links to sequence files to the right of the Interrogative Tree Web Form. It is also possible to browse a tree by selecting a family from the pulldown menu then simply pressing the submit button.


Information On The Phylogenetic Tree And Expression Data Display For Each Protein Family

The returning display allows the tree to be searched with a locus identifier or other gene name by using 'Find' from the browser's Edit menu. For each gene, there is a link on the right-hand side of the tree to the organisation providing the original data for the predicted gene or to an appropriate accession at NCBI. All links to other web pages open in the same browser window as the results page, so remember that these links can also be opened in a new tab: right-click the link and select 'Open (link) in New Tab' in the menu. At the bottom of the tree, there is a key that describes the colours used in the web page.

       The top of the page is fixed by a browser frame that contains links to the following information and files. To download the files, right-click the link and select 'Save Target/Link As...' in the menu:

  • Full alignment display for visualising subgroups - the Perl program that creates this page adds the query sequences to an 'existing' alignment which contains the full complement of proteins for the family from Arabidopsis and rice. Each query sequence should be checked to ensure that its alignment to the rest of the data set is sensible. Note - at the moment, there is no check that query sequence(s) contain the domain in question.

           Closely related proteins appear as 'neighbours' in this alignment, so it is quite easy to identify any sequence motifs outside the conserved family domain that are shared by these proteins, and so to observe how the family divides into subgroups. These motif regions, where amino acid conservation is relatively rare, are highlighted in blue; those that represent interspecies subgroup-specific motifs are highlighted in green. The green highlight is triggered where neighbouring residues are identical amino acids that come from species contained in at least two of these sets of species: set 1 (dicots: Arabidopsis, Lotus), set 2 (monocots: rice, Brachypodium), set 3 (a lower plant: moss).

           The region of the alignment corresponding to the protein family domain contains only the match states of the HMM profile. In this region, where a single amino acid occupies 90 % or more of a column, that amino acid is shown in bold and slightly larger than the rest of the amino acids in the alignment. The remaining amino acids in the column are highlighted in red.


  • PHYLIP format alignment file used for tree - this file can be used for bootstrap analysis, although the main subgroups should be well defined with the distance matrix method used. The alignment is exactly the one shown for the protein family domain, between the two purple/dark vertical bars in the full alignment display.


  • Newick format file of tree - import this file into a tree drawing tool, e.g. MEGA4 or the PHYLIP DRAWGRAM program, to obtain a tree picture and graphics file. The resulting tree will be horizontally 'fatter' than the tree in the web page because the tree in the web page has been horizontally squashed by one third.


  • Tree in colour, as in web page (.png file)
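
The two highlighting rules described above for the full alignment display (the 90 % rule inside the family domain and the green 'at least two sets of species' rule) can be sketched in Python as follows. This is only an illustration, not the Perl program used by the site: the function names, the dictionary of species sets and the treatment of gap characters are assumptions.

```python
from collections import Counter

# Species sets as listed in the text; the dictionary layout is an assumption.
SPECIES_SETS = {
    "dicot": {"Arabidopsis", "Lotus"},
    "monocot": {"rice", "Brachypodium"},
    "lower plant": {"moss"},
}


def dominant_residue(column, threshold=0.9):
    """Return the amino acid that fills at least `threshold` of the column.

    `column` is a string or list of one-letter codes, one per sequence.
    Gaps ('-') count towards the column size but never as the dominant
    residue (an assumption; the text does not say how gaps are handled).
    Returns None when no amino acid reaches the threshold.
    """
    residue, count = Counter(column).most_common(1)[0]
    if residue != "-" and count / len(column) >= threshold:
        return residue
    return None


def spans_two_sets(species):
    """True if the given species fall into at least two of the three sets."""
    hit = {name for name, members in SPECIES_SETS.items()
           if any(s in members for s in species)}
    return len(hit) >= 2
```

Under these assumptions, a column in which nine of ten sequences carry W returns "W" from dominant_residue, and identical neighbouring residues from Arabidopsis and rice satisfy spans_two_sets, whereas residues from Arabidopsis and Lotus alone do not.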

A subgroup tree can be generated by deciding on a subgroup identifier (shown on the tree in purple), selecting the corresponding radio button, then hitting the vertical green submit bar. As a general rule, subgroups can be identified on the tree as clusters of closely related sequences, separated from other subgroups on the tree by a relatively long branch.

       In order to keep track of new sequences from the literature, e.g. sequences that are potentially from a single subgroup, you could keep a simple FASTA format file containing sequences of interest, periodically update it with new sequences, then submit them via The Interrogative Tree Web Form to see instantly how all the sequences relate to the existing data.

       More detail about subgroup trees to follow below.

Gene expression profiles for each gene are presented in a table to the right of the tree for a range of tissues and environmental conditions. The mouse arrow can be placed over the coloured table cells of the expression experiments to reveal the normalised natural signal intensity value. More detail about the expression data to follow below.

The diagram below explains briefly how the query sequences are added to the tree:


About The Data Sets

The aim for each TF family is to present a tree that contains the full complement of proteins from two dicot and two monocot model species. These trees will be sufficient to define the subgroups that are very likely to be present in a typical higher plant, especially the dicot and monocot lineages. To aid clarity for large families, a separate tree has been derived from protein sequences from a representative model higher plant (Arabidopsis) and an ancestral plant species (moss) so that the subgroups in common between these diverse plant lineages can be observed easily. For smaller families, all species' sequences are included in a single tree.

       The analysis of each family divides the proteins into three categories (Category 1, 2 and 3). The phylogenetic trees only contain protein sequences from Category 1. Unlike Category 2 proteins, Category 1 proteins contain a sufficiently complete DNA binding domain. Category 3 proteins were present on long terminal branches (>0.8) in a preliminary tree, so they have been removed to reduce the chance of these proteins appearing in a misleading position, e.g. in an otherwise well defined subgroup. Category 2 and 3 protein sequences will be added to the bottom of the Global Alignment Display for inspection, in due course. The FASTA file of protein sequences in the Datasets Table contains the proteins present in all three categories.

       Also available in the Datasets Table are links to all the protein sequences of a family, separated out by category and by species, as well as links to the Hidden Markov Model of the aligned protein domain on which the trees are based. There can often exist faint but detectable homology between different protein family domains. Where homology exists, the Pfam database organises these families into clans. Links to these clans at Pfam are included in the Datasets Table. The Datasets Table is organised based on these clans and other widely appreciated TF domain fold types rather than, perhaps less informatively, an A to Z list.
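
       The category assignment described above can be summarised in a short Python sketch. It is not the procedure actually used to build the data sets: the function name, the boolean 'domain_complete' input and the order of the two checks are assumptions, and only the 0.8 branch length cut-off comes from the text.

```python
LONG_BRANCH = 0.8  # terminal branch length cut-off quoted in the text


def categorise(domain_complete, terminal_branch_length):
    """Return 1, 2 or 3 for a protein of a TF family.

    domain_complete: True if the DNA binding domain is sufficiently
        complete (how completeness is judged is not stated in the text).
    terminal_branch_length: length of the protein's terminal branch in
        the preliminary tree.
    """
    if not domain_complete:
        return 2  # incomplete domain: left out of the trees
    if terminal_branch_length > LONG_BRANCH:
        return 3  # long terminal branch: removed to avoid misleading placement
    return 1      # used to build the phylogenetic trees
```

       With these assumptions, categorise(True, 0.3) gives 1, categorise(True, 0.95) gives 3 and categorise(False, 0.1) gives 2.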


The Trees As An 'Anchor' For Displaying Other Data Types

(i) Highlighting Proteins Encoded By Genes Residing On Regions Of The Genome That Have Been Duplicated In An Ancestor

This data is available for genes in Arabidopsis, indicated on the trees by (dark) red dots. Hover the mouse over the locus ID printed in the table to the right of the tree to see further information. The data originates from two independent analyses (from 2003 and about 2008). For some genes, the two datasets have given different results, but these anomalies might be resolved by looking at the tree and the sequence alignments.

       The duplication data from the analysis by Blanc, Hokamp and Wolfe (2003) contain locus IDs that have since changed. For example, if the At1g01010 gene is now considered to be two genes, its locus ID might have been split into At1g01010 and At1g01015, and At1g01015 would not be found in this data set when searching with gene IDs from more recent versions of the TAIR predicted peptides (> version 9). To overcome this potential snag, the locus IDs were truncated by one digit (as indicated by an asterisk) and the data was searched with the shortened string. This sometimes picks up two or more genes, but that is not a problem because the genes in question originate from the same place in the genome.
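
       The truncated locus ID search can be illustrated with a small Python sketch. The function names and the example table are hypothetical and stand in for the 2003 duplication data; only the idea of dropping the last digit and matching on the shortened string is taken from the text.

```python
def truncated_key(locus_id):
    """Drop the final digit of a locus ID, e.g. At1g01015 -> At1g0101."""
    return locus_id[:-1]


def find_duplicated(locus_id, duplication_table):
    """Return the IDs in the older table that share the truncated key.

    duplication_table: iterable of locus IDs from the older analysis
    (a plain list here; the real data layout is not described).
    """
    key = truncated_key(locus_id).lower()
    return [old_id for old_id in duplication_table
            if old_id.lower().startswith(key)]


# Hypothetical extract of locus IDs from the older data set.
old_ids = ["At1g01010", "At1g01020", "At1g01100"]
print(find_duplicated("At1g01015", old_ids))  # matches At1g01010 only
```

       A newer ID such as At1g01015 is therefore still linked to the entry for At1g01010, because both share the shortened string At1g0101.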

(ii) Gene Expression Profiles

This diagram explains how the expression intensity values are colour coded:


The Subgroup Trees

Further detail about the subgroup inner trees will appear here.

The inner tree also comes with a global alignment of the subgroup sequences to show how they are related across their full length. Certain columns are emphasized using the following criteria:
  • bold - columns consisting of one amino acid
  • blue - columns containing 50% or more of the same amino acid
  • red - hypervariable columns within the areas depicted in blue or bold.
The columns in bold and blue are used to derive the subgroup tree. The blue and red columns are the ones that are useful in the phylogeny to infer gene relationships in the tree.
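
The bold and blue criteria can be sketched in Python as shown here. This is an illustration rather than the code behind the subgroup display: the function names and the handling of gaps are assumptions, and the red (hypervariable) class is left out because its exact criterion is not stated above.

```python
from collections import Counter


def column_class(column):
    """Classify one alignment column as 'bold', 'blue' or None.

    Gaps ('-') are ignored when counting amino acids but still count
    towards the column size (an assumption).
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return None
    top = Counter(residues).most_common(1)[0][1]
    if top == len(column):
        return "bold"  # every sequence has the same amino acid
    if top / len(column) >= 0.5:
        return "blue"  # half or more of the sequences share one amino acid
    return None


def tree_columns(alignment):
    """Return the indices of the columns classed as bold or blue.

    alignment: list of equal-length aligned sequences (strings).
    """
    columns = zip(*alignment)  # transpose sequences into columns
    return [i for i, col in enumerate(columns) if column_class(col)]
```

For a toy alignment such as ["MKTAL", "MKSAV", "MRTGI", "MKT-F"], the first four columns are selected under these assumptions and the last, fully variable column is not.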