The integration of efficient data capture and analysis systems in a project of this scope is essential both for the smooth running of the project and for maximising the value of the analysis. The work breaks down into the following areas:
Coordination of mapping and sequencing
Support for sequence determination
Sequence annotation
Comparative sequence analysis
Downstream Integration
Within the lifetime of this project we will be designing specialist tools and software, especially in the area of comparative analysis.
Coordination of mapping and sequencing
Sequence-ready maps will be fingerprinted at the laboratory responsible for mapping each region using the Image and FPC software, within which a minimal tiling path can be selected for sequencing.
Clones from the assembled contigs will be registered with the NCBI clone-registry in order to avoid duplication of effort with other sequencing centres.
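The idea of selecting a minimal tiling path from an assembled contig can be illustrated with a small sketch. This is not how FPC itself is implemented; it is a generic greedy interval-cover routine, and the Clone type, the coordinate units and the min_overlap parameter are assumptions introduced purely for illustration.

```python
from typing import List, NamedTuple


class Clone(NamedTuple):
    """Hypothetical clone record: name plus left/right map positions."""
    name: str
    start: int
    end: int


def minimal_tiling_path(clones: List[Clone], min_overlap: int = 0) -> List[Clone]:
    """Greedy interval cover of the contig.

    Starting from the leftmost clone, repeatedly pick the clone that
    overlaps the current path end (by at least min_overlap units) and
    extends furthest to the right.
    """
    if not clones:
        return []
    # Leftmost start first; among equal starts, the longest clone first.
    ordered = sorted(clones, key=lambda c: (c.start, -c.end))
    contig_end = max(c.end for c in ordered)
    current = ordered[0]
    path = [current]
    while current.end < contig_end:
        candidates = [
            c for c in ordered
            if c.start <= current.end - min_overlap and c.end > current.end
        ]
        if not candidates:
            raise ValueError("gap in clone coverage after " + current.name)
        current = max(candidates, key=lambda c: c.end)
        path.append(current)
    return path


if __name__ == "__main__":
    example = [
        Clone("A", 0, 100),
        Clone("B", 50, 180),
        Clone("C", 90, 220),
        Clone("D", 200, 300),
        Clone("E", 210, 260),
    ]
    print([c.name for c in minimal_tiling_path(example)])
```

In practice the required overlap between neighbouring clones would be set from the fingerprint data, so min_overlap here stands in for whatever criterion the mapping laboratory applies.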
Support for sequence determination
The Hinxton Sequencing Consortium will use the same software systems for assembly and finishing, namely the PHRAP, GAP and CAF packages.
During the sequencing process, assembled contigs longer than 2 kb will be made available by FTP and for sequence searching, and will be submitted automatically to the HTG division of EMBL/GenBank, following the Bermuda agreements for large-scale genomic sequence determination.
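As a rough illustration of the length threshold described above, the following sketch selects contigs longer than 2 kb and formats them as FASTA for release. The function names, the dictionary input and the 60-column line wrapping are assumptions made for the example; the actual release pipeline is not described in this document.

```python
from typing import Dict

MIN_CONTIG_LENGTH = 2000  # the 2 kb release threshold, in bases


def releasable_contigs(contigs: Dict[str, str],
                       min_length: int = MIN_CONTIG_LENGTH) -> Dict[str, str]:
    """Keep only contigs strictly longer than min_length bases."""
    return {name: seq for name, seq in contigs.items() if len(seq) > min_length}


def to_fasta(contigs: Dict[str, str], width: int = 60) -> str:
    """Format contigs as FASTA text with sequence lines wrapped at width."""
    lines = []
    for name, seq in contigs.items():
        lines.append(">" + name)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    assembled = {"contig1": "A" * 2500, "contig2": "C" * 1500}
    print(to_fasta(releasable_contigs(assembled)))
```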
Sequence annotation
Automated sequence annotation will be carried out not only on finished sequence but also on partially finished sequence:
Unfinished Sequence
This will be annotated using the Ensembl programme, a joint EMBL-EBI and Sanger Centre project developed at the Sanger Centre by the Annotation Group.
See http://ensembl.ebi.ac.uk for more information.
Finished Sequence
Once the finished sequence for individual clones has been determined, the primary sequence annotation process carries out computational analyses and database searches to gather information on the biological content of the sequence, after which the sequence is submitted to the main EMBL nucleotide database.
Our explanations of this process are available at http://www.sanger.ac.uk/HGP/Humana/
Comparative sequence analysis
The current strategies for sequence annotation are not optimised for using homologous genomic sequence. However, we expect comparison of the corresponding mouse and human sequence to be the most valuable source of information in the current programme, in particular for confirming gene structure and identifying potential regulatory sequences.
Various computer programs and software systems have been or are being developed for the analysis of this type of data, e.g. PIP, ALFRESCO. We propose to extend this work, in particular by assessing the available methods for their potential for large-scale automated application and by integrating them into the standard analysis workflow and database systems.
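To make the kind of comparison discussed above concrete, the sketch below computes percent identity in sliding windows along a pre-aligned pair of human and mouse sequences and reports windows above a threshold as candidate conserved regions. It is a deliberately simplified illustration and does not reproduce the methods used by PIP or ALFRESCO; the function names, window size, step and threshold are all assumptions.

```python
from typing import List, Tuple


def windowed_identity(human: str, mouse: str,
                      window: int = 100, step: int = 50) -> List[Tuple[int, float]]:
    """Percent identity per window over two aligned, equal-length sequences.

    Gap characters ('-') never count as matches. Returns a list of
    (window start, percent identity) pairs.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    if not human:
        return []
    human, mouse = human.upper(), mouse.upper()
    results = []
    # If the alignment is shorter than one window, score it as a single window.
    last_start = max(len(human) - window, 0)
    for start in range(0, last_start + 1, step):
        h = human[start:start + window]
        m = mouse[start:start + window]
        matches = sum(1 for a, b in zip(h, m) if a == b and a != "-")
        results.append((start, 100.0 * matches / len(h)))
    return results


def conserved_windows(scores: List[Tuple[int, float]],
                      threshold: float = 70.0) -> List[int]:
    """Start positions of windows whose identity meets the threshold."""
    return [start for start, identity in scores if identity >= threshold]


if __name__ == "__main__":
    scores = windowed_identity("ACGTACGTAC", "ACGTTCGTAC", window=5, step=5)
    print(scores)
    print(conserved_windows(scores, threshold=90.0))
```

A window-based score of this kind is cheap enough to run automatically on every aligned region, which is the property that matters when assessing methods for large-scale application.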
Downstream Integration
In order to gain the maximum utility from finished and annotated sequence it will be essential that it is viewed in context with other experimental and genetic information from corresponding mouse and human chromosomal regions. We propose to develop targeted data integration tools that will seek to aid the integration and discovery of links between a number of different data sets, for example:
Mouse mutagenesis - detailed phenotype data derived from genome-wide and targeted mutagenesis programmes worldwide.
Expression profiles - micro-array, in situ and other forms of gene expression data deposited in the Mouse Gene Expression Database.