This example demonstrates the steps involved in adding a new data source to the Locus Variants track. Each database has its own unique traits, so the exact steps and the challenges encountered will vary. This example is based on data that already uses HGVS mutation nomenclature, which makes the task much easier. Lack of an HGVS name is not a showstopper, but it does increase the amount of custom coding needed to prepare the data for loading.
The first step is to contact the database curators, explain the PhenCode project, and see if they are interested. Once permission is given, I explore their website for the information needed to establish what must be done. For example, what numbering system is used to describe the placement of the variants (e.g., where is position 1)? How are introns numbered? What reference sequence is used? If the mutation names are in HGVS format and use the numbering system recommended by HGVS, then mapping the mutations onto the chromosome will be easy. I also identify other data fields that we may want to download (e.g., phenotype), and look at how links back to individual mutations can be created, using existing links at the source as a guide.
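To make the numbering questions concrete, here is a minimal Python sketch of how an HGVS-style coding-DNA substitution name could be split into its parts. It is an illustration under stated assumptions, not the code used for the track: the function name `parse_hgvs_substitution` and the returned field names are hypothetical, the pattern covers only simple substitutions, and the example names are chosen only to show the format. It relies on the HGVS conventions that `c.1` is the A of the ATG start codon, that intronic positions are written as an offset from the nearest exon base (`c.88+2`, `c.89-1`), and that `-` and `*` prefixes mark positions in the 5' and 3' UTRs.

```python
import re

# Simplified pattern for HGVS coding-DNA substitutions, e.g. "c.1222C>T",
# "c.88+2T>G" (intronic), "c.-12G>A" (5' UTR), "c.*32A>G" (3' UTR).
# Deletions, insertions, and other variant types are NOT handled here.
HGVS_SUBST = re.compile(
    r"^c\.(?P<prefix>[-*]?)(?P<base>\d+)(?P<offset>[+-]\d+)?"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)

# Hypothetical labels for the region implied by the position prefix.
REGION = {"": "coding", "-": "5utr", "*": "3utr"}


def parse_hgvs_substitution(name):
    """Return the parts of a simple HGVS substitution name, or None."""
    m = HGVS_SUBST.match(name.strip())
    if m is None:
        return None  # not a simple substitution; needs custom handling
    return {
        "region": REGION[m.group("prefix")],
        "base": int(m.group("base")),            # nearest exon (cDNA) position
        "offset": int(m.group("offset") or 0),   # intronic offset, 0 if exonic
        "ref": m.group("ref"),
        "alt": m.group("alt"),
    }


if __name__ == "__main__":
    for name in ["c.1222C>T", "c.88+2T>G", "c.89-1G>A", "c.-12G>A"]:
        print(name, parse_hgvs_substitution(name))
```

Names that the pattern rejects return `None`, which is one way to flag the records that would need the extra custom coding mentioned earlier.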
If you look at PAHdb, you will see that it already uses HGVS-style nomenclature, and that the reference sequences for genomic DNA and cDNA are available. Using the advanced mutation search, you can download the mutations and relevant fields. The links from the listing to individual mutations are built from the "Reference ID" field. This field is not unique to the mutation, so it cannot serve as the unique identifier for the track. This is not a problem, because the table schema for the track does not require the unique identifier and the accession used by links to be the same. Since all the necessary information is available online, I can download the data and begin reformatting it for loading.
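The separation between a unique identifier and a link accession can be sketched as follows. This is only an illustration: the column name `reference_id`, the output fields `track_id` and `link_accession`, the `PAHdb_` prefix, and the counter-based ID scheme are all assumptions made for the example, not the actual PhenCode table schema or the real export format.

```python
from collections import Counter


def assign_track_ids(rows, prefix="PAHdb"):
    """Give each row a unique track ID while keeping the shared link accession.

    `rows` is a list of dicts with a (hypothetical) "reference_id" key that
    may repeat across mutations.
    """
    seen = Counter()
    out = []
    for row in rows:
        acc = str(row["reference_id"])
        seen[acc] += 1  # number the rows that share one accession
        out.append({
            **row,
            "track_id": f"{prefix}_{acc}_{seen[acc]}",  # unique per row
            "link_accession": acc,                      # reused for links
        })
    return out


if __name__ == "__main__":
    # Made-up rows: two mutations share the same reference ID.
    rows = [
        {"reference_id": 7, "name": "c.1222C>T"},
        {"reference_id": 7, "name": "c.88+2T>G"},
        {"reference_id": 9, "name": "c.89-1G>A"},
    ]
    for r in assign_track_ids(rows):
        print(r["track_id"], r["link_accession"], r["name"])
```

Keeping the two values in separate fields means a link back to the source can still use the original accession even though several track items share it.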