Once we have downloaded the mutation data and the reference sequences
that are used to define the mutations' positions, we need to align the
reference sequences with one or more of the UCSC genome assemblies in
order to translate the given positions to chromosome coordinates.
I use Blat to get these alignments in PSL format. If there are many
reference sequences, the Blat runs can be automated, but in this example
there are only two, so they are easy to do by hand.
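
As an illustration of how a PSL alignment translates a position on a reference sequence into a chromosome coordinate, here is a minimal Python sketch. It is not the script used for the tracks (those are written in Perl). It assumes a standard 21-column, tab-separated PSL line from a nucleotide Blat run, with header lines already skipped; the names Psl, parse_psl_line, and query_to_target are made up for this example.

```python
from typing import List, NamedTuple, Optional


class Psl(NamedTuple):
    """The subset of PSL columns needed for coordinate translation."""
    strand: str
    q_name: str
    q_size: int
    t_name: str
    block_sizes: List[int]
    q_starts: List[int]
    t_starts: List[int]


def _int_list(field: str) -> List[int]:
    # PSL list columns are comma-separated with a trailing comma, e.g. "50,50,".
    return [int(x) for x in field.rstrip(",").split(",")]


def parse_psl_line(line: str) -> Psl:
    """Parse one PSL data line (the caller skips any header lines)."""
    f = line.rstrip("\n").split("\t")
    return Psl(
        strand=f[8],
        q_name=f[9],
        q_size=int(f[10]),
        t_name=f[13],
        block_sizes=_int_list(f[18]),
        q_starts=_int_list(f[19]),
        t_starts=_int_list(f[20]),
    )


def query_to_target(psl: Psl, q_pos: int) -> Optional[int]:
    """Map a 0-based reference-sequence position to a 0-based chromosome
    position. Returns None if the position falls outside the aligned blocks."""
    # For '-' strand alignments, PSL block starts are given on the
    # reverse-complemented query, so flip the position first.
    if psl.strand == "-":
        q_pos = psl.q_size - 1 - q_pos
    for size, q_start, t_start in zip(psl.block_sizes, psl.q_starts, psl.t_starts):
        if q_start <= q_pos < q_start + size:
            return t_start + (q_pos - q_start)
    return None
```

A position that returns None lies in an unaligned part of the reference sequence and would need to be checked by hand.
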
I then write a custom Perl script that
reads in the downloaded mutation data and the alignment, and does most
of the work to generate the table files. This script is generally
similar from one database to the next, but needs to be adjusted to
reflect different input formats, numbering systems, etc. For example,
in the case of PAHdb, the region field from the input
file is used to determine the location (exon, intron, UTR, etc.), but
this might be different for another database. A number of common
utility scripts that do not change from one database to the next
have been factored out for easy reuse; this simplifies the custom
scripts considerably. In general the custom script will: loop over
each mutation in the input file; parse out the HGVS name and send
it to the parseHgvsName2 script to get the chromosome coordinates,
strand, and mutation type; also send the HGVS name to the
sequenceCheck script to make sure that the wild-type sequence
matches the reference sequence; read and compute links, attributes,
and other fields as necessary; and print output lines for the gv,
gvPos, and other tables.
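
To make the wild-type check concrete, here is a small Python sketch of the idea. It is not the parseHgvsName2 or sequenceCheck script, whose interfaces are not shown here. It handles only simple coding substitutions of the form c.4C>T, and it assumes the 0-based offset of the start codon within the reference sequence (cds_start) is already known; all function names are hypothetical.

```python
import re

# Matches only simple coding substitutions such as "c.4C>T".
# Deletions, insertions, intronic offsets, etc. are deliberately not handled.
SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")


def parse_substitution(hgvs):
    """Return (cds_position, wild_type_base, mutant_base), or None if the
    name is not a simple substitution."""
    m = SUBSTITUTION.match(hgvs)
    if m is None:
        return None
    return int(m.group(1)), m.group(2), m.group(3)


def wild_type_matches(hgvs, ref_seq, cds_start):
    """Check that the wild-type base in the HGVS name agrees with ref_seq.

    cds_start is the 0-based index in ref_seq of the A of the ATG start
    codon (an assumption of this sketch), so c.1 maps to ref_seq[cds_start].
    """
    parsed = parse_substitution(hgvs)
    if parsed is None:
        return False
    pos, wild, _mutant = parsed
    idx = cds_start + pos - 1
    return 0 <= idx < len(ref_seq) and ref_seq[idx].upper() == wild
```

A mismatch here usually means the mutation is numbered against a different reference sequence or numbering system than expected, which is worth catching before any table lines are printed.
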
When the script is done running, the data is ready for verification. The positions are put online at a test location as a custom track that can be loaded into the Genome Browser. The mapping can then be confirmed by zooming in and examining the custom track alongside the sequence and genes. Once I have finished checking, I ask the curators of the source database to look it over as well.
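
For the verification step, the mapped positions can be written out as a BED-format custom track. The following Python sketch shows one way to do that; the function name and the track name and description passed to it are placeholders, and the coordinates are assumed to be 0-based and half-open, as BED requires.

```python
def make_custom_track(name, description, mutations):
    """Build the text of a BED custom track.

    mutations is an iterable of (chrom, start, end, label) tuples with
    0-based, half-open coordinates. Labels should not contain whitespace.
    """
    # The track line sets the display name, description, and initial
    # visibility (3 = pack) of the custom track.
    lines = ['track name="%s" description="%s" visibility=3' % (name, description)]
    for chrom, start, end, label in mutations:
        lines.append("%s\t%d\t%d\t%s" % (chrom, start, end, label))
    return "\n".join(lines) + "\n"
```

The resulting text can be saved to a file at the test location and loaded through the Genome Browser's custom track page.
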