.. _features:

********
Features
********


.. currentmodule:: HTSeq

.. doctest:: 
   :hide:

   >>> import HTSeq

The easiest way to work with annotation is to use :class:`GenomicArray` with ``typecode=='O'``
or :class:`GenomicArrayOfSets`. If you have your annotation in a flat file, with each
line describing a feature and giving its coordinates, you can read in the file line for line,
parse it (see the standard Python module ``csv``), use the information on chromosome, start,
end and strand to create a :class:`GenomicInterval` object and then store the data from the line
in the genomic array at the place indicated by the genomic interval.

For example, if you have data in a tab-separated file as follows:

.. doctest::

   >>> for line in open( "feature_list.txt" ):  #doctest:+NORMALIZE_WHITESPACE
   ...     print line,
   chr2  100	300	+	"gene A"
   chr2	200	400	-	"gene B"
   chr3	150	270	+	"gene C"

Then, you could load this information as follows::

   >>> import csv
   >>> genes = HTSeq.GenomicArray( [ "chr1", "chr2", "chr3" ], typecode='O' )
   >>> for (chrom, start, end, strand, name) in \
   ...        csv.reader( open("feature_list.txt"), delimiter="\t" ):
   ...     iv = HTSeq.GenomicInterval( chrom, int(start), int(end), strand )
   ...     genes[ iv ] = name

Now, to see whether there is a feature at a given :class:`GenomicPosition`, you just query the
genomic array::

   >>> print genes[ HTSeq.GenomicPosition( "chr3", 100, "+" ) ]
   None
   >>> print genes[ HTSeq.GenomicPosition( "chr3", 200, "+" ) ]
   gene C

See :class:`GenomicArray` and :class:`GenomicArrayOfSets` for more sophisticated use.


``GFF_Reader`` and ``GenomicFeature``
=====================================

One of the most common format for annotation data is GFF_ (which includes GTF_ as
a sub-type). Hence, a parse for GFF files is included in HTSeq.

.. _GFF: http://www.sanger.ac.uk/resources/software/gff/spec.html
.. _GTF: http://mblab.wustl.edu/GTF22.html

As usual, there is a parser class, called **GFF_Reader**, that can generate an
iterator of objects describing the features. These objects are of type :class`GenomicFeature`
and each describes one line of a GFF file. See Section :ref:`tour` for an example.

.. class:: GFF_Reader( filename_or_sequence, end_included=True )

   As a subclass of :class:`FileOrSequence`, GFF_Reader can be initialized either
   with a file name or with an open file or another sequence of lines.
   
   When requesting an iterator, it generates objects of type :class:`GenomicFeature`.
   
   The GFF specification is unclear on whether the end coordinate marks the last
   base-pair of the feature (closed intervals, ``end_included=True``) or the one
   after (half-open intervals, ``end_included=False``). The default, True, is
   correct for Ensembl GTF files. If in doubt, look for a CDS or stop_codon
   feature in you GFF file. Its length should be divisible by 3. If "end-start"
   is divisible by 3, you need ``end_included=False``. If "end-start+1" is
   divisible by 3, you need ``end_included=True``. 
   
   GFF_Reader will convert the coordinates from GFF standard (1-based, end
   maybe included) to HTSeq standard (0-base, end not included) by subtracting
   1 from the start position, and, for ``end_included=True``, also subtract 1 from
   the end position.
   
      .. attribute:: GFF_Reader.metadata
      
         GFF_Reader skips all lines starting with a single '#' as this marks
         a comment. However, lines starying with '##' contain meta data (at least
         accoring to the Sanger Institute's version of the GFF standard.) Such meta
         data has the format ``##key value``. When a metadata line is encountered,
         it is added to the ``metadata`` dictionary.
         
  
.. class:: GenomicFeature( name, type_, interval )

   A GenomicFeature object always contains the following attributes:
   
      .. attribute:: GenomicFeature.name
      
         A name of ID for the feature. As the GFF format does not have a dedicated
         field for this, the value of the first attribute in the *attributes* column is
         assumed to be the name of ID.
         
      .. attribute:: GenomicFeature.type
      
         The type of the feature, i.e., a string like ``"exon"`` or ``"gene"``. For GFF
         files, the 3rd column (*feature*) is taken as the type.
         
      .. attribute:: GenomicFeature.interval
      
         The interval that the feature covers on the genome. For GFF files, this information is taken
         from the first (*seqname*), the forth (*start*), the fifth (*end*), and the seventh (*strand*)
         column.
         
   When created by a :class:`GFF_Reader` object, the following attributes are also present, with the information
   from the remaining GFF columns:
   
      .. attribute:: GenomicFeature.source
      
         The 2nd column, denoted *source* in the specification, and intended to specify the
         data source.
     
      .. attribute:: GenomicFeature.frame
      
         The 8th column (*frame*), giving the reading frame in case of a coding feature. Its value
         is an integer (0, 1, or 2), or the string ``'.'`` in case that a frame is not specified or would not make sense.
   
      .. attribute:: GenomicFeature.score
      
         The 6th column (*score*), giving some numerical score for the feature. Its value
         is a float, or ``'.'`` in case that a score is not specified or would not make sense
      
      .. attribute:: GenomicFeature.attr
      
         The last (9th) column of a GFF file contains *attributes*, i.e. a list of name/value pairs.
         These are transformed into a dict, such that, e.g., ``gf.attr['gene_id']`` gives the value
         of the attribute ``gene_id`` in the feature described by ``GenomicFeature`` object ``gf``.
         The parser for the attribute field is reasonably flexible to deal with format variations
         (it was never clearly established whetehr name and value should be sperarated by a colon or an
         equal sign, and whether quotes need to be used) and also does a URL style decoding, as is often
         required.

   In order to write a GFF file from a sequence of features, this method is provided:
   
      .. method:: GenomicFeature.get_gff_line( with_equal_sign=False )
   
         Returns a line to describe the feature in the GFF format. This works even if the optional 
         attributes given above are missing. Call this for each of your ``GenomicFeature`` objects
         and write the lines into a file to get a GFF file.
      
.. function:: parse_GFF_attribute_string( attrStr, extra_return_first_value=False )      

   This is the function that :class:`GFF_Reader` uses to parse the attribute column. (See :attr:`GenomicFeature.attr`.)
   It returns a dict, or, if requested, a pair of the dict and the first value.

