Overview
GFF (General Feature Format) is a standard tab-delimited text file format for describing genomic features. It is widely used in bioinformatics and genome annotation to represent genes, transcripts, exons, regulatory elements, and other sequence features. The format has evolved through several versions, with GFF3 being the current standard maintained by the Sequence Ontology consortium.
- GFF1: Original format (deprecated)
- GFF2/GTF: GFF2 and its derivative GTF (Gene Transfer Format), widely used for gene annotations
- GFF3: Current standard with improved structure and semantics
GFF3 Structure
Each line in a GFF3 file contains 9 tab-separated fields:
- seqid: Chromosome or scaffold identifier
- source: Algorithm or procedure that generated the feature
- type: Feature type (gene, mRNA, exon, CDS, etc.)
- start: Start position (1-based, inclusive)
- end: End position (1-based, inclusive)
- score: Numeric value (or . if not applicable)
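- strand: + (forward), - (reverse), . (not stranded), or ? (stranded, but strand unknown)
- phase: For CDS features, the number of bases (0, 1, or 2) to skip from the start of the feature to reach the first base of the next codon; . for all other feature types. Phase is not the same as reading frame.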
- attributes: Semicolon-separated key=value pairs
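The nine columns above can be read with a few lines of standard-library Python. This is a minimal sketch, not a full parser: the names Feature, parse_attributes, and parse_gff3_line are hypothetical, and the sketch assumes one well-formed feature line with percent-encoded reserved characters.

```python
from typing import Dict, List, NamedTuple, Optional
from urllib.parse import unquote


class Feature(NamedTuple):
    """One GFF3 feature line (hypothetical container for this sketch)."""
    seqid: str
    source: str
    type: str
    start: int                # 1-based, inclusive
    end: int                  # 1-based, inclusive
    score: Optional[float]    # None when the column is "."
    strand: str
    phase: Optional[int]      # None when the column is "."
    attributes: Dict[str, List[str]]


def parse_attributes(column: str) -> Dict[str, List[str]]:
    """Split column 9 into a mapping of key -> list of values."""
    attributes: Dict[str, List[str]] = {}
    if column == ".":
        return attributes
    for pair in column.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # A comma separates multiple values of one attribute.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes


def parse_gff3_line(line: str) -> Feature:
    """Parse a single tab-separated feature line into a Feature."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    return Feature(
        seqid=unquote(seqid),
        source=unquote(source),
        type=ftype,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        phase=None if phase == "." else int(phase),
        attributes=parse_attributes(attrs),
    )
```

Attribute values are kept as lists because a single key such as Parent may carry several comma-separated values.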
Key Features
- Hierarchical Relationships: Parent-child relationships via ID and Parent attributes
- Controlled Vocabulary: Uses Sequence Ontology terms for feature types
- Flexible Attributes: Extensible key-value pairs for metadata
- Multi-line Features: A feature with discontinuous coordinates is written as several lines that share the same ID
- Comments and Directives: Comment lines (starting with #) and directives (starting with ##) carry metadata and file structure
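The parent-child relationships described above can be recovered by indexing features under each value of their Parent attribute. The following is a small sketch under stated assumptions: the records in GFF3_TEXT are made up for illustration, and children_of is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Illustrative records only; the IDs and coordinates are invented for this sketch.
GFF3_TEXT = "\n".join([
    "##gff-version 3",
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN",
    "ctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001",
    "ctg123\t.\texon\t1050\t1500\t.\t+\t.\tID=exon00001;Parent=mRNA00001",
    "ctg123\t.\texon\t3000\t3902\t.\t+\t.\tID=exon00002;Parent=mRNA00001",
])


def children_of(text):
    """Map each parent ID to a list of (type, start, end, child ID) tuples."""
    children = defaultdict(list)
    for line in text.splitlines():
        # Skip blank lines, comments (#), and directives (##).
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if pair)
        child = (cols[2], int(cols[3]), int(cols[4]), attrs.get("ID"))
        # A feature may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                children[parent].append(child)
    return children


index = children_of(GFF3_TEXT)
for feature_type, start, end, feature_id in index["mRNA00001"]:
    print(feature_type, start, end, feature_id)
```

Walking from a gene to its transcripts and then to their exons is a matter of repeated lookups in this index.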
Common Feature Types
Gene Structure
- gene: Gene loci
- mRNA: Messenger RNA transcripts
- exon: Exonic regions
- CDS: Coding sequence
- UTR: Untranslated regions (five_prime_UTR, three_prime_UTR)
- intron: Intronic regions
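Because a CDS is often split across several segments, each segment carries a phase that tells a reader how many bases to skip before the next complete codon. The sketch below derives those phases from segment coordinates. It assumes the coding sequence is complete, so the first segment in transcription order has phase 0; cds_phases is a hypothetical helper name.

```python
def cds_phases(segments, strand="+"):
    """Return [((start, end), phase), ...] in transcription order.

    segments: list of (start, end) pairs, 1-based and inclusive.
    Assumes the first transcribed segment begins on a codon boundary.
    """
    # On the minus strand, translation runs from high to low coordinates.
    ordered = sorted(segments, reverse=(strand == "-"))
    result = []
    phase = 0
    for start, end in ordered:
        result.append(((start, end), phase))
        length = end - start + 1
        # Bases left over after the last complete codon of this segment.
        leftover = (length - phase) % 3
        # The next segment must skip the bases that finish that codon.
        phase = (3 - leftover) % 3
    return result


for (start, end), phase in cds_phases([(3301, 3902), (5000, 5500), (7000, 7600)]):
    print(start, end, phase)
```

For the three plus-strand segments in the usage lines, the computed phases are 0, 1, and 1: the first segment is 602 bases long, which leaves 2 bases of an unfinished codon, so the second segment skips 1 base.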
Non-coding Features
- ncRNA: Non-coding RNA transcripts
- tRNA: Transfer RNA
- rRNA: Ribosomal RNA
- miRNA: MicroRNA
- lncRNA: Long non-coding RNA
Regulatory Elements
- promoter: Promoter regions
- enhancer: Enhancer elements
- TF_binding_site: Transcription factor binding sites
Applications
Genome Annotation
- Representing gene models from annotation pipelines
- Storing results from gene prediction algorithms
- Documenting manual curation efforts
- Tracking annotation versions and provenance
Comparative Genomics
- Comparing gene structures across species
- Identifying orthologous features
- Analyzing synteny and gene order
Functional Genomics
- Assigning aligned RNA-seq reads to annotated features
- ChIP-seq peak annotation
- Variant effect prediction (relative to annotated features)
- Expression quantification by feature
Visualization
- Displaying features in genome browsers (IGV, UCSC, JBrowse)
- Creating publication-quality genomic figures
- Interactive exploration of annotations
Parsing and Manipulation
- BEDOPS: Convert and manipulate genomic interval formats
- GenomeTools: GFF3 validation and manipulation
- gffutils: Python library for GFF/GTF files
- rtracklayer: R/Bioconductor package for genomic annotations
Validation
- GenomeTools gff3validator: Validates GFF3 syntax and semantics
- Sequence Ontology tools: Check feature type compliance
Conversion
- GTF ↔ GFF3 conversion
- BED format conversion
- GenBank/EMBL format conversion
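The main pitfall in GFF3 to BED conversion is the coordinate system: GFF3 positions are 1-based and inclusive, while BED positions are 0-based with an exclusive end. A minimal sketch of the conversion for one feature line follows. The function name gff3_line_to_bed is hypothetical, and writing 0 for a missing score is a simplification chosen for this example, not a rule of either format.

```python
def gff3_line_to_bed(line):
    """Convert one GFF3 feature line to a six-column BED line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    seqid, _source, ftype, start, end, score, strand, _phase, attrs = cols

    # Use the ID attribute as the BED name, falling back to the feature type.
    name = ftype
    for pair in attrs.split(";"):
        key, _, value = pair.partition("=")
        if key == "ID":
            name = value

    # 1-based inclusive start -> 0-based start; the end value is unchanged
    # because a 1-based inclusive end equals a 0-based exclusive end.
    bed_start = int(start) - 1
    bed_score = "0" if score == "." else score  # simplification for this sketch
    return "\t".join([seqid, str(bed_start), end, name, bed_score, strand])


print(gff3_line_to_bed(
    "ctg123\t.\texon\t1300\t1500\t.\t+\t.\tID=exon00001;Parent=mRNA00003"
))
```

The feature length is preserved: 1500 - 1300 + 1 in GFF3 equals 1500 - 1299 in BED. Dedicated converters handle details this sketch ignores, such as score scaling and multi-line features.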
Standards and Specifications
- Maintained by: Sequence Ontology Consortium
- Specification: http://www.sequenceontology.org/gff3.shtml
- Feature Types: Based on Sequence Ontology (SO) terms
- File Extension: .gff, .gff3
Best Practices
- Use Sequence Ontology terms for feature types
- Include proper Parent-child relationships for hierarchical features
- Give every feature a unique ID; only the lines of a single discontinuous feature share an ID
- Sort by seqid and start position
- Validate files before distribution
- Document the genome assembly version
- Begin the file with the ##gff-version 3 directive
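Several of the practices above can be checked mechanically before a file is distributed. The sketch below covers only a handful of structural rules (version directive, column count, coordinates, strand, CDS phase, and Parent references) and is not a substitute for a full validator; check_gff3 and its message wording are hypothetical.

```python
VALID_STRANDS = {"+", "-", ".", "?"}


def check_gff3(text):
    """Return a list of human-readable problems found in GFF3 text."""
    problems = []
    lines = text.splitlines()
    if not lines or not lines[0].startswith("##gff-version 3"):
        problems.append("line 1: missing '##gff-version 3' directive")

    seen_ids = set()
    parent_refs = []  # (line number, referenced parent ID)
    for number, line in enumerate(lines, start=1):
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence data
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            problems.append(f"line {number}: expected 9 columns, found {len(cols)}")
            continue
        start, end, strand, phase = cols[3], cols[4], cols[6], cols[7]
        if not (start.isdigit() and end.isdigit()):
            problems.append(f"line {number}: start and end must be positive integers")
        elif int(start) < 1 or int(start) > int(end):
            problems.append(f"line {number}: need 1 <= start <= end")
        if strand not in VALID_STRANDS:
            problems.append(f"line {number}: invalid strand '{strand}'")
        if cols[2] == "CDS" and phase not in {"0", "1", "2"}:
            problems.append(f"line {number}: CDS feature needs phase 0, 1, or 2")
        for pair in cols[8].split(";"):
            key, _, value = pair.partition("=")
            if key == "ID":
                seen_ids.add(value)
            elif key == "Parent":
                parent_refs.extend((number, p) for p in value.split(","))

    # Resolve Parent references only after all IDs have been collected.
    for number, parent in parent_refs:
        if parent not in seen_ids:
            problems.append(f"line {number}: Parent '{parent}' has no matching ID")
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the file conforms to the full specification.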
Integration
GFF files are integrated into:
- Genome browsers (UCSC, Ensembl, NCBI)
- Annotation databases (RefSeq, GENCODE, FlyBase)
- Analysis pipelines (variant annotation, RNA-seq)
- Knowledge graphs (linking genomic features to biological entities)
For more information and specifications, visit http://www.sequenceontology.org/gff3.shtml