John Pool - Population Genomics Lab

If you'd like to see population genetic statistics from DGN alignments in a
genome browser format, or extract subsets of the data, check out PopFly!

RAW DATA CORRECTION (JAN. 2025): The raw read data on the SRA for the EA/EF/EG/FR/SD/SP population genomes added in Lack et al. 2016 were recently found to contain only the aligned reads from DGN assemblies. These data have now been replaced on the SRA with complete read data, as reflected in this bioproject and in this table. Thanks to Jian Lu and Chenlu Liu for noticing an issue with the previously available data.

Drosophila Genome Nexus, Version 1.1
Reference alignments by Justin Lack, Jeremy Lange, and John Pool,
with help from Alison Tang and Russ Corbett-Detig.

3rd Nov. 2016: A further updated inversion spreadsheet is available here. Russ Corbett-Detig added inversion genotype calls in heterozygous regions for the POOL genome set, along with a few other things.

12th July 2016: DGN 1.1 adds 499 more genomes to the previous data object. Other genomes are unchanged.

The lists of data files below now include the following data groups:
BERGMAN - genomes published by Bergman & Haddrill 2015
CLARK - the Global Diversity Lines published in Grenier et al. 2014
NUZHDIN - genomes from both Campo et al. 2013 and Kao et al. 2015
POOL - 306 new genomes from France, Egypt, Ethiopia, and South Africa populations.
SIMULANS - D. simulans 2 aligned to D. melanogaster 5, from Stanley & Kulathinal 2016

The updated resource is further documented in THIS MANUSCRIPT.

** To receive important updates and bug reports for these genomes, JOIN THE E-MAIL LIST

_____________________________________________________________________________

Previous Release: Drosophila Genome Nexus, Version 1.0, 16th June 2014
Reference alignments by Justin Lack and John Pool,
with help from Russ Corbett-Detig, J.J. Emerson, and Chuck Langley.

Update 9SEP2015: VCF SITES files replaced to correct a minor site numbering bug.

Update 18MAY2015: ZI388 X chromosome added to DPGP3 sequence text data.
It had been incorrectly masked before. The individual spreadsheet has been updated to reflect that only ZI382 should be missing an X due to heterozygosity (it had already been properly masked from all data files). Note that the ZI388 X chromosome has no detected admixture. A direct link to the updated ZI388 X sequence file is here. The DPGP3 sequence file package is also updated.

UPDATES 28JAN2015:
* DGN/DPGP3 manuscript accepted by Genetics ("Early Online" now).
* DGN pipeline scripts, etc. are available through a GitHub page.
* Parental strain genomes from the DSPR have been added to the DGN alignments.

NOTE: This data release occurred before release 6 of the Drosophila melanogaster reference. All coordinates are with regard to release 5.

Over the past few years, a large number of Drosophila melanogaster genomes have been sequenced by different labs at different times. When aligning our most recent batch of African D. melanogaster genomes, we wanted to improve at least marginally over previous alignment pipelines, but we also wanted to compare new and old genomes without bias stemming from differing alignment methods. We therefore decided to realign published population genomic data sets using the same pipeline that we chose for our newer data.

Researchers must carefully evaluate the suitability of this data set for their analyses:
* The alignments presented here are mainly aimed toward SNP-oriented population genetic analyses. The pipeline also identifies short indels, but the precision of these variant calls has not been evaluated. Larger indels, inversions, and other structural variants are not addressed.
* Consistent alignments make these genomes more comparable, but important differences still exist among them. Sequencing depth varies considerably among genomes (see INDIVIDUAL SPREADSHEET). Lower depth causes little or no reference sequence bias with our pipeline, but it substantially impacts genomic coverage (higher depth allows more sites to be called in difficult-to-align regions, which may differ in their diversity characteristics).
* With any reference alignment, the success of read mapping may vary due to some genomes being more closely related to the reference strain than others (though our pipeline should reduce this factor). Other biases may exist in sequencing (e.g. relative to GC content) or in alignment (e.g. due to cryptic structural variation).
* Demographic factors, such as recent population admixture or close relatedness between genomes, may bias population genetic inference as well (discussed below). It will be critical for all users of these genomes to have a basic understanding of the species' history and the resulting ways in which populations differ in their diversity levels and allele frequencies.

Our pipeline consists of the following steps (independently for each genome):

Mapping round 1:
* Use BWA followed by Stampy to align reads to the D. melanogaster reference genome.
* Use GATK indel realigner to refine short indel alignments.
* Call round 1 SNPs and indels for this genome.

Reference modification:
* Modify the reference sequence for this genome based on all SNPs and (short) indels called in round 1.

Mapping round 2:
* Use BWA followed by Stampy to align reads to the modified reference sequence.
* Use GATK indel realigner to refine short indel alignments.
* Call round 2 SNPs and indels for this genome.

Consensus sequence generation:
* Generate a reference-numbered consensus sequence by reversing the indel modifications made to the tailored reference sequence between mapping rounds 1 and 2.
* Filter all sites (to N) that are within 3 bp of an indel called in round 1, or that fail to meet alignment base quality thresholds.
* Identify heterozygous genomic regions and filter these entire intervals to N.

A more detailed description of the pipeline is available here: METHODS TEXT.
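
As an illustration of the consensus filtering step above (sites within 3 bp of a round 1 indel are set to N), here is a minimal Python sketch. It is not the pipeline's own code. The function name and the (start, length) representation of indels are hypothetical, and it assumes the 3 bp window is measured from both ends of the indel's reference span, which the text above does not specify.

```python
def mask_near_indels(seq, indels, flank=3):
    """Return seq with every site within `flank` bp of an indel set to N.

    `indels` is a list of (start, length) tuples in 1-based reference
    coordinates. This representation is an assumption for illustration only.
    """
    masked = list(seq)
    for start, length in indels:
        # Window spans the indel itself plus `flank` bp on either side,
        # clipped to the ends of the chromosome arm.
        first = max(1, start - flank)
        last = min(len(seq), start + length - 1 + flank)
        for pos in range(first, last + 1):
            masked[pos - 1] = "N"
    return "".join(masked)
```

For a 16 bp toy sequence with one 2 bp indel starting at position 8, this masks positions 5 through 12 and leaves the rest untouched.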

The D. melanogaster genomes aligned here are from the following sources:
Drosophila Population Genomics Project (DPGP)
   Phase 1 - Malawi extraction lines only (Langley et al. 2012 Genetics)
   Phase 2 - Most from a variety of African populations (Pool et al. 2012 PLoS Genetics)
   Phase 3 - Nearly 200 genomes from a single high diversity and low admixture Zambia
                    population (unpublished, sequenced by the Langley lab)
Drosophila Genetic Reference Panel (DGRP) - ~200 genomes from a North American
   sample (Mackay et al. 2012 Nature)
Drosophila Synthetic Population Resource (DSPR) - parental strain genomes
Additional African genomes sequenced by the Pool lab

Information regarding the populations of origin is here: POPULATION SPREADSHEET.

A summary of the individual genomes is here: INDIVIDUAL SPREADSHEET. Included are depth, coverage, read characteristics, and chromosomal representation. Each genome's origin from inbred, isofemale, or homozygous chromosome line adults, or else a haploid embryo (Langley et al. 2011 Genetics), is also noted.

Citation: For the DGN alignments, please cite Lack et al. 2015 Genetics. Users of these genomes should also always cite the original publications that described them.

The data files we distribute for each of a genome’s chromosome arms are as follows:

* Consensus sequence files (essentially FastA files but lacking a header line). Each file contains the full sequence for one chromosome arm from one genome. The sequence is all on one line, but we provide a script to add line breaks every 1000 bp to all these files, which can be convenient for downstream analysis. These files follow standard reference base numbering (Flybase release 5 of the D. melanogaster genome), and contain only A, C, G, T, and N. Insertions (relative to the reference) have been clipped out of these files, while deletions have been coded as N. Regions of apparent heterozygosity (due to true heterozygosity, structural variation, or technical issues) have been masked to N. These are the files that most researchers are likely to analyze.
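
Because these files contain only A, C, G, T, and N, a quick sanity check after download is to confirm the alphabet and see how much of an arm was actually called. The sketch below is one possible way to do that in Python; the function names are hypothetical, and the reader function simply concatenates lines so that it works with or without added line breaks.

```python
from collections import Counter

VALID_BASES = set("ACGTN")


def read_consensus(path):
    """Read a headerless consensus sequence file into one string."""
    with open(path) as handle:
        return "".join(line.strip() for line in handle)


def summarize_consensus(seq):
    """Count called (non-N) sites and reject unexpected characters."""
    counts = Counter(seq)
    unexpected = set(counts) - VALID_BASES
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    called = len(seq) - counts["N"]
    fraction = called / len(seq) if seq else 0.0
    return {"length": len(seq), "called": called, "fraction_called": fraction}
```

The called fraction gives a rough per-arm view of genomic coverage, which can then be compared against the values in the individual spreadsheet.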

* Indel VCF files. These variant call files summarize the short insertions and deletions called for this genome relative to the reference sequence. The larger indel VCF comes from mapping round 1, so all positions listed correspond to the numbering of the reference genome. We also include the smaller indel VCFs from round 2, with indel start positions adjusted to match the base numbering of the reference genome. We make no effort to resolve any incongruities that may exist between round 1 and round 2 indel calls. Many users may wish to download these files, if only to check for relevant indels in specific genomic regions of interest.

* “All sites” VCF files. These variant call files contain quality and read count information for each reference base, allowing the evidence behind each SNP call to be evaluated. These files are from mapping round 2, but the site positions have been reverted to match reference sequence numbering. These files can be imported into programs such as GATK to create custom consensus sequences with different quality or depth thresholds. They are large files (>500 Mb per genome), and most users may not need to download them.

To avoid distributing multiple versions of the consensus sequences, the files linked here are not yet filtered for identical-by-descent (IBD) regions, or for ancestry/admixture. Applying these filters will often be quite important, so we provide files with masking intervals and a script which allows masking of IBD and/or admixture from a single command line. IBD masking aims to remove the effects of close relatedness among individuals (visible as long identical chromosome tracts between individuals within a population), since most population genetic analyses assume the sampling of unrelated individuals.

For sub-Saharan genomes, we provide the option to mask genomic segments inferred to have recent cosmopolitan (non-African) ancestry. Masking cosmopolitan ancestry allows the user to analyze “African” genetic variation specifically, bringing the data closer to population genetic assumptions regarding well-mixed populations. The admixture detection method and general patterns of cosmopolitan admixture in Africa are described in Pool et al. 2012 PLoS Genetics.

Significant admixture from Africa is found in North American populations such as the DGRP sample (Duchen et al. 2013 Genetics), and also on many inverted chromosome arms from non-African populations (Pool et al. 2012 PLoS Genetics, Corbett-Detig and Hartl 2012 PLoS Genetics). Because the latter studies found that inversions can affect genetic variation over whole chromosome arms, users may wish to exclude inverted arms from some analyses. A list of detected inversions for all genomes is posted here: INVERSION SPREADSHEET
Our scripts do not specifically enable the masking of inversions or other African admixture in cosmopolitan populations. However, tentative ancestry inferences along chromosomes are provided for DGRP and other non-African genomes as part of the MASKING PACKAGE (below).

A package containing IBD and admixture masking intervals, plus a script to mask these regions from downloaded consensus sequences, is here: MASKING PACKAGE (a recommended download for all users). Note that these scripts must be run on Linux or Mac computers (on Linux, disregard error messages that mention 'purge'). Specifically, users should move the unpacked masking files into the same directory as their unpacked consensus sequence files, and then type the commands:
perl ibd_mask_seq.pl
perl admixture_mask_seq.pl
When running either script, unmasked versions of the chromosome arms that have masked intervals are copied into a new subdirectory. If run in the order shown above, there will be an "ibd_unmasked" folder with some unmasked sequence files, and an "admixture_unmasked" folder containing sequence files masked for IBD (if present) but not admixture.
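
For readers who want to see the logic of this step, the Python sketch below mimics the described behavior: back up the unmasked file into a subdirectory, then overwrite the masked intervals with N. It is not a port of the Perl scripts. The function name, the 1-based inclusive (start, stop) interval format, and the file name in the usage note are all assumptions, and it expects the single-line files as downloaded (before any line breaks are added).

```python
import shutil
from pathlib import Path


def mask_sequence_file(seq_path, intervals, backup_dir):
    """Copy the unmasked file to backup_dir, then mask intervals in place.

    `intervals` holds 1-based inclusive (start, stop) pairs; this format is
    assumed for illustration and may differ from the masking package files.
    """
    seq_path = Path(seq_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(seq_path, backup_dir / seq_path.name)

    data = bytearray(seq_path.read_bytes())
    # Ignore a trailing newline so it is never overwritten.
    n_sites = len(data.rstrip(b"\r\n"))
    for start, stop in intervals:
        start, stop = max(start, 1), min(stop, n_sites)
        if start <= stop:
            data[start - 1:stop] = b"N" * (stop - start + 1)
    seq_path.write_bytes(data)
```

Calling mask_sequence_file("demo_Chr2L.seq", [(3, 5)], "ibd_unmasked") on a hypothetical file would leave the original copy in "ibd_unmasked" and replace positions 3 to 5 with N in the working file.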

The masking package includes two other scripts, in case they help with downstream analyses:

breaker.pl inserts line breaks every 1000 bp in all files, which may be helpful for downstream analyses (scripts can then be written to read 1kb at a time from each analyzed genome, which minimizes RAM usage). Don't run this script until IBD and admixture masking is complete.
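
To illustrate the 1 kb at a time reading pattern mentioned above, here is a hedged Python sketch that walks several line-broken sequence files in parallel and counts sites with more than one called base. The function name is hypothetical, and the code assumes every file covers the same chromosome arm with identical line breaks.

```python
from contextlib import ExitStack


def count_segregating_sites(paths):
    """Count sites where the genomes carry two or more distinct called bases.

    Only one line (1 kb after breaker.pl) per genome is held in memory.
    """
    total = 0
    with ExitStack() as stack:
        handles = [stack.enter_context(open(p)) for p in paths]
        # zip over file handles yields the same line number from every file.
        for lines in zip(*handles):
            chunks = [line.rstrip("\n") for line in lines]
            # zip over the chunks yields one alignment column at a time.
            for column in zip(*chunks):
                alleles = set(column) - {"N"}
                if len(alleles) > 1:
                    total += 1
    return total
```

Memory use then scales with the number of genomes rather than with chromosome length, which is the advantage the line-broken format is meant to provide.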

dataslice.pl returns locus-specific FastA files when given a subset of individuals and locus information. Users set the chromosome arm, and start and stop position near the top of the file, along with specifying whether to reverse-complement the sequences, and whether to have 1 FastA output file or separate files for each individual. This script allows users to create small files to analyze with programs such as MEGA, DnaSP, and GenePalette. Don't run this script until line breaks have been added by breaker.pl.
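
The core of what a locus extraction like this involves can be sketched in a few lines of Python. This is an illustration only, not the logic of dataslice.pl itself; the function names and the 60-column FastA wrapping are assumptions, and coordinates are taken as 1-based and inclusive to match reference numbering.

```python
# Complement table for the five characters present in consensus files.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def slice_locus(seq, start, stop, reverse_complement=False):
    """Return positions start..stop (1-based, inclusive) of one genome."""
    locus = seq[start - 1:stop]
    if reverse_complement:
        locus = locus.translate(_COMPLEMENT)[::-1]
    return locus


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FastA text."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Passing one (name, sequence) pair per individual to to_fasta gives a single combined file; calling it once per individual gives separate files.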

Below we link to packages of data files, which are arranged separately by chromosome arm.

Note that our own analysis has focused on the five nuclear, euchromatic chromosome arms (X, 2L, 2R, 3L, and 3R). For other arms, only VCF files are provided; these difficult-to-align regions of the genome will require particular scrutiny prior to any analysis.

Data are also grouped into:
“DPGP2” (also includes Malawi extraction lines from DPGP1, but no Zambia ZI genomes)
“DPGP3” (all Zambia ZI genomes, including the 4 originally published with DPGP2)
“DGRP”
“DSPR”
“AGES” (unpublished African genomes from the Pool Lab)

File packages are further distinguished by “seq” (consensus sequences), “indel” (indel VCFs), and “sites” (all sites VCFs). We encourage most users to focus on the SEQ files, while using indel VCFs as a reference for short insertions and deletions. File links, sizes, and MD5 checksums are given below.
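
Given the size of these packages, it is worth verifying each download against its listed MD5 checksum before unpacking. One way to do so in Python, reading the file in blocks so that multi-gigabyte packages do not need to fit in memory, is sketched below. The function name is hypothetical, and any local file name you pass to it depends on how you saved the download.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

The returned string can be compared directly with the value listed next to the package, for example ce177646b2b840494b7676929fb8cdc2 for DSPR SEQ.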

DPGP2 SEQ   (3.7 Gb, MD5 = d1effb9920a96d936a816b37f92aaeac)
DPGP3 SEQ   (5.6 Gb, MD5 = 906d282740a56e5273a4cbc5abfb61f9)
DGRP SEQ   (4.9 Gb, MD5 = c697730b0720e944ab2be32e391322b0)
DSPR SEQ   (0.4 Gb, MD5 = ce177646b2b840494b7676929fb8cdc2)
AGES SEQ   (1.4 Gb, MD5 = df832c834acb3591e084ce601cc415ac)
CLARK SEQ   (1.8 Gb, MD5 = 9e427d6e6b0b641084d0fc83c8fe10bf)
NUZHDIN SEQ   (1.1 Gb, MD5 = ca78292ab44de4aabd4784c2722dd73b)
POOL SEQ   (4.6 Gb, MD5 = 34f25cf0fc5a9f055cf16817fc59aaac)
SIMULANS SEQ   (30 Mb, MD5 = bb543088056edec3beedf00ed7cc4c63)

DPGP2 INDEL   (424 Mb, MD5 = 8f2b0912f81db0c34a1bb4321a653c56)
DPGP3 INDEL   (690 Mb, MD5 = 888fd43269636ac4b576dff9628d3edc)
DGRP INDEL   (398 Mb, MD5 = 20169ebc06c2e863c06e1f1400606823)
DSPR INDEL (45 Mb, MD5 = 26e23766061b860fc9abb47c4b6b1650)
AGES INDEL   (146 Mb, MD5 = 29253dab7a9144206a34238498eef3e6)
BERGMAN INDEL   (195 Mb, MD5 = 156a62d92cad1cf8a24385e00a996ae7)
CLARK INDEL   (347 Mb, MD5 = 5232f2e961c65380e10d2ffb1df4e4b3)
NUZHDIN INDEL (153 Mb, MD5 = d88773c4b128439dd5f89c33c6cfd460)
POOL INDEL   (1.2 Gb, MD5 = 9d3a006864be1d2598b891e5167f1452)

WARNING: the VCF "SITES" files are NOT the best data object for most users. Using these files may require you to re-implement the filters we included in the SEQ files (around indels, heterozygosity, etc.). Also note that the reference allele differs from one genome to the next, because these files are generated based on modified reference sequences in the second round of mapping (which makes merging these VCFs generally inadvisable). As downloaded, these VCFs are in a reduced-columns format (described here). If you need to re-insert columns to upload individual VCFs into common software (and you don't mind doubling the file size), you can download this package (and run the script from the same directory as the accompanying reference index and your VCFs). Again, we emphasize that these files are for advanced users who are comfortable implementing their own filtering, not for typical users interested in SNP variation.

DPGP2 SITES   (92 Gb, MD5 = 4444e44552814c86365c592f55835351)
DPGP3 SITES   (124 Gb, MD5 = b5602eb67f031c9670267530fdad214d)
DGRP SITES   (105 Gb, MD5 = 6b4923842d20517007600eeb42ab69d4)
DSPR SITES (13 Gb, MD5 = daf9e7c227104f1944731cbdc36b6265)
AGES SITES   (30 Gb, MD5 = d029f975adafb12a41ffc1b952bad2c5)
BERGMAN SITES   (50 Gb, MD5 = baa8bb671c3a905dac5804d6a3e3f5f4)
CLARK SITES   (57 Gb, MD5 = 974bcb8c2cce1dda01ba2eb0c4bfde5b)
NUZHDIN SITES (32 Gb, MD5 = f8c953da1ded11d9709beb41ce7e6008)
POOL SITES   (200 Gb, MD5 = 89088cd37217c72194fb667c990bf86f)

If you want to import the indel VCFs into a program like IGV, you'll need to add a header such as:

##fileformat=VCFv4.1
##source=whatever
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    GENOTYPE

However, programs differ in their specific header requirements (see each program's documentation). Also make sure the chromosome arm numbers match the reference you supply. The chromosome arm numbers in our VCFs are as follows:

1 Yhet
2 mtDNA
3 2L
4 X
5 3L
6 4
7 2R
8 3R
9 Uextra
10 2RHet
11 2LHet
12 3LHet
13 3RHet
14 U
15 XHet
16 Wolbachia
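
The header and the arm numbering above can be applied programmatically. The Python sketch below prepends the example header and, optionally, rewrites the numeric arm codes to arm names. It assumes the indel VCF records are tab-separated with the arm number in the first column; the function name is hypothetical, and whether you want names or numbers depends on the reference you load alongside the VCF.

```python
# Arm numbering as listed above.
ARM_NAMES = {
    1: "Yhet", 2: "mtDNA", 3: "2L", 4: "X", 5: "3L", 6: "4", 7: "2R",
    8: "3R", 9: "Uextra", 10: "2RHet", 11: "2LHet", 12: "3LHet",
    13: "3RHet", 14: "U", 15: "XHet", 16: "Wolbachia",
}

HEADER = [
    "##fileformat=VCFv4.1",
    "##source=whatever",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "GENOTYPE"]),
]


def with_header(records, rename_arms=False):
    """Return header lines followed by the VCF records.

    Existing '#' lines and blank lines are dropped so the header is not
    duplicated. With rename_arms=True, numeric codes become arm names.
    """
    out = list(HEADER)
    for line in records:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if rename_arms and fields[0].isdigit():
            fields[0] = ARM_NAMES.get(int(fields[0]), fields[0])
        out.append("\t".join(fields))
    return out
```

The function accepts any iterable of lines, so an open file handle can be passed in directly and the result written out with one line per list element.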
