vcftools

vcftools(man)                    2 August 2018                   vcftools(man)



NAME
       vcftools v0.1.16 - Utilities for the variant call format (VCF) and
       binary variant call format (BCF)

SYNOPSIS
       vcftools [ --vcf FILE | --gzvcf FILE | --bcf FILE] [ --out OUTPUT
       PREFIX ] [ FILTERING OPTIONS ]  [ OUTPUT OPTIONS ]

DESCRIPTION
       vcftools is a suite of functions for use on genetic variation data in
       the form of VCF and BCF files. The tools are mainly used to summarize
       data, run calculations on data, filter out data, and convert data
       into other useful file formats.

EXAMPLES
       Output allele frequency for all sites in the input vcf file from
       chromosome 1
         vcftools --gzvcf input_file.vcf.gz --freq --chr 1 --out chr1_analysis

       Output a new vcf file from the input vcf file that removes any indel
       sites
         vcftools --vcf input_file.vcf --remove-indels --recode \
           --recode-INFO-all --out SNPs_only

       Output file comparing the sites in two vcf files
         vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz \
           --diff-site --out in1_v_in2

       Output a new vcf file to standard out without any sites that have a
       filter tag, then compress it with gzip
         vcftools --gzvcf input_file.vcf.gz --remove-filtered-all --recode \
           --stdout | gzip -c > output_PASS_only.vcf.gz

       Output a Hardy-Weinberg p-value for every site in the bcf file that
       does not have any missing genotypes
         vcftools --bcf input_file.bcf --hardy --max-missing 1.0 \
           --out output_noMissing

       Output nucleotide diversity at a list of positions
         zcat input_file.vcf.gz | vcftools --vcf - --site-pi \
           --positions SNP_list.txt --out nucleotide_diversity

BASIC OPTIONS
       These options are used to specify the input and output files.

   INPUT FILE OPTIONS
         --vcf <input_filename>
           This option defines the VCF file to be processed. VCFtools expects
           files in VCF format v4.0, v4.1 or v4.2. The latter two are
           supported with some small limitations. If the user provides a dash
           character '-' as a file name, the program expects a VCF file to be
           piped in through standard in.

         --gzvcf <input_filename>
           This option can be used in place of the --vcf option to read
           compressed (gzipped) VCF files directly.

         --bcf <input_filename>
           This option can be used in place of the --vcf option to read BCF2
           files directly. You do not need to specify if this file is
           compressed with BGZF encoding. If the user provides a dash
           character '-' as a file name, the program expects a BCF2 file to be
           piped in through standard in.

   OUTPUT FILE OPTIONS
         --out <output_prefix>
           This option defines the output filename prefix for all files
           generated by vcftools. For example, if <output_prefix> is set to
           output_filename, then all output files will be of the form
           output_filename.*** . If this option is omitted, all output files
           will have the prefix "out." in the current working directory.

         --stdout
         -c
           These options direct the vcftools output to standard out so it can
           be piped into another program or written directly to a filename of
           choice. However, a select few output functions cannot be written to
           standard out.

         --temp <temporary_directory>
           This option can be used to redirect any temporary files that
           vcftools creates into a specified directory.

SITE FILTERING OPTIONS
       These options are used to include or exclude certain sites from any
       analysis being performed by the program.

   POSITION FILTERING
         --chr <chromosome>
         --not-chr <chromosome>
           Includes or excludes sites with identifiers matching <chromosome>.
           These options may be used multiple times to include or exclude more
           than one chromosome.

         --from-bp <integer>
         --to-bp <integer>
           These options specify a lower bound and upper bound for a range of
           sites to be processed. Sites with positions below the --from-bp
           value or above the --to-bp value will be excluded. These options
           can only be used in conjunction with a single usage of --chr.
           Using one of these does not require use of the other.

         --positions <filename>
         --exclude-positions <filename>
           Include or exclude a set of sites on the basis of a list of
           positions in a file. Each line of the input file should contain a
           (tab-separated) chromosome and position. The file can have comment
           lines that start with a "#"; they will be ignored.
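           As a minimal sketch, a positions file in this layout could be built
           as follows. The file name SNP_list.txt follows the EXAMPLES
           section; the coordinates, input_file.vcf and the output prefix are
           invented for illustration.

```shell
# Build a tab-separated positions file: chromosome, then position.
# The coordinates below are invented for illustration.
printf '# chrom\tpos\n'  > SNP_list.txt   # comment line, ignored by vcftools
printf '1\t10583\n'     >> SNP_list.txt
printf '1\t10611\n'     >> SNP_list.txt
printf '2\t20451\n'     >> SNP_list.txt

# Keep only the listed sites. This runs only if vcftools is installed
# and the (hypothetical) input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
  vcftools --vcf input_file.vcf --positions SNP_list.txt \
    --recode --out listed_sites
fi
```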

         --positions-overlap <filename>
         --exclude-positions-overlap <filename>
           Include or exclude a set of sites on the basis of the reference
           allele overlapping with a list of positions in a file. Each line of
           the input file should contain a (tab-separated) chromosome and
           position. The file can have comment lines that start with a "#";
           they will be ignored.

         --bed <filename>
         --exclude-bed <filename>
           Include or exclude a set of sites on the basis of a BED file. Only
           the first three columns (chrom, chromStart and chromEnd) are
           required. The BED file is expected to have a header line. A site
           will be kept or excluded if any part of any allele (REF or ALT) at
           a site is within the range of one of the BED entries.
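           A small BED file of this shape might look like the sketch below.
           The header text, the interval coordinates, and the file names
           regions.bed and input_file.vcf are assumptions made for
           illustration.

```shell
# Build a three-column BED file; the first line is the expected header.
# The interval coordinates are invented for illustration.
printf 'chrom\tchromStart\tchromEnd\n'  > regions.bed
printf '1\t10000\t20000\n'             >> regions.bed
printf '2\t500000\t600000\n'           >> regions.bed

# Keep only sites inside the BED intervals. This runs only if vcftools
# is installed and the (hypothetical) input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
  vcftools --vcf input_file.vcf --bed regions.bed \
    --recode --out in_regions
fi
```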

         --thin <integer>
           Thin sites so that no two sites are within the specified distance
           from one another.

         --mask <filename>
         --invert-mask <filename>
         --mask-min <integer>
           These options are used to specify a FASTA-like mask file to filter
           with. The mask file contains a sequence of integer digits (between
           0 and 9) for each position on a chromosome that specify if a site
           at that position should be filtered or not.
           An example mask file would look like:
             >1
             0000011111222...
             >2
             2222211111000...
           In this example, sites in the VCF file located within the first 5
           bases of the start of chromosome 1 would be kept, whereas sites at
           position 6 onwards would be filtered out. Sites before the 11th
           position on chromosome 2 would be filtered out as well.
           The "--invert-mask" option takes the same format mask file as the
           "--mask" option, but inverts the mask file before filtering with
           it.
           The "--mask-min" option specifies a threshold mask value between 0
           and 9 to filter positions by. The default threshold is 0, meaning
           only sites with that value or lower will be kept.
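           The threshold rule can be checked by hand with a short awk sketch.
           It is not part of vcftools; it only applies the "value at or below
           the threshold is kept" rule to the example mask (without the
           trailing "..."). The file names are invented for illustration.

```shell
# Write the example mask from the text (without the trailing "...").
printf '>1\n0000011111222\n>2\n2222211111000\n' > example_mask.fa

# For each chromosome, print the 1-based positions whose mask digit is
# less than or equal to the threshold (0 is the default for --mask-min).
mask_min=0
awk -v min="$mask_min" '
  /^>/ { chrom = substr($0, 2); pos = 0; next }   # new chromosome record
  {
    for (i = 1; i <= length($0); i++) {
      pos++
      if (substr($0, i, 1) + 0 <= min) print chrom "\t" pos
    }
  }
' example_mask.fa > kept_positions.txt
cat kept_positions.txt
```

           Changing mask_min to 1 in the sketch shows how a higher threshold
           widens the set of positions that pass.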

   SITE ID FILTERING
         --snp <string>
           Include SNP(s) with matching ID (e.g. a dbSNP rsID). This option
           can be used multiple times in order to include more than one SNP.

         --snps <filename>
         --exclude <filename>
           Include or exclude a list of SNPs given in a file. The file should
           contain a list of SNP IDs (e.g. dbSNP rsIDs), with one ID per line.
           No header line is expected.

   VARIANT TYPE FILTERING
         --keep-only-indels
         --remove-indels
           Include or exclude sites that contain an indel. For these options
           "indel" means any variant that alters the length of the REF allele.

   FILTER FLAG FILTERING
         --remove-filtered-all
           Removes all sites with a FILTER flag other than PASS.

         --keep-filtered <string>
         --remove-filtered <string>
           Includes or excludes all sites marked with a specific FILTER flag.
           These options may be used more than once to specify multiple FILTER
           flags.

   INFO FIELD FILTERING
         --keep-INFO <string>
         --remove-INFO <string>
           Includes or excludes all sites with a specific INFO flag. These
           options only filter on the presence of the flag and not its value.
           These options can be used multiple times to specify multiple INFO
           flags.

   ALLELE FILTERING
         --maf <float>
         --max-maf <float>
           Include only sites with a Minor Allele Frequency greater than or
           equal to the "--maf" value and less than or equal to the
           "--max-maf" value. One of these options may be used without the
           other. Allele frequency is defined as the number of times an allele
           appears over all individuals at that site, divided by the total
           number of non-missing alleles at that site.
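           As a worked example of this definition, the sketch below computes
           the minor allele frequency for one hypothetical bi-allelic site.
           The counts are invented for illustration.

```shell
# Hypothetical bi-allelic site: 10 diploid individuals, 2 of them with
# missing genotypes, so 16 non-missing alleles. ALT is seen 4 times.
alt_count=4
non_missing_alleles=16
awk -v a="$alt_count" -v n="$non_missing_alleles" 'BEGIN {
  f = a / n                        # frequency of the ALT allele
  maf = (f < 1 - f) ? f : 1 - f    # the rarer of the two alleles
  printf "%.2f\n", maf
}' > maf.txt
cat maf.txt   # prints 0.25
```

           A site like this one would pass "--maf 0.05 --max-maf 0.3" but
           would be excluded by "--maf 0.3".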

         --non-ref-af <float>
         --max-non-ref-af <float>
         --non-ref-ac <integer>
         --max-non-ref-ac <integer>

         --non-ref-af-any <float>
         --max-non-ref-af-any <float>
         --non-ref-ac-any <integer>
         --max-non-ref-ac-any <integer>
           Include only sites with all Non-Reference (ALT) Allele Frequencies
           (af) or Counts (ac) within the range specified, and including the
           specified value. The default options require all alleles to meet
           the specified criteria, whereas the options appended with "any"
           require only one allele to meet the criteria. The Allele frequency
           is defined as the number of times an allele appears over all
           individuals at that site, divided by the total number of non-
           missing alleles at that site.

         --mac <integer>
         --max-mac <integer>
           Include only sites with Minor Allele Count greater than or equal to
           the "--mac" value and less than or equal to the "--max-mac" value.
           One of these options may be used without the other. Allele count is
           simply the number of times that allele appears over all individuals
           at that site.

         --min-alleles <integer>
         --max-alleles <integer>
           Include only sites with a number of alleles greater than or equal
           to the "--min-alleles" value and less than or equal to the
           "--max-alleles" value. One of these options may be used without the
           other. For example, to include only bi-allelic sites, one could use:
             vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

   GENOTYPE VALUE FILTERING
         --min-meanDP <float>
         --max-meanDP <float>
           Includes only sites with mean depth values (over all included
           individuals) greater than or equal to the "--min-meanDP" value and
           less than or equal to the "--max-meanDP" value. One of these
           options may be used without the other. These options require that
           the "DP" FORMAT tag is included for each site.

         --hwe <float>
           Assesses sites for Hardy-Weinberg Equilibrium using an exact test,
           as defined by Wigginton, Cutler and Abecasis (2005). Sites with a
           p-value below the threshold defined by this option are taken to be
           out of HWE, and therefore excluded.

         --max-missing <float>
           Exclude sites on the basis of the proportion of missing data
           (defined to be between 0 and 1, where 0 allows sites that are
           completely missing and 1 indicates no missing data allowed).
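           The sketch below illustrates one reading of this threshold: it
           assumes the value is compared against the fraction of non-missing
           genotypes at a site, which matches the description of 0 and 1
           above. The counts are invented for illustration.

```shell
# Hypothetical site: 10 individuals, 2 with missing genotypes.
n_indv=10
n_missing=2
for threshold in 0.75 0.9; do
  awk -v n="$n_indv" -v m="$n_missing" -v t="$threshold" 'BEGIN {
    rate = (n - m) / n     # fraction of non-missing genotypes
    print t "\t" rate "\t" (rate >= t ? "kept" : "excluded")
  }'
done > max_missing_demo.txt
cat max_missing_demo.txt
```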

         --max-missing-count <integer>
           Exclude sites with more than this number of missing genotypes over
           all individuals.

         --phased
           Excludes all sites that contain unphased genotypes.

   MISCELLANEOUS FILTERING
         --minQ <float>
           Includes only sites with Quality value above this threshold.

INDIVIDUAL FILTERING OPTIONS
       These options are used to include or exclude certain individuals from
       any analysis being performed by the program.
         --indv <string>
         --remove-indv <string>
           Specify an individual to be kept or removed from the analysis. This
           option can be used multiple times to specify multiple individuals.
           If both options are specified, then the "--indv" option is executed
           before the "--remove-indv" option.

         --keep <filename>
         --remove <filename>
           Provide files containing a list of individuals to either include or
           exclude in subsequent analysis. Each individual ID (as defined in
           the VCF header line) should be included on a separate line. If both
           options are used, then the "--keep" option is executed before the
           "--remove" option. When multiple files are provided, the
           individuals kept are the union of individuals from all keep files
           minus the union of individuals from all remove files. No header
           line is expected.
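           The set arithmetic for multiple files can be previewed outside
           vcftools with standard tools. In this sketch the individual IDs and
           all file names are invented for illustration.

```shell
# Hypothetical individual ID lists: one ID per line, no header.
printf 'NA00001\nNA00002\n' > keep_a.txt
printf 'NA00002\nNA00003\n' > keep_b.txt
printf 'NA00003\n'          > remove_a.txt

# Union of the keep files minus the union of the remove files.
sort -u keep_a.txt keep_b.txt > keep_union.txt
sort -u remove_a.txt          > remove_union.txt
comm -23 keep_union.txt remove_union.txt > retained.txt
cat retained.txt

# The matching vcftools call. This runs only if vcftools is installed
# and the (hypothetical) input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
  vcftools --vcf input_file.vcf --keep keep_a.txt --keep keep_b.txt \
    --remove remove_a.txt --recode --out retained_indv
fi
```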

         --max-indv <integer>
           Randomly thins individuals so that only the specified number are
           retained.

GENOTYPE FILTERING OPTIONS
       These options are used to exclude genotypes from any analysis being
       performed by the program. If excluded, these values will be treated as
       missing.
         --remove-filtered-geno-all
           Excludes all genotypes with a FILTER flag not equal to "." (a
           missing value) or PASS.

         --remove-filtered-geno <string>
           Excludes genotypes with a specific FILTER flag.

         --minGQ <float>
           Exclude all genotypes with a quality below the threshold specified.
           This option requires that the "GQ" FORMAT tag is specified for all
           sites.

         --minDP <float>
         --maxDP <float>
           Includes only genotypes with a depth greater than or equal to the
           "--minDP" value and less than or equal to the "--maxDP" value. This
           option requires that the "DP" FORMAT tag is specified for all
           sites.

OUTPUT OPTIONS
       These options specify which analyses or conversions to perform on the
       data that passed through all specified filters.

   OUTPUT ALLELE STATISTICS
         --freq
         --freq2
           Outputs the allele frequency for each site in a file with the
           suffix ".frq". The second option is used to suppress output of any
           information about the alleles.

         --counts
         --counts2
           Outputs the raw allele counts for each site in a file with the
           suffix ".frq.count". The second option is used to suppress output
           of any information about the alleles.

         --derived
           For use with the previous four frequency and count options only.
           Re-orders the output file columns so that the ancestral allele
           appears first. This option relies on the ancestral allele being
           specified in the VCF file using the AA tag in the INFO field.

   OUTPUT DEPTH STATISTICS
         --depth
           Generates a file containing the mean depth per individual. This
           file has the suffix ".idepth".

         --site-depth
           Generates a file containing the depth per site summed across all
           individuals. This output file has the suffix ".ldepth".

         --site-mean-depth
           Generates a file containing the mean depth per site averaged across
           all individuals. This output file has the suffix ".ldepth.mean".

         --geno-depth
           Generates a (possibly very large) file containing the depth for
           each genotype in the VCF file. Missing entries are given the value
           -1. The file has the suffix ".gdepth".

   OUTPUT LD STATISTICS
         --hap-r2
           Outputs a file reporting the r2, D, and D' statistics using phased
           haplotypes. These are the traditional measures of LD often reported
           in the population genetics literature. The output file has the
           suffix ".hap.ld". This option assumes that the VCF input file has
           phased haplotypes.

         --geno-r2
           Calculates the squared correlation coefficient between genotypes
           encoded as 0, 1 and 2 to represent the number of non-reference
           alleles in each individual. This is the same as the LD measure
           reported by PLINK. The D and D' statistics are only available for
           phased genotypes. The output file has the suffix ".geno.ld".

         --geno-chisq
           If your data contains sites with more than two alleles, then this
           option can be used to test for genotype independence via the chi-
           squared statistic. The output file has the suffix ".geno.chisq".

         --hap-r2-positions <positions list file>
         --geno-r2-positions <positions list file>
           Outputs a file reporting the r2 statistics of the sites contained
           in the provided file versus all other sites. The output files have
           the suffix ".list.hap.ld" or ".list.geno.ld", depending on which
           option is used.

         --ld-window <integer>
           This optional parameter defines the maximum number of SNPs between
           the SNPs being tested for LD in the "--hap-r2", "--geno-r2", and
           "--geno-chisq" functions.

         --ld-window-bp <integer>
           This optional parameter defines the maximum number of physical
           bases between the SNPs being tested for LD in the "--hap-r2",
           "--geno-r2", and "--geno-chisq" functions.

         --ld-window-min <integer>
           This optional parameter defines the minimum number of SNPs between
           the SNPs being tested for LD in the "--hap-r2", "--geno-r2", and
           "--geno-chisq" functions.

         --ld-window-bp-min <integer>
           This optional parameter defines the minimum number of physical
           bases between the SNPs being tested for LD in the "--hap-r2",
           "--geno-r2", and "--geno-chisq" functions.

         --min-r2 <float>
           This optional parameter sets a minimum value for r2, below which
           the LD statistic is not reported by the "--hap-r2", "--geno-r2",
           and "--geno-chisq" functions.

         --interchrom-hap-r2
         --interchrom-geno-r2
           Outputs a file reporting the r2 statistics for sites on different
           chromosomes. The output files have the suffix ".interchrom.hap.ld"
           or ".interchrom.geno.ld", depending on the option used.

   OUTPUT TRANSITION/TRANSVERSION STATISTICS
         --TsTv <integer>
           Calculates the Transition / Transversion ratio in bins of size
           defined by this option. Only uses bi-allelic SNPs. The resulting
           output file has the suffix ".TsTv".

         --TsTv-summary
           Calculates a simple summary of all Transitions and Transversions.
           The output file has the suffix ".TsTv.summary".

         --TsTv-by-count
           Calculates the Transition / Transversion ratio as a function of
           alternative allele count. Only uses bi-allelic SNPs. The resulting
           output file has the suffix ".TsTv.count".

         --TsTv-by-qual
           Calculates the Transition / Transversion ratio as a function of SNP
           quality threshold. Only uses bi-allelic SNPs. The resulting output
           file has the suffix ".TsTv.qual".

         --FILTER-summary
           Generates a summary of the number of SNPs and Ts/Tv ratio for each
           FILTER category. The output file has the suffix ".FILTER.summary".

   OUTPUT NUCLEOTIDE DIVERSITY STATISTICS
         --site-pi
           Measures nucleotide diversity on a per-site basis. The output file
           has the suffix ".sites.pi".

         --window-pi <integer>
         --window-pi-step <integer>
           Measures the nucleotide diversity in windows, with the number
           provided as the window size. The output file has the suffix
           ".windowed.pi". The latter is an optional argument used to specify
           the step size in between windows.

   OUTPUT FST STATISTICS
         --weir-fst-pop <filename>
           This option is used to calculate an Fst estimate from Weir and
           Cockerham's 1984 paper. This is the preferred calculation of Fst.
           The provided file must contain a list of individuals (one
           individual per line) from the VCF file that correspond to one
           population. This option can be used multiple times to calculate Fst
           for more than two populations. These files will also be included as
           "--keep" options. By default, calculations are done on a per-site
           basis. The output file has the suffix ".weir.fst".

         --fst-window-size <integer>
         --fst-window-step <integer>
           These options can be used with "--weir-fst-pop" to do the Fst
           calculations on a windowed basis instead of a per-site basis. These
           arguments specify the desired window size and the desired step size
           between windows.
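           A windowed comparison of two populations could be set up as in the
           sketch below. The individual IDs, the file names, the output prefix
           and the window and step sizes are all invented for illustration.

```shell
# Hypothetical population files: one individual ID per line.
printf 'NA00001\nNA00002\n' > pop1.txt
printf 'NA00003\nNA00004\n' > pop2.txt

# Windowed Weir and Cockerham Fst between the two populations. This
# runs only if vcftools is installed and the (hypothetical) input
# file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
  vcftools --vcf input_file.vcf \
    --weir-fst-pop pop1.txt --weir-fst-pop pop2.txt \
    --fst-window-size 100000 --fst-window-step 50000 \
    --out pop1_v_pop2
fi
```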

   OUTPUT OTHER STATISTICS
         --het
           Calculates a measure of heterozygosity on a per-individual basis.
           Specifically, the inbreeding coefficient, F, is estimated for each
           individual using a method of moments. The resulting file has the
           suffix ".het".

         --hardy
           Reports a p-value for each site from a Hardy-Weinberg Equilibrium
           test (as defined by Wigginton, Cutler and Abecasis (2005)). The
           resulting file (with suffix ".hwe") also contains the Observed
           numbers of Homozygotes and Heterozygotes and the corresponding
           Expected numbers under HWE.

         --TajimaD <integer>
           Outputs Tajima's D statistic in bins with size of the specified
           number. The output file has the suffix ".Tajima.D".

         --indv-freq-burden
           This option calculates the number of variants within each
           individual of a specific frequency. The resulting file has the
           suffix ".ifreqburden".

         --LROH
           This option will identify and output Long Runs of Homozygosity. The
           output file has the suffix ".LROH". This function is experimental,
           and will use a lot of memory if applied to large datasets.

         --relatedness
           This option is used to calculate and output a relatedness statistic
           based on the method of Yang et al, Nature Genetics 2010
           (doi:10.1038/ng.608). Specifically, it calculates the unadjusted
           Ajk statistic. The expectation of Ajk is zero for individuals
           within a population, and one for an individual compared with
           themselves. The output file has the suffix ".relatedness".

         --relatedness2
           This option is used to calculate and output a relatedness statistic
           based on the method of Manichaikul et al., Bioinformatics 2010
           (doi:10.1093/bioinformatics/btq559). The output file has the suffix
           ".relatedness2".

         --site-quality
           Generates a file containing the per-site SNP quality, as found in
           the QUAL column of the VCF file. This file has the suffix ".lqual".

         --missing-indv
           Generates a file reporting the missingness on a per-individual
           basis. The file has the suffix ".imiss".

         --missing-site
           Generates a file reporting the missingness on a per-site basis. The
           file has the suffix ".lmiss".

         --SNPdensity <integer>
           Calculates the number and density of SNPs in bins of size defined
           by this option. The resulting output file has the suffix ".snpden".

         --kept-sites
           Creates a file listing all sites that have been kept after
           filtering. The file has the suffix ".kept.sites".

         --removed-sites
           Creates a file listing all sites that have been removed after
           filtering. The file has the suffix ".removed.sites".

         --singletons
           This option will generate a file detailing the location of
           singletons, and the individual they occur in. The file reports both
           true singletons, and private doubletons (i.e. SNPs where the minor
           allele only occurs in a single individual and that individual is
           homozygous for that allele). The output file has the suffix
           ".singletons".

         --hist-indel-len
           This option will generate a histogram file of the length of all
           indels (including SNPs). It shows both the count and the percentage
           of all indels for indel lengths that occur at least once in the
           input file. SNPs are considered indels with length zero. The output
           file has the suffix ".indel.hist".

         --hapcount <BED file>
           This option will output the number of unique haplotypes within user
           specified bins, as defined by the BED file. The output file has the
           suffix ".hapcount".

         --mendel <PED file>
           This option is used to report Mendelian errors identified in trios.
           The command requires a PLINK-style PED file, with the first four
           columns specifying a family ID, the child ID, the father ID, and
           the mother ID. The output of this command has the suffix ".mendel".

         --extract-FORMAT-info <string>
           Extract information from the genotype fields in the VCF file
           relating to a specified FORMAT identifier. The resulting output
           file has the suffix ".<FORMAT_ID>.FORMAT". For example, the
           following command would extract all of the GT (i.e. Genotype)
           entries:
             vcftools --vcf file1.vcf --extract-FORMAT-info GT

         --get-INFO <string>
           This option is used to extract information from the INFO field in
           the VCF file. The <string> argument specifies the INFO tag to be
           extracted, and the option can be used multiple times in order to
           extract multiple INFO entries. The resulting file, with suffix
           ".INFO", contains the required INFO information in a tab-separated
           table. For example, to extract the NS and DB flags, one would use
           the command:
             vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB

   OUTPUT VCF FORMAT
         --recode
         --recode-bcf
           These options are used to generate a new file in either VCF or BCF
           from the input VCF or BCF file after applying the filtering options
           specified by the user. The output file has the suffix ".recode.vcf"
           or ".recode.bcf". By default, the INFO fields are removed from the
           output file, as the INFO values may be invalidated by the recoding
           (e.g. the total depth may need to be recalculated if individuals
           are removed). This behavior may be overridden by the following
           options. By default, BCF files are written out as BGZF compressed
           files.

         --recode-INFO <string>
         --recode-INFO-all
           These options can be used with the above recode options to define
           an INFO key name to keep in the output file. This option can be
           used multiple times to keep more of the INFO fields. The second
           option is used to keep all INFO values in the original file.

         --contigs <string>
           This option can be used in conjunction with the --recode-bcf option
           when the input file does not have any contig declarations. This
           option expects a file name with one contig header per line. These
           lines are included in the output file.

   OUTPUT OTHER FORMATS
         --012
           This option outputs the genotypes as a large matrix. Three files
           are produced. The first, with suffix ".012", contains the genotypes
           of each individual on a separate line. Genotypes are represented as
           0, 1 and 2, where the number represents the number of non-reference
           alleles. Missing genotypes are represented by -1. The second file,
           with suffix ".012.indv" details the individuals included in the
           main file. The third file, with suffix ".012.pos" details the site
           locations included in the main file.

         --IMPUTE
           This option outputs phased haplotypes in IMPUTE reference-panel
           format. As IMPUTE requires phased data, using this option also
           implies --phased. Unphased individuals and genotypes are therefore
           excluded. Only bi-allelic sites are included in the output. Using
           this option generates three files. The IMPUTE haplotype file has
           the suffix ".impute.hap", and the IMPUTE legend file has the suffix
           ".impute.hap.legend". The third file, with suffix
           ".impute.hap.indv", details the individuals included in the
           haplotype file, although this file is not needed by IMPUTE.

         --ldhat
         --ldhelmet
         --ldhat-geno
           These options output data in LDhat/LDhelmet format. These options
           require the "--chr" filter option to also be used. The first two
           options output phased data only, and therefore also imply
           "--phased", leading to unphased individuals and genotypes being
           excluded. For LDhelmet, only SNPs will be considered, and therefore
           it implies "--remove-indels". The third option treats all of the
           data as unphased, and therefore outputs LDhat files in
           genotype/unphased format. For LDhat, two output files are generated
           with the suffixes ".ldhat.sites" and ".ldhat.locs", which
           correspond to the LDhat "sites" and "locs" input files
           respectively; for LDhelmet, the two files generated have the
           suffixes ".ldhelmet.snps" and ".ldhelmet.pos", which correspond to
           the "SNPs" and "positions" files.

         --BEAGLE-GL
         --BEAGLE-PL
           These options output genotype likelihood information for input into
           the BEAGLE program. The VCF file is required to contain FORMAT
           fields with "GL" or "PL" tags, which can generally be output by SNP
           callers such as the GATK. Use of this option requires a chromosome
           to be specified via the "--chr" option. The resulting output file
           has the suffix ".BEAGLE.GL" or ".BEAGLE.PL" and contains genotype
           likelihoods for biallelic sites. This file is suitable for input
           into BEAGLE via the "like=" argument.

         --plink
         --plink-tped
         --chrom-map
           These options output the genotype data in PLINK PED format. With
           the first option, two files are generated, with suffixes ".ped" and
           ".map". Note that only bi-allelic loci will be output. Further
           details of these files can be found in the PLINK documentation.
           Note: The first option can be very slow on large datasets. Using
           the --chr option to divide up the dataset is advised, or
           alternatively use the --plink-tped option which outputs the files
           in the PLINK transposed format with suffixes ".tped" and ".tfam".
           For usage with variant sites in species other than humans, the
           --chrom-map option may be used to specify a file name that has a
           tab-delimited mapping of chromosome name to a desired integer value
           with one line per chromosome. This file must contain a mapping for
           every chromosome value found in the input file.
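           A chromosome map of this shape might be written as in the sketch
           below. The chromosome names, the file names and the output prefix
           are invented for illustration.

```shell
# Hypothetical mapping for a non-human genome: chromosome name, a tab,
# then the integer to use in the PLINK output. One line per chromosome.
printf 'chrI\t1\n'    > chrom_map.txt
printf 'chrII\t2\n'  >> chrom_map.txt
printf 'chrIII\t3\n' >> chrom_map.txt

# Write PLINK PED/MAP files using the mapping. This runs only if
# vcftools is installed and the (hypothetical) input file exists.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
  vcftools --vcf input_file.vcf --plink --chrom-map chrom_map.txt \
    --out plink_out
fi
```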

COMPARISON OPTIONS
       These options are used to compare the original variant file to another
       variant file and output the results. All of the diff functions require
       both files to contain the same chromosomes and that the files be sorted
       in the same order. If one of the files contains chromosomes that the
       other file does not, use the --not-chr filter to remove them from the
       analysis.

   DIFF VCF FILE
         --diff <filename>
         --gzdiff <filename>
         --diff-bcf <filename>
           These options compare the original input file to this specified
           VCF, gzipped VCF, or BCF file. These options must be specified with
           one additional option described below in order to specify what type
           of comparison is to be performed. See the examples section for
           typical usage.

   DIFF OPTIONS
         --diff-site
           Outputs the sites that are common / unique to each file. The output
           file has the suffix ".diff.sites_in_files".

         --diff-indv
           Outputs the individuals that are common / unique to each file. The
           output file has the suffix ".diff.indv_in_files".

         --diff-site-discordance
           This option calculates discordance on a site by site basis. The
           resulting output file has the suffix ".diff.sites".

         --diff-indv-discordance
           This option calculates discordance on a per-individual basis. The
           resulting output file has the suffix ".diff.indv".

         --diff-indv-map <filename>
           This option allows the user to specify a mapping of individual IDs
           in the second file to those in the first file. The program expects
           the file to contain a tab-delimited line containing an individual's
           name in file one followed by that same individual's name in file
           two with one mapping per line.
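           As a sketch of the expected layout, the mapping file below pairs
           each name in file one with its name in file two. The individual
           IDs, the file names and the output prefix are invented for
           illustration.

```shell
# Hypothetical ID mapping: name in file one, a tab, name in file two.
printf 'NA00001\tsample_A\n'  > indv_map.txt
printf 'NA00002\tsample_B\n' >> indv_map.txt

# Per-individual discordance using the mapping. This runs only if
# vcftools is installed and both (hypothetical) input files exist.
if command -v vcftools >/dev/null 2>&1 &&
   [ -f input_file1.vcf ] && [ -f input_file2.vcf ]; then
  vcftools --vcf input_file1.vcf --diff input_file2.vcf \
    --diff-indv-map indv_map.txt --diff-indv-discordance \
    --out in1_v_in2
fi
```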

         --diff-discordance-matrix
           This option calculates a discordance matrix. This option only works
           with bi-allelic loci with matching alleles that are present in both
           files. The resulting output file has the suffix
           ".diff.discordance.matrix".

         --diff-switch-error
           This option calculates phasing errors (specifically "switch
           errors"). This option creates an output file describing switch
           errors found between sites, with suffix ".diff.switch".

AUTHORS
       Adam Auton
       Anthony Marcketta


