Use of pipe character in VCF info field


While annotating my VCF file with ClinVar, I noted the following value for the CLNSIG field (i.e., clinical significance):


This value was actually reported in a real file, so I think the usage of the pipe is quite common for genomic annotations of this kind.

What is the use of the pipe (|) character? I looked for it in the VCF file format specification; however, only the meaning of the comma (,) is specified (i.e., the presence of multiple alternate values for a field). I am wondering, instead, what the pipe represents.

Thank you.
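For context: the pipe is not assigned a meaning by the VCF specification itself. Since the comma is reserved for multiple (per-allele) values, annotation tools reuse | as a second-level sub-delimiter inside a single value (older ClinVar releases used it within CLNSIG, and VEP's CSQ and SnpEff's ANN fields use it between sub-fields); its meaning should be described in the corresponding ##INFO header line. A minimal sketch of such two-level parsing (the example value is hypothetical):

```python
def split_info_value(value, sep=",", subsep="|"):
    """Split a VCF INFO value first on the spec-defined comma,
    then on the tool-defined pipe sub-delimiter."""
    return [item.split(subsep) for item in value.split(sep)]

# Hypothetical CLNSIG-style value: commas separate top-level values,
# pipes separate sub-values within each one.
value = "Pathogenic|Likely_pathogenic,Benign"
print(split_info_value(value))
# → [['Pathogenic', 'Likely_pathogenic'], ['Benign']]
```

The exact semantics of each pipe-separated slot are tool-specific, so always consult the ##INFO description emitted by the annotator.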


bedtools getfasta extracts sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file.

1. The headers in the input FASTA file must exactly match the chromosome column in the BED file.

2. You can use the UNIX fold command to set the line width of the FASTA output. For example, fold -w 60 will make each line of the FASTA file have at most 60 nucleotides for easy viewing.

3. BED files containing a single region require a newline character at the end of the line, otherwise a blank output file is produced.
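Point 3 is easy to guard against programmatically; a small sketch (the helper name is ours, not part of bedtools) that appends the missing final newline only when needed:

```python
import os
import tempfile

def ensure_trailing_newline(path):
    """Append a newline if the file's last byte is not one.

    Guards against the single-region BED pitfall noted above, where a
    missing final newline makes bedtools getfasta emit an empty file.
    """
    with open(path, "rb+") as fh:
        fh.seek(0, 2)               # jump to end of file
        if fh.tell() == 0:
            return                  # empty file: nothing to fix
        fh.seek(-1, 2)
        if fh.read(1) != b"\n":
            fh.write(b"\n")

# Demo on a throwaway single-region BED file written without a final newline.
fd, path = tempfile.mkstemp(suffix=".bed")
os.close(fd)
with open(path, "w") as fh:
    fh.write("chr1\t100\t200")
ensure_trailing_newline(path)
with open(path) as fh:
    fixed = fh.read()
os.remove(path)
print(repr(fixed))                  # → 'chr1\t100\t200\n'
```

The function is idempotent, so it is safe to run on already well-formed BED files.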


Each command has its own man page which can be viewed using e.g. man samtools-view or with a recent GNU man using man samtools view. Below we have a brief summary of syntax and sub-command description.

Options common to all sub-commands are documented below in the GLOBAL COMMAND OPTIONS section.

With no options or regions specified, prints all alignments in the specified input alignment file (in SAM, BAM, or CRAM format) to standard output in SAM format (with no header by default).

You may specify one or more space-separated region specifications after the input filename to restrict output to only those alignments which overlap the specified region(s). Use of region specifications requires a coordinate-sorted and indexed input file.

Options exist to change the output format from SAM to BAM or CRAM, so this command also acts as a file format conversion utility.

samtools tview [-p chr:pos] [-s STR] [-d display] <in.sorted.bam> [ref.fasta]

Text alignment viewer (based on the ncurses library). In the viewer, press `?' for help and press `g' to check the alignment start from a region in the format like `chr10:10,000,000' or `=10,000,000' when viewing the same reference sequence.

Quickly check that input files appear to be intact. Checks that beginning of the file contains a valid header (all formats) containing at least one target sequence and then seeks to the end of the file and checks that an end-of-file (EOF) is present and intact (BAM only).

Data in the middle of the file is not read since that would be much more time consuming, so please note that this command will not detect internal corruption, but is useful for testing that files are not truncated before performing more intensive tasks on them.

This command will exit with a non-zero exit code if any input files don't have a valid header or are missing an EOF block. Otherwise it will exit successfully (with a zero exit code).

samtools index [-bc] [-m INT] aln.sam.gz|aln.bam|aln.cram [out.index]

Index a coordinate-sorted SAM, BAM or CRAM file for fast random access. Note for SAM this only works if the file has been BGZF compressed first.

This index is needed when region arguments are used to limit samtools view and similar commands to particular regions of interest.

If an output filename is given, the index file will be written to out.index. Otherwise, for a CRAM file aln.cram, index file aln.cram.crai will be created; for a BAM or SAM file aln.bam, either aln.bam.bai or aln.bam.csi will be created, depending on the index format selected.

samtools sort [-l level] [-m maxMem] [-o out.bam] [-O format] [-n] [-t tag] [-T tmpprefix] [-@ threads] [in.sam|in.bam|in.cram]

Sort alignments by leftmost coordinates, or by read name when -n is used. An appropriate @HD-SO sort order header tag will be added or an existing one updated if necessary.

The sorted output is written to standard output by default, or to the specified file (out.bam) when -o is used. This command will also create temporary files tmpprefix.%d.bam as needed when the entire alignment data cannot fit into memory (as controlled via the -m option).

Consider using samtools collate instead if you need name collated data without a full lexicographical sort.

Shuffles and groups reads together by their names. A faster alternative to a full query name sort, collate ensures that reads of the same name are grouped together in contiguous groups, but doesn't make any guarantees about the order of read names between groups.

The output from this command should be suitable for any operation that requires all reads from the same template to be grouped together.

Retrieve and print stats in the index file corresponding to the input file. Before calling idxstats, the input BAM file should be indexed by samtools index.

If run on a SAM or CRAM file or an unindexed BAM file, this command will still produce the same summary statistics, but does so by reading through the entire file. This is far slower than using the BAM indices.

The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads. It is written to stdout.

Does a full pass through the input file to calculate and print statistics to stdout.

Provides counts for each of 13 categories based primarily on bit flags in the FLAG field. Each category in the output is broken down into QC pass and QC fail, which is presented as "#PASS + #FAIL" followed by a description of the category.

Convert between textual and numeric flag representation.


0x1    PAIRED         paired-end (or multiple-segment) sequencing technology
0x2    PROPER_PAIR    each segment properly aligned according to the aligner
0x4    UNMAP          segment unmapped
0x8    MUNMAP         next segment in the template unmapped
0x10   REVERSE        SEQ is reverse complemented
0x20   MREVERSE       SEQ of the next segment in the template is reverse complemented
0x40   READ1          the first segment in the template
0x80   READ2          the last segment in the template
0x100  SECONDARY      secondary alignment
0x200  QCFAIL         not passing quality controls
0x400  DUP            PCR or optical duplicate
0x800  SUPPLEMENTARY  supplementary alignment
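The bit flags above combine by addition, so a FLAG value can be decoded by testing each bit. A minimal sketch (the table is transcribed from above; the helper name is ours):

```python
# SAM FLAG bits, as listed in the table above.
FLAGS = {
    0x1: "PAIRED", 0x2: "PROPER_PAIR", 0x4: "UNMAP", 0x8: "MUNMAP",
    0x10: "REVERSE", 0x20: "MREVERSE", 0x40: "READ1", 0x80: "READ2",
    0x100: "SECONDARY", 0x200: "QCFAIL", 0x400: "DUP",
    0x800: "SUPPLEMENTARY",
}

def decode_flag(flag):
    """Return the symbolic names of the bits set in a SAM FLAG value."""
    return [name for bit, name in FLAGS.items() if flag & bit]

# 99 = 0x1 + 0x2 + 0x20 + 0x40: a properly paired first-of-pair read
# whose mate is reverse complemented.
print(decode_flag(99))
# → ['PAIRED', 'PROPER_PAIR', 'MREVERSE', 'READ1']
```

This mirrors what samtools flags prints when converting between numeric and textual representations.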

samtools stats collects statistics from BAM files and outputs in a text format. The output can be visualized graphically using plot-bamstats.

Reports the total read base count (i.e. the sum of per base read depths) for each genomic region specified in the supplied BED file. The regions are output as they appear in the BED file and are 0-based. Counts for each alignment file supplied are reported in separate columns.

Computes the read depth at each position or region.

samtools ampliconstats collects statistics from one or more input alignment files and produces tables in text format. The output can be visualized graphically using plot-ampliconstats.

The alignment files should have previously been clipped of primer sequence, for example by samtools ampliconclip and the sites of these primers should be specified as a bed file in the arguments.

samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

Generate textual pileup for one or multiple BAM files. For VCF and BCF output, please use the bcftools mpileup command instead. Alignment records are grouped by sample (SM) identifiers in @RG header lines. If sample identifiers are absent, each input file is regarded as one sample.

See the samtools-mpileup man page for a description of the pileup format and options.

Produces a histogram or table of coverage per chromosome.

samtools merge [-nur1f] [-h inh.sam] [-t tag] [-R reg] [-b list] out.bam in1.bam [in2.bam in3.bam ... inN.bam]

Merge multiple sorted alignment files, producing a single sorted output file that contains all the input records and maintains the existing sort order.

If -h is specified the @SQ headers of input files will be merged into the specified header, otherwise they will be merged into a composite header created from the input headers. If the @SQ headers differ in order this may require the output file to be re-sorted after merge.

The ordering of the records in the input files must match the usage of the -n and -t command-line options. If they do not, the output order will be undefined. See sort for information about record ordering.

samtools split [options] merged.sam|merged.bam|merged.cram

Splits a file by read group, producing one or more output files matching a common prefix (by default based on the input filename) each containing one read-group.

samtools cat [-b list] [-h header.sam] [-o out.bam] in1.bam in2.bam [ ... ]

Concatenate BAMs or CRAMs. Although this works on either BAM or CRAM, all input files must be the same format as each other. The sequence dictionary of each input file must be identical, although this command does not check this. This command uses a similar trick to reheader which enables fast BAM concatenation.

samtools fastq [options] in.bam
samtools fasta [options] in.bam

Converts a BAM or CRAM into either FASTQ or FASTA format depending on the command invoked. The files will be automatically compressed if the file names have a .gz or .bgzf extension.

The input to this program must be collated by name. Use samtools collate or samtools sort -n to ensure this.

samtools faidx <ref.fasta> [region1 [...]]

Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format.

The input file can be compressed in the BGZF format.

FASTQ files can be read and indexed by this command. Without using --fastq any extracted subsequence will be in FASTA format.

samtools fqidx <ref.fastq> [region1 [...]]

Index reference sequence in the FASTQ format or extract subsequence from indexed reference sequence. If no region is specified, fqidx will index the file and create <ref.fastq>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTQ format.

The input file can be compressed in the BGZF format.

samtools fqidx should only be used on fastq files with a small number of entries. Trying to use it on a file containing millions of short sequencing reads will produce an index that is almost as big as the original file, and searches using the index will be very slow and use a lot of memory.

samtools dict ref.fasta|ref.fasta.gz

Create a sequence dictionary file from a fasta file.

samtools calmd [-Eeubr] [-C capQcoef] aln.bam ref.fasta

Generate the MD tag. If the MD tag is already present, this command will give a warning if the MD tag generated is different from the existing tag. Output SAM by default.

Calmd can also read and write CRAM files although in most cases it is pointless as CRAM recalculates MD and NM tags on the fly. The one exception to this case is where both input and output CRAM files have been / are being created with the no_ref option.

samtools fixmate [-rpcm] [-O format] in.nameSrt.bam out.bam

Fill in mate coordinates, ISIZE and mate related flags from a name-sorted alignment.

samtools markdup [-l length] [-r] [-s] [-T] [-S] in.algsort.bam out.bam

Mark duplicate alignments from a coordinate sorted file that has been run through samtools fixmate with the -m option. This program relies on the MC and ms tags that fixmate provides.

samtools rmdup [-sS] <input.srt.bam> <out.bam>

This command is obsolete. Use markdup instead.

samtools addreplacerg [-r rg-line | -R rg-ID] [-m mode] [-l level] [-o out.bam] in.bam

Adds or replaces read group tags in a file.

samtools reheader [-iP] in.header.sam in.bam

Replace the header in in.bam with the header in in.header.sam. This command is much faster than replacing the header with a BAM→SAM→BAM conversion.

By default this command outputs the BAM or CRAM file to standard output (stdout), but for CRAM format files it has the option to perform an in-place edit, both reading and writing to the same file. No validity checking is performed on the header, nor that it is suitable to use with the sequence data itself.

samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] in.bam

This command identifies target regions by examining the continuity of read depth, computes haploid consensus sequences of targets and outputs a SAM with each sequence corresponding to a target. When option -f is in use, BAQ will be applied. This command is only designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)].

samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] in.bam

Call and phase heterozygous SNPs.

samtools depad [-SsCu1] [-T ref.fa] [-o output] in.bam

Converts a BAM aligned against a padded reference to a BAM aligned against the depadded reference. The padded reference may contain verbatim "*" bases in it, but "*" bases are also counted in the reference numbering. This means that a sequence base-call aligned against a reference "*" is considered to be a cigar match ("M" or "X") operator (if the base-call is "A", "C", "G" or "T"). After depadding, the reference "*" bases are deleted and such aligned sequence base-calls become insertions. Similar transformations apply for deletions and padding cigar operations.

samtools ampliconclip [-o out.file] [-f stat.file] [--soft-clip] [--hard-clip] [--both-ends] [--strand] [--clipped] [--fail] [--no-PG] -b bed.file in.file

Clip reads in a SAM compatible file based on data from a BED file.


Overview of the vcfanno functionality

Vcfanno annotates variants in a VCF file (the “query” intervals) with information aggregated from the set of intersecting intervals among many different annotation files (the “database” intervals) stored in common genomic formats such as BED, GFF, GTF, VCF, and BAM. It utilizes a “streaming” intersection algorithm that leverages sorted input files to greatly reduce memory consumption and improve speed. As the streaming intersection is performed (details below), database intervals are associated with a query interval if there is an interval intersection. Once all intersections for a particular query interval are known, the annotation proceeds according to user-defined operations that are applied to the attribute data (e.g., the “score” column in a BED annotation file or an attribute in the INFO field of a VCF annotation file) within the database intervals. As a simple example, consider a query VCF of single nucleotide variants (SNVs) that was annotated by SNVs from an annotation database such as a VCF file of the dbSNP resource. In this case, the query and database variants are matched on position, REF, and ALT fields when available, and a value from the overlapping database interval (e.g., minor allele frequency) is carried forward to become the annotation stored in the INFO field of the query VCF. In a more complex scenario where a query structural variant intersects multiple annotation intervals from each database, the information from those intervals must be aggregated. One may wish to report each of the attributes as a comma-separated list via the “concat” operation. Alternatively, one could select the maximum allele frequency via the “max” operation. For cases where only a single database interval is associated with the query, the choice of operation will not affect the summarized value.

An example VCF INFO field from a single variant before and after annotation with vcfanno is shown in Fig. 1. A simple configuration file is used to specify both the source files and the set of attributes (in the case of VCF) or columns (in the case of BED or other tab-delimited formats) that should be added to the query file. In addition, the configuration file allows annotations to be renamed in the resulting VCF INFO field. For example, we can extract the allele frequency (AF) attribute from the ExAC VCF file [9] and rename it as “exac_aaf” in the INFO field of the VCF query records. The configuration file allows one to extract as many attributes as needed from any number of annotation datasets.
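As a sketch, the ExAC example above might be expressed in vcfanno's TOML configuration like this (the file path is a placeholder; fields, ops, and names are parallel arrays):

```toml
# One [[annotation]] block per database file. Extract the AF attribute
# from an ExAC VCF (path is a placeholder) and store it in the query
# INFO field under the name "exac_aaf".
[[annotation]]
file = "ExAC.vcf.gz"
fields = ["AF"]
ops = ["self"]
names = ["exac_aaf"]
```

Additional [[annotation]] blocks can be appended for each further annotation dataset.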

Overview of the vcfanno workflow. An unannotated VCF (a) is sent to vcfanno (b) along with a configuration file that indicates the paths to the annotation files, the attributes to extract from each file, and the methods that should be used to describe or summarize the values extracted from those files. The new annotations in the resulting VCF (c) are shown in blue text with additional fields added to the INFO column

Overview of the chrom-sweep algorithm

The chromosome sweeping algorithm (“chrom-sweep”) is an adaptation of the streaming, sort-merge join algorithm, and is capable of efficiently detecting interval intersections among multiple interval files, as long as they are sorted by both chromosome and interval start position. Utilized by both BEDTOOLS [10, 11] and BEDOPS [12], chrom-sweep finds intersections in a single pass by advancing pointers in each file that are synchronized by genomic position. At each step in the sweep, these pointers maintain the set of intervals that intersect a particular position and, in turn, intersect each other. This strategy is advantageous for large datasets because it avoids the use of data structures such as interval trees or hierarchical bins (e.g., the UCSC binning algorithm [13]). While these tree and binning techniques do not require sorted input, the memory footprint of these methods scales poorly, especially when compared with streaming algorithms, which typically exhibit low, average-case memory demands.

The chrom-sweep algorithm implemented in vcfanno proceeds as follows. First, we create an iterator of interval records for the query VCF and for each database annotation file. We then merge intervals from the query VCF and each annotation into a single priority queue, which orders the intervals from all files by chromosome and start coordinate, while also tracking the file from which each interval came. Vcfanno progresses by requesting an interval from the priority queue and inserts it into a cache. If the most recently observed interval is from the query VCF, we check for intersections with all database intervals that are currently in the cache. Since vcfanno requires that all files be sorted, we know that intervals are entering the cache ordered by start coordinate. Therefore, in order to check for overlap, we only need to check that the start of the new interval is less than the end of any of the intervals in the cache (assuming half-open intervals). An example of the sweeping algorithm is shown in Fig. 2 for a case involving two annotation files and three records from a single query VCF. The contents of the cache are shown as the sweep reaches the start of each new interval. When a new query interval enters the cache, any interval that does not intersect it is ejected from the cache. If the removed interval originated from the query VCF, it is sent, together with each of the intersecting annotation intervals, to be processed according to the operations specified in the configuration file. The resulting annotations are stored in the INFO field of the VCF file and the updated VCF record is reported as output.
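The sweep above can be sketched in a few dozen lines. This is a minimal single-chromosome illustration assuming half-open (start, end) intervals; all names are ours, and it omits the chromosome key, file iterators, and annotation operations of the real Go implementation:

```python
import heapq

def chrom_sweep(query, databases):
    """Single-chromosome sketch of the chrom-sweep described above.

    `query` and each list in `databases` hold half-open (start, end)
    intervals sorted by start.  Yields each query interval together with
    the database intervals that intersect it.
    """
    # Merge all sources into one stream ordered by start coordinate,
    # tagging each interval with its source (0 = the query VCF).
    streams = [[(s, e, 0) for s, e in query]]
    streams += [[(s, e, i + 1) for s, e in db]
                for i, db in enumerate(databases)]
    cache = []  # entries: [start, end, source, hits-or-None]
    for start, end, src in heapq.merge(*streams):
        if src == 0:
            # A new query interval: eject everything it cannot intersect,
            # reporting any ejected query intervals with their hits.
            kept = []
            for c in cache:
                if c[1] > start:
                    kept.append(c)          # still overlaps the sweep point
                elif c[2] == 0:
                    yield (c[0], c[1]), c[3]
            cache = kept
            # Every database interval left in the cache overlaps us,
            # because cached intervals all started at or before `start`.
            hits = [(c[0], c[1]) for c in cache if c[2] != 0]
            cache.append([start, end, 0, hits])
        else:
            # A database interval: attach it to overlapping cached queries.
            for c in cache:
                if c[2] == 0 and start < c[1]:
                    c[3].append((start, end))
            cache.append([start, end, src, None])
    for c in cache:                          # EOF: flush remaining queries
        if c[2] == 0:
            yield (c[0], c[1]), c[3]

hits = list(chrom_sweep([(5, 10), (20, 30)], [[(1, 6), (8, 25), (40, 50)]]))
print(hits)
# → [((5, 10), [(1, 6), (8, 25)]), ((20, 30), [(8, 25)])]
```

Note how (40, 50) is parsed but never intersects a query, which foreshadows the sparse-query limitation discussed later.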

Overview of the chrom-sweep interval intersection algorithm. The chrom-sweep algorithm sweeps from left to right as it progresses along each chromosome. Green intervals from the query VCF in the first row are annotated by annotation files A (blue) and B (orange) in the second and third rows, respectively. The cache row indicates which intervals are currently in the cache at each point in the progression of the sweeping algorithm. Intervals enter the cache in order of their chromosomal start position. First A1 enters the cache, followed by Q1. Since Q1 intersects A1, they are associated, as are Q1 and B1 when B1 enters the cache. Each time a new query interval enters the cache, any interval it does not intersect is ejected. Therefore, when Q2 enters the cache, Q1 and A1 are ejected. Since Q1 is a query interval, it is sent to be reported as output. Proceeding to the right, A2 and then Q3 enter the cache; the latter is a query interval, and so the intervals that do not overlap it—B1, Q2, and A2—are ejected from the cache along with the query interval, Q2, which is sent to the caller. Finally, as we reach the end of the incoming intervals, we clear out the final Q3 interval and finalize the output for this chromosome. EOF: End of File

Limitations of the chrom-sweep algorithm

Owing to the fact that annotation sets are not loaded into memory-intensive data structures, the chrom-sweep algorithm easily scales to large datasets. However, it does have some important limitations. First, it requires that all intervals from all annotation files adhere to the same chromosome order. While conceptually simple, this requirement is especially onerous since VCFs produced by variant callers such as GATK impose a different chromosome order (1, 2, …21, X, Y, MT) than most other numerically sorted annotation files, which would put MT before X and Y. Of course, sorting the numeric chromosomes as characters or integers also results in different sort orders. Discrepancies in chromosome ordering among files are often not detected until substantial computation has already been performed. A related problem arises when one file contains intervals from a given chromosome that the other does not: it is not possible to distinguish whether the chromosome order is different or whether that chromosome is simply not present in one of the files until all intervals are parsed.

Second, the standard chrom-sweep implementation is suboptimal because it is often forced to consider (and parse) many annotation intervals that will never intersect the query intervals, resulting in unnecessary work [14]. For example, given a VCF file of variants that are sparsely distributed throughout the genome (e.g., a VCF from a single exome study) and dense data sets of whole-genome annotations, chrom-sweep must parse and test each interval of the whole-genome annotations for intersection with a query interval, even though the areas of interest comprise less than 1 % of the regions in the file. In other words, sparse queries with dense annotation files represent a worst-case scenario for the performance of chrom-sweep because a high proportion of the intervals in the data sets will never intersect.

A third limitation of the chrom-sweep algorithm is that, due to the inherently serial nature of the algorithm, it is difficult to parallelize the detection of interval intersections and the single CPU performance is limited by the speed at which intervals can be parsed. Since the intervals arrive in sorted order, skipping ahead to process a new region from each file in a different processing thread is difficult without a pre-computed spatial index of the intervals and reporting the intervals in sorted order after intersection requires additional bookkeeping.

A parallel chrom-sweep algorithm

To address these shortcomings, we developed a parallel algorithm that concurrently chrom-sweeps “chunks” of query and database intervals. Unlike previous in-memory parallel sweeping methods that uniformly partition the input [15], we define (without the need for preprocessing [16]) chunks by consecutive query intervals that meet one of two criteria: either the set reaches the “chunk size” threshold or the genomic distance to the next interval exceeds the “gap size” threshold. Restricting the chunk size creates reasonably even work among the threads to support efficient load balancing (i.e., to avoid task divergence). The gap size cutoff is designed to avoid processing an excessive number of unrelated database intervals that reside between distant query intervals.
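The two chunk-closing criteria can be sketched on query start positions alone. The thresholds below are toy values chosen for illustration, not vcfanno's defaults:

```python
def make_chunks(starts, chunk_size=3, gap_size=100):
    """Group sorted query start positions into chunks.

    A chunk is closed when it reaches `chunk_size` intervals or when the
    genomic gap to the next start exceeds `gap_size`, mirroring the two
    criteria described above (toy thresholds, illustrative names).
    """
    chunks, current = [], []
    for pos in starts:
        if current and (len(current) >= chunk_size
                        or pos - current[-1] > gap_size):
            chunks.append(current)
            current = []
        current.append(pos)
    if current:
        chunks.append(current)
    return chunks

print(make_chunks([1, 10, 20, 30, 500, 510]))
# → [[1, 10, 20], [30], [500, 510]]
```

The first chunk closes on the size threshold; the second closes because of the 470-base gap before position 500.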

As soon as a chunk is defined, it is scheduled to be swept in parallel along with the other previously defined chunks. The bounds of the query intervals in the chunk determine the range of the intervals requested from each annotation file (Fig. 3). Currently these requests are to either a Tabix [17] indexed file or a BAM file via the bíogo package [18] but any spatial query can be easily supported. An important side effect of gathering database intervals using these requests is that, while the annotation files must be sorted, there is no need for the chromosome orders of the annotations to match. This, along with internally removing any “chr” prefix, alleviates the associated chromosome order and representation complexities detailed above. The set of intervals from these requests are integrated with the query intervals to complete the chunk, which is then processed by the standard chrom-sweep algorithm. However, in practice this is accomplished by streams so that only the query intervals are held in memory while the annotation intervals are retrieved from their iterators during the chrom-sweep. One performance bottleneck in this strategy is that the output should be sorted and, since chunks may finish in any order, we must buffer completed chunks to restore sorted order. This, along with disk speed limitations, is the primary source of overhead preventing optimal parallelization efficiency.
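The FIFO placeholder trick for keeping output sorted while chunks finish out of order can be sketched with a thread pool: results are collected in submission order, not completion order. The per-chunk worker below is a stand-in, not vcfanno's actual sweep:

```python
from concurrent.futures import ThreadPoolExecutor

def sweep_chunk(chunk):
    """Stand-in for the per-chunk chrom-sweep; here it just tags each query."""
    return [f"annotated:{q}" for q in chunk]

def parallel_sweep(chunks, workers=4):
    """Run chunks concurrently but emit results in submission order,
    like the FIFO placeholder queue described above."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(sweep_chunk, c) for c in chunks]  # FIFO order
        out = []
        for fut in futures:            # iterate in submission order,
            out.extend(fut.result())   # not completion order
        return out

print(parallel_sweep([["q1", "q2"], ["q3"], ["q4", "q5"]]))
# → ['annotated:q1', 'annotated:q2', 'annotated:q3', 'annotated:q4', 'annotated:q5']
```

Waiting on the oldest outstanding future is exactly the buffering overhead the text identifies as the main cost of restoring sorted order.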

Parallel sweeping algorithm. As in Fig. 2, we sweep across the chromosome from lower to higher positions (and left to right in the figure). The green query intervals are to be annotated with the two annotation files depicted with blue and orange intervals. The parallelization occurs in chunks of query intervals delineated by the black vertical lines. One process reads query intervals into memory until a maximum gap size to the next interval is reached (e.g., chunks 2, 4) or the number of intervals exceeds the chunk size threshold (e.g., chunks 1, 3). While a new set of query intervals accumulates, the first chunk, bounded to the right by the first vertical black line above, is sent for sweeping and a placeholder is put into a FIFO (first-in, first-out) queue, so that the output remains sorted even though other chunks may finish first. The annotation files are queried with regions based on the bounds of intervals in the query chunk. The queries then return streams of intervals and, finally, those streams are sent to the chrom-sweep algorithm in a new process. When it finishes, its placeholder can be pulled from the FIFO queue and the results are yielded for output

Vcfanno implementation

Vcfanno is written in Go, which provides a number of advantages. First, Go supports cross-compilation for 32- and 64-bit systems for Mac, Linux, and Windows. Go’s performance means that vcfanno can process large data sets relatively quickly. Go also offers a simple concurrency model, allowing vcfanno to perform intersections in parallel while minimizing the possibility of race conditions and load balancing problems that often plague parallel implementations. Moreover, as we demonstrate in the Results section, vcfanno’s parallel implementation of the chrom-sweep algorithm affords speed and scalability. Lastly, it is a very flexible tool because of its support for annotations provided in many common formats such as BED, VCF, GFF, BAM, and GTF.

%>% .$column_name equivalent for R base pipe |>

With the base pipe, no placeholder is provided for the data that is passed through the pipe. This is one difference between the magrittr pipe and the base R pipe. You may use an anonymous function to access the object.

The direct usage of $ in |> is currently disabled.

If the call of $ (or other functions disabled in |>) is still needed, an option, besides the creation of a function similar to the solution of @jay-sf, is to use $ via the function :: as base::`$`, or to place it in parentheses ((`$`)).

Another option can be the use of a bizarro pipe, ->.;. Some call it a joke, others a clever use of existing syntax.

This creates or overwrites . in the .GlobalEnv; rm(.) can be used to remove it. Alternatively, it could be processed in local().

In this case it produces two identical objects in the environment, iris and ., but as long as they are not modified they point to the same address.

What's new?

Unprecedented speed
Thanks to heavy use of bitwise operators, sequential memory access patterns, multithreading, and higher-level algorithmic improvements, PLINK 1.9 is much, much faster than PLINK 1.07 and other popular software. Several of the most demanding jobs, including identity-by-state matrix computation, distance-based clustering, LD-based pruning, haplotype block identification, and association analysis max(T) permutation tests, now complete hundreds or even thousands of times as quickly, and even the most trivial operations tend to be 5-10x faster due to I/O improvements.

We hasten to add that the vast majority of ideas contributing to PLINK 1.9's performance were developed elsewhere; in several cases, we have simply ported little-known but outstanding implementations without significant further revision (even while possibly uglifying them beyond recognition; sorry about that, Roman). See the credits page for a partial list of people to thank. On a related note, if you are aware of an implementation of a PLINK command which is substantially better than what we currently do, let us know; we'll be happy to switch to their algorithm and give them credit in our documentation and papers.

Nearly unlimited scale
The main genomic data matrix no longer has to fit in RAM, so bleeding-edge datasets containing millions of variant calls from exome- or whole-genome sequencing of tens of thousands of samples can be processed on ordinary desktops (and this processing will usually complete in a reasonable amount of time). In addition, several key sample x sample and variant x variant matrix computations (including the GRM mentioned below) can be cleanly split across computing clusters (or serially handled in manageable chunks by a single computer).

Command-line interface improvements
We've standardized how the command-line parser works, migrated from the original "everything is a flag" design toward a more organized flags + modifiers approach (while retaining backwards compatibility), and added a thorough command-line help facility.

Additional functions
In 2009, GCTA didn't exist. Today, there is an important and growing ecosystem of tools supporting the use of genetic relationship matrices in mixed model association analysis and other calculations; our contributions are a fast, multithreaded, memory-efficient --make-grm-gz/--make-grm-bin implementation which runs on macOS and Windows as well as Linux, and a closer-to-optimal --rel-cutoff pruner.

There are other additions here and there, such as cluster-based filters which might make a few population geneticists' lives easier, and a coordinate-descent LASSO. New functions are not a top priority for now (reaching 95%+ backward compatibility, and supporting dosage/phased/triallelic data, are more important), but we're willing to take time off from just working on the program core if you ask nicely.

Operating system         Development (8 Jun)   Alpha 2.3 final (24 Jan 2020)
Linux AVX2 Intel [1]     download              download
Linux 64-bit Intel [1]   download              download
Linux 32-bit             download              download
macOS AVX2               download              download
macOS 64-bit             download              download
Windows AVX2             download              download
Windows 64-bit           download              download
Windows 32-bit           download              download

[1]: These builds can still run on AMD processors, but they're statically linked to Intel MKL, so some linear algebra operations will be slow. We will try to provide an AMD Zen-optimized build as soon as supporting libraries are available.

Source code and build instructions are available on GitHub. (Here's another copy of the source code.)

Recent version history

8 Jun 2021: Fixed multiallelic-variant-writing bug (typically manifesting as a segmentation fault or assertion failure) that could occur with --sort-vars or under low-memory conditions.

25 May: .fa loader now tolerates blank lines. gzip files containing multiple streams or trailing garbage should be accepted again.

23 May: Fixed FID+IID+SID loading (recent builds were giving an incorrect "SID column does not immediately follow IID column" error).

5 May: Fixed --within bug introduced on 16 Jan.

20 Apr: --het cols= should now work properly.

16 Apr: --data/--gen now supports .gen files with 6 leading columns. This format can be exported with "--export oxford-v2".

14 Apr: --pmerge-list should no longer be limited by the system's #-of-open-files cap.

13 Apr: --glm local-covariate-handling bugfix. Fixed --pmerge[-list] bug that could cause the generated .pgen header to be invalid when multiallelic variants were present.

6 Apr: --data/--sample now recognizes column type 'C' as a synonym for 'P' (continuous phenotype). (This build has an incorrect "6 Mar" datestamp; sorry about that.)

28 Mar: --sample-counts chrX no-known-males bugfix.

25 Mar: --pmerge[-list] .bim-handling bugfix.

23 Mar: Unbreak --make-pgen + --sort-vars (this was broken by the 28 Feb build).

2 Mar: --pmerge[-list] bugfixes: no longer segfaults when all variants are at different positions; if the output .pvar file already exists, it's deleted first instead of appended to; if an input file covers multiple chromosomes, there is no longer a likely assert failure; fixed some issues with merging of same-position same-ID variants.

1 Mar: --pmerge-list-dir flag implemented (specifies a common directory prefix for all --pmerge-list entries).

28 Feb: --pmerge[-list] can now be used for concatenation-like jobs.
Note that it doesn't necessarily perform pure concatenation on a chromosome-split dataset: if two variants in a file have the same position and ID, they will be merged, in a way that's not compatible with 'split' multiallelic variants sharing a single ID (those must be merged with a dedicated 'join' operation, such as "bcftools norm -m +"). As a consequence, --pmerge[-list] defaults to erroring out when it detects such a split variant. One workaround is to use --set-all-var-ids to assign distinct IDs to each piece of the split variant.

3 Feb: Fixed .pvar loading bug that triggered when FILTER values were relevant at the same time as either INFO/PR or CM values. .ped-derived filesets containing variants where both REF and ALT are missing are permitted again (such variants were prohibited in recent builds). --vcf-ref-n-missing flag added to simplify re-import of .ped-derived VCFs. Removed extra tabs from --pgen-diff output.

23 Jan: --chr-set now sets MT to haploid. ##chrSet .pvar header line without the corresponding command-line flag now initializes chrX, chrY, and MT ploidy correctly.

18 Jan: ##chrSet .pvar header lines now conform to VCFv4.3 specification (an ID field is included). VCF/BCF export now performs more header validation.

16 Jan: --update-parents now works properly with 'maybeparents' output column sets in other commands.

14 Jan: --update-ids now works properly with 'maybefid' output column sets in other commands when it creates a FID column.

4 Jan: --normalize now properly skips missing and '*' alleles.

3 Jan: Fixed --bcf bug that affected unphased multiallelic variants (usually resulting in a spurious "GT half-call" crash, but if you suppressed that with --vcf-half-call the data would not be imported correctly). Fixed "--export bcf" bug that occurred on headers with FILTER/INFO/FORMAT keys with identical names, and a crash that occurred on variants with multiple FILTER failures. --output-missing-genotype/--output-missing-phenotype bugfixes/cleanup.

2 Jan: --pgen-diff multiallelic-variant handling bugfix. --pgen-diff DS comparison implemented. --adjust cols= parsing bugfix ('cols=+qq' should work now).

1 Jan 2021: Several SID-handling bugfixes. --sample-diff 'dosage' and 'id-delim=' modifier command-line parsing bugfixes. --sample-diff no longer omits later ALT alleles when they're absent from the samples being compared. --pgen-diff GT comparison implemented (generalization of PLINK 1.x --merge-mode 6/7).

12 Dec 2020: --q-score-range score-average, ALLELE_CT, DENOM, and NAMED_ALLELE_DOSAGE_SUM column bugfix.

28 Oct: Multipass "--export A" bugfix. If you've previously run plink2 "--export A" on a file too large to fit in memory, we recommend that you rerun with this build.

20 Oct: --fst Weir-Cockerham method implemented. --fst ids= and chrX bugfixes. --fst variant-report OBS_CT is now specific to population pair.

19 Oct: Linux binaries should now yield reproducible results across machines unless --native is specified (previously, Intel MKL could select processor-dependent code paths with different floating-point rounding behavior). --fst Hudson method implemented. Categories within categorical phenotypes are now reported in natural-sorted order. --variant-score MISSING_CT/OBS_CT bugfix.

23 Sep: --update-ids no-FID bugfix.

14 Sep: --glm + --parameters chrX/chrY bugfix.

31 Aug: --data/--sample now supports QCTOOLv2's .sample dialect. --export 'sample-v2' exports it.

27 Jul: --glm 'cc-residualize' implemented. Note that these approximations are not recommended if you have a significant number of missing genotypes.

25 Jul: --glm 'firth-residualize' modifier added. This implements the fast Firth approximation introduced in Mbatchou J et al. (2020) Computationally efficient whole genome regression for quantitative and binary traits.

6 Jul: --af-pseudocount flag implemented; this lets you specify a pseudocount other than 0 or 1 for allele frequency estimation.

1 Jul: --make-[b]pgen 'fill-missing-from-dosage' modifier implemented, to support algorithms that require no missing hardcalls.

27 Jun: --hardy/--hwe chrX multiallelic-variant handling bugfixes.

25 Jun: Replaced a misleading "No such file or directory" file-read error message.

15 Jun: --glm local-covar= no longer errors out on long RFMix2 header lines, as long as ID lengths are reasonable.

31 May: Added single-precision --variant-score mode.

11 May: Fixed --glm segfault that occurred when categorical covariates were present, but none had more than 2 categories.

9 Apr: Firth regression implementation now uses the same maxit=25 value as R logistf(). 'UNFINISHED' error code added to flag logistic/Firth regression results which would change with even more iterations.

28 Mar: Fixed --glm bug in 21 Mar build that caused segfaults when zero-MAF biallelic variants were present. --glm now errors out when no covariate file is specified, unless the 'allow-no-covars' modifier is specified.

21 Mar: Fixed --glm multiallelic-variant handling bugs that could occur when 'genotypic', 'hethom', 'dominant', 'recessive', 'interaction', or --tests was specified, and corrected 'dominant'/'recessive' documentation. It is no longer necessary to trim zero- (or other-constant-) dosage alleles from multiallelic variants to get --glm results for the other alleles.

14 Mar: --make-pgen/--make-just-pvar 'vcfheader' column set added (this makes it possible to directly generate a valid sites-only VCF). Bgzipping of the .pvar file is not directly supported, but you can use a named pipe to accomplish that with low overhead.

11 Mar: Fixed --glm segfault that could occur when no covariates were specified. VCF/BCF importers now default to compressing the temporary .pvar file, so that files with lots of INFO field content don't require a disproportionally large amount of free disk space to work with. --keep-autoconv now has a 'vzs' modifier to request compression of the .pvar file (and conversely, when --vcf/--bcf is used with bare --keep-autoconv, the .pvar is not compressed).

10 Mar: Fixed --make-pgen segfault that occurred when phased dosages were present without any phased hardcalls.

8 Mar: "--export bcf" implemented. VCF-export multiallelic HDS-force bugfixes. Added missing FILTER/fa header line to whole-genome 1000 Genomes phase 3 annotated .pvar files on Resources page.

25 Feb: --ld multiallelic-phased data handling bugfix.

22 Feb: --bcf n_allele=1 (ALT='.') bugfix.

19 Feb: --bcf GQ/DP-filtering bugfixes. --vcf and --bcf now enforce VCF contig naming restrictions.

11 Feb: "--vcf-half-call reference" works properly again (it was behaving like "--vcf-half-call error" in recent builds).

8 Feb: BGZF-compressed text files should now work properly with all commands that make multiple passes over the file (previously they worked with --vcf, but almost no other commands of this type). Named-pipe input to these commands should now consistently result in an error message in a reasonable amount of time; previously this could hang forever.

3 Feb: --missing-code now works properly with --haps.

24 Jan: Fixed --extract/--exclude bug that could occur when another variant filter was applied earlier in the order of operations (e.g. --snps-only, --max-alleles, --extract-if-info). This bugfix has been backported to alpha 2.

21 Jan: "--extract range" and "--exclude range" no longer error out when their input files contain a chromosome code absent from the current dataset.

16 Jan: --pca allele/variant weight multithreading bugfix.

14 Jan: --make-king-table rel-check bugfix.

3 Jan 2020: Fixed --extract-if-info/--exclude-if-info numeric-argument bug introduced in late October.

30 Dec 2019 (alpha 3): This makes the following potentially compatibility-breaking changes:

  • --write-snplist and --indep-pairwise require all variant IDs to be unique. For --write-snplist, this can be overridden by adding the 'allow-dups' modifier.
  • Oxford-format import (--bgen/--data/--gen) requires the REF/ALT mode to be explicitly declared.
  • --glm defaults to 'firth-fallback' mode for binary phenotypes. The old behavior can be requested with the 'no-firth' modifier.
  • --glm errors out, instead of just skipping the phenotype and printing a warning, when there's a linear dependency between the phenotype and the covariates. The old behavior can be requested with the 'skip' modifier.
  • --pca's 'var-wts' modifier has been replaced with 'allele-wts', which handles multiallelic variants properly. For datasets that contain only biallelic variants, the old output format can still be requested with 'biallelic-var-wts'.
  • PLINK 2 now errors out when you request an LD computation on a dataset with fewer than 50 founders. This can be overridden with --bad-ld.
  • --score's old NMISS_ALLELE_CT column (nonmissing allele count) has been renamed to ALLELE_CT, and the column set renamed accordingly, since in other contexts, 'nmiss' refers to the number of missing values, which is essentially the opposite.
  • --make-king-table's ID1/ID2 columns have been renamed to IID1/IID2, for consistency with other PLINK 2 commands.

In addition, the GRM computation (along with "--pca approx" and "--score variance-standardize") now handles multiallelic variants properly, instead of just collapsing all minor alleles together; --score allows each allele in a multiallelic variant to be assigned its own score; and --glm handles categorical covariates in a manner that's less likely to cause VIF overflow.

The final alpha 2 build has been tagged in GitHub, and will remain downloadable from here for the next several months.

29 Dec: Fixed a bug which affected processing of some heterozygous-double-ALT multiallelic variants, and a bug that caused ALT2/ALT3/etc. allele frequencies to not be properly initialized in some circumstances.

13 Dec: Fixed bug introduced in 22 Nov build which caused some reported dosages/counts (such as --freq's OBS_CT column) to be doubled. --loop-cats bugfixes.

28 Nov: Fixed a VCF half-call handling bug introduced last month.

26 Nov: Fixed recent bug which caused a segfault when no-duplicate-allowed variant ID lookup was performed with more than 16 threads.

25 Nov: Fixed bug that caused --sort-vars to segfault when the number of contigs was a multiple of 16. --keep-fcol and --extract-fcol were judged to be poopy names, and have been renamed to --keep-col-match and --extract-col-cond respectively (the old names will still work in this build).
The online documentation is now almost complete. The sidebar search box works.

22 Nov: Firth regression speed improvement. "--freq counts" now exports dosages with enough precision for --read-freq to perfectly reconstruct the original allele frequencies from the .acount file, and --read-freq has been modified to do that.

15 Nov: Fixed "--glm cols=+err" bug that could cause garbage output when 'hide-covar' was not specified. --covar-number retired (previously it was being incorrectly converted to --covar-col-nums, which does not have the same semantics).

12 Nov: All-vs.-all --make-king[-table] runs now handle MAF < 1% variants much more efficiently. --no-input-missing-phenotype option added. --variant-score now supports binary output.

10 Nov: Fixed bug introduced in 29 Oct build that caused a segfault when a 'NA'/'nan' phenotype or covariate value was encountered.

9 Nov: --variant-score (transpose of --score) implemented.

4 Nov: Restored "--export vcf" invalid-allele-code warning.

31 Oct: --split-cat-pheno 'omit-most' modifier implemented; it works better with --glm's built-in variance-inflation-factor check than 'omit-last', and --glm will switch to handling categorical covariates in this manner in alpha 3.

30 Oct: Fixed bug that caused --covar-col-nums and --covar iid-only to get mixed up. Stricter blank-line policy for most text input files: they're allowed at the end (since this happens every once in a while with manually edited files), but they're no longer allowed elsewhere. Removing the FILTER and/or INFO columns when generating a .pvar file (with e.g. 'pvar-cols=-info') now removes the corresponding header lines.

29 Oct: --q-score-range implemented. Strings which start with a number but contain nonnumeric content (e.g. "-123.4abc") now trigger an error when a floating-point number is expected; the example string was previously just parsed as -123.4.
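The behavior change resembles moving from C strtod-style prefix parsing to whole-token parsing; a hypothetical sketch of the two policies (not plink2's actual parser):

```python
import re

def parse_float_strict(token):
    # New behavior: the whole token must be numeric.
    return float(token)          # raises ValueError on "-123.4abc"

def parse_float_prefix(token):
    # Old behavior (strtod-like): consume the longest numeric prefix
    # and silently ignore any trailing text.
    m = re.match(r"[+-]?\d*\.?\d+(?:[eE][+-]?\d+)?", token)
    if m is None:
        raise ValueError(token)
    return float(m.group())

assert parse_float_prefix("-123.4abc") == -123.4
try:
    parse_float_strict("-123.4abc")
except ValueError:
    pass
else:
    raise AssertionError("strict parser should reject trailing text")
```

Erroring out on such tokens catches silently corrupted phenotype/covariate columns instead of analyzing truncated values.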

25 Oct: --make-king-table 'rel-check' modifier added; this has the same effect as it did for PLINK 1.9 --genome. --pca 'var-wts' modifier deprecated: switch to 'biallelic-var-wts' when your data contains only biallelic variants and you want to continue generating only one weight per variant. (Alpha 3 will introduce an 'allele-wts' modifier which generates one weight per allele instead; this is necessary to support multiallelic variants in an analytically sound manner.)

22 Oct: --recover-var-ids implemented. (This is designed to reverse --set-all-var-ids.)

20 Oct: --sample-counts implemented; this provides the main (non-indel) sample counts reported by "bcftools stats"'s -s flag, and is >100x as fast for plink2-formatted large datasets. --extract-fcol extended to support substring matches.

15 Oct: Fixed bug in 12 Oct Linux builds that caused plink2 to hang on --extract/--exclude/--snps and similar variant ID filters. Implemented --extract-fcol, which filters variants based on a TSV column (this is an extension of PLINK 1.x --qual-scores).

12 Oct: "--hwe 0" no longer removes a small number of very-low-HWE-p-value variants.

9 Oct: --pheno/--covar 'iid-only' modifier added, supporting headerless files with a single ID column. Windows BGZF compression is now multithreaded. Improved read-error messages.

6 Oct: Windows --silent bugfix. Source code now supports dynamic linking with libzstd (though performance may suffer if you don't build the multithreaded version of that library).

4 Oct: --king-table-subset + --parallel bugfix. Automatic Zstd text-file decompression was broken for a few commands by the 28 Sep build; that should work properly now.

3 Oct: Fixed BGZF decompression bugs in 28 Sep build. (This did not affect VCF → .bed/.pgen conversion, though some rarer use cases were affected.) SID-loading bugfix.

28 Sep: Mixed-provisional-reference bugfixes. --ref-allele/--alt1-allele/--update-map/--update-name skip-count bugfix. --glm local-covar line-skipping bugfix. Automatic-rename when an input filename matches an output filename should work properly again instead of erroring out (though it should still be avoided).

10 Sep: --glm joint test p-value bug fix. (This bug only affected runs where --tests was invoked with 4 or more predictors.)

26 Aug: --read-freq now prints a warning, instead of segfaulting or entering an infinite loop, when all variants have already been filtered out.

21 Aug: Fixed --ref-from-fa/--ref-allele + VCF export interaction that caused spurious 'PR' INFO flags to be reported.

10 Aug: Open-fail and write-fail error messages now include a more detailed explanation of what went wrong. --bgen, --data, and --gen now have a 'ref-unknown' modifier for explicitly specifying that neither the first nor last allele is consistently REF.

31 Jul: --score prints an error message instead of segfaulting when an input-file line is truncated. Fixed rare --glm bug that could cause all results to be reported as 'NA' when exactly one covariate is defined. .log files print '--out' and '--d' properly again (this was broken by the 24 Jul build). --glm now has an optional output column ('err') which reports the reason for each 'NA' coefficient.

8 Jul: --rm-dup/--sample-diff/--ld multiallelic variant bugfix.

5 Jul: --read-freq moved before usual allele frequency/count computation in order of operations. Loaded allele frequencies are not recomputed any more.

28 Jun: --king-table-subset should work properly again.

26 Jun: Fixed --glm multiallelic-variant bug that could cause one allele to be reported twice and one covariate test to be unreported, when neither 'hide-covar' nor 'intercept' was specified. Fixed issue that could cause --glm genotypic/hethom to segfault with no covariates.

17 Jun: Fixed rare underflow in --glm p-value computation which could cause an assertion failure.

27 May: Unbroke --adjust-file. "--export ind-major-bed" performance improvement.

12 May: Fixed --glm linear regression phenotype-batch handling bug that could cause a crash (or, on .bed-formatted data, generate incorrect results) on batches of size > 240.

29 Apr: BGEN 1.2/1.3 phased-dosage import bugfixes. --make-pgen + --dosage-erase-threshold without --hard-call-threshold no longer crashes.

28 Apr: PLINK 2-specific extensions to --update-ids and --update-parents simplified. --id-delim/--sample-diff 'sid' modifier for specifying that single-delimiter sample IDs should be interpreted as IID-SID changed to --iid-sid flag.

27 Apr: --haps bugfix for sample counts congruent to 17..31 (mod 32). This only affected the last few samples of the file, but if you used --haps with an earlier build, we strongly recommend rerunning it. --glm logistic regression 'SE' column renamed to LOG(OR)_SE when reporting odds ratio, to make it more obvious that the reported standard error does not use odds ratio units. --update-parents implemented.

2 Apr: Fixed --hwe bug that could cause chrY and MT variants to be improperly filtered. --glm 'pheno-ids' now works for groups of quantitative phenotypes.

1 Apr: --glm without --adjust now detects groups of quantitative phenotypes with the same "missingness pattern", and processes them together (with a large speed increase; but be careful re: disk space: you probably want to use the 'hide-covar' modifier, and 'zs' and/or --pfilter might also be useful). --glm linear regression local-covar= bugfix.

26 Mar: Minimac3-r2 computation bugfix. --glm no longer generates .id files listing all samples used for each phenotype, unless the 'pheno-ids' modifier is added. --update-ids implemented.

23 Mar: Fixed multiallelic-variant writer bug that could affect files where the largest number of alleles is 6 or 18. --minimac3-r2-filter and --freq minimac3r2 column implemented.

18 Mar: --write-covar can now be used when no covariates are loaded, if at least one phenotype is loaded and phenotype output was requested.

9 Mar: plink2 --version and --help no longer return nonzero exit codes.
A draft PGEN specification is now available.

6 Mar: Fixed allele frequency computation bug that could cause a spurious "Malformed .pgen file" error when a variant filter was active.

5 Mar: Multithreaded --extract/--exclude.

4 Mar: --tests linear-regression output bugfix.

3 Mar: Fix --glm odds-ratio printing bug introduced on 1 Mar.

2 Mar: More help text cleanup (now including online documentation).

1 Mar: --recode-allele implemented (and renamed to --export-allele for consistency). VCF import now errors out when a space-containing INFO value is imported. Brackets in command-line help text are now used in a manner more similar to other tools.

21 Feb: --glm joint tests are now based on F-statistics, for better small-sample accuracy.

20 Feb: --import-dosage-certainty now always produces a missing call, instead of falling back on the VCF GT field, when dosage certainty is inadequate. --extract-intersect flag added.

19 Feb: --glm works properly again with no covariates (it was exiting with a spurious "out of memory" error). --import-dosage-certainty now has the expected effect on single-valued dosages, instead of just genotype-probability triplets.

18 Feb: Fixed a bug that could cause --missing to crash on dosage data.

14 Feb: Command-line integer parameters can now use scientific notation.

12 Feb: Phased-dosage import bugfix.

2 Feb: --tests + --parameters bugfix.

31 Jan: --pca approx now errors out instead of reporting inaccurate results when the number of variants is too small relative to the number of PCs. --pca approx eigenvalue bugfix.

30 Jan: --glm covariate-scale error is now propagated properly, instead of producing a mysterious out-of-memory error message.

22 Jan: --glm now errors out and recommends adding --covar-variance-standardize when covariates vary enough in scale for numeric instability to be a major concern.

2 Jan 2019: Phased-dosage import bugfix.

27 Dec 2018: --ref-allele/--alt1-allele skipchar was broken for the past few months; it should work properly again. Fixed a bug which occurred when importing an all-noninteger-dosage variant.

28 Oct: --keep-fam/--remove-fam bugfix.

2 Oct: Fixed bug that could occur when loading very long text lines (e.g. VCF lines longer than 5 MB).

22 Sep: Fixed rare bug that could occur when processing variants out of order. --sample-diff command implemented.

12 Sep: --normalize 'list' modifier added.

11 Sep: --rm-dup 'list' modifier added, for listing all duplicated variant IDs. (This can be run as a standalone command.)

9 Sep: Fixed rare race condition in text decompressor that could cause input lines to be skipped. (We believe this was the cause of the VCF-import "File read failure" crashes reported over the last few months.)

8 Sep: Fixed VCF-export bug that could occur when extra ##contig header lines were present. --sort-vars bugfix. --normalize now detects when post-normalization variants are no longer in sorted order, and prints a warning in that case.

7 Sep: --ld bugfix for phased multiallelic variants. --rm-dup flag added (removes duplicate-ID variants, can check for genotype/INFO/etc. equality).

4 Sep: Fixed A1_CASE_FREQ and related columns in --glm output broken by recent multiallelic update. Cleaned up a few column names in --geno-counts and --hardy output.

31 Aug: Fixed --glm bug with handling constant and all-constant-but-1 covariates.

30 Aug: AVX2 and 32-bit --export bgen-1.2/1.3 bugfixes (mainly affects missing genotypes). "--export vcf-4.2" mode added for compatibility with programs (e.g. SNPTEST) which reject VCF 4.3 files. Exported VCFs should now have more appropriate ##contig headers when PAR1 and/or PAR2 are present in the input. Left-normalization (--normalize) flag added.

26 Aug: Last column of --pca .eigenvec header line is no longer omitted.

21 Aug: Fixed --mac/--max-mac 'nref' and 'alt1' mode bugs in yesterday's build.

20 Aug: Fixed "--vcf dosage=GP" bug introduced on 7 May; if you used any build from the last three-and-a-half months to import VCF FORMAT/GP data, rerun with a newer build. "--vcf dosage=GP" now errors out with a suitable message when the file also contains a FORMAT/DS field, and a 'dosage=GP-force' option has been added to cover the rare cases where importing the GP field might still be worthwhile. --maf/--max-maf/--mac/--max-mac now let you filter on nonmajor (default), non-reference, alt1, or minor allele frequencies/counts; you can use bcftools notation for this (e.g. "--min-af 0.01:minor"), but keep the different default in mind.
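To make the difference between the four modes concrete, here is a sketch of the definitions as described above (hypothetical helper names, not plink2 code); for a variant where REF is common and the ALT alleles are rare, the modes can select very different frequencies:

```python
def allele_freqs(counts):
    total = sum(counts)
    return [c / total for c in counts]

def freq_for_mode(counts, mode):
    # counts: observed allele counts; index 0 = REF, 1 = ALT1, 2 = ALT2, ...
    freqs = allele_freqs(counts)
    if mode == "nonmajor":     # default: everything except the major allele
        return 1.0 - max(freqs)
    if mode == "nref":         # non-reference: everything except REF
        return 1.0 - freqs[0]
    if mode == "alt1":         # first ALT allele only
        return freqs[1]
    if mode == "minor":        # bcftools-style: rarest single allele
        return min(freqs)
    raise ValueError(mode)

counts = [90, 8, 2]   # REF common, ALT1 rare, ALT2 rarer
assert abs(freq_for_mode(counts, "nref") - 0.10) < 1e-12
assert abs(freq_for_mode(counts, "alt1") - 0.08) < 1e-12
assert abs(freq_for_mode(counts, "minor") - 0.02) < 1e-12
assert abs(freq_for_mode(counts, "nonmajor") - 0.10) < 1e-12
```

For biallelic variants all four modes agree whenever REF is the major allele, which is why the distinction is easy to overlook until multiallelic or REF-minor sites appear.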

18 Aug: plink2-formatted 1000 Genomes phase 3 files, with phased haplotypes and annotations included, and a few corrections to the official pedigree (determined via KING-robust analysis), can now be downloaded from the Resources page. --king-cutoff can now handle sample ID files containing a header line.

16 Aug: --glm logistic regression now supports multiallelic variants. Fixed --glm linear-regression dosage handling bug in yesterday's build.

15 Aug: --glm linear regression now supports multiallelic variants. --ld bugfix. --parameters + "--glm interaction" now works properly when a covariate is only involved as part of an interaction.

9 Aug: --make-king[-table] singleton/monomorphic-variant optimization implemented.

7 Aug: GRM construction and --missing no longer break with multiallelic data.

6 Aug: VCF multiallelic(-phased) import and export implemented. --hwe now tests each allele separately for multiallelic variants. --min-alleles/--max-alleles filtering flags added.
(--glm doesn't support multiallelic variants yet; that update is planned for next week.)

30 Jul: --vcf-max-dp flag added.

26 Jul: --vcf-half-call should now work properly on unphased data.

25 Jul: Fixed --sort-vars/low-memory-make-pgen dosage-handling bug that could trigger unwanted hardcall thresholding. If you used a build from 14 Apr - 19 Jul 2018 to work with dosage data, the hardcalls may not have been thresholded correctly. Unfiltered dosage datasets imported by an affected build can be corrected by running --make-pgen + explicit --hard-call-threshold. Hardcall-based filters such as --geno/--mind should be rerun (after the hardcalls have been corrected).
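The thresholding rule itself is simple; a hypothetical sketch of --hard-call-threshold semantics as documented (a dosage within the threshold of an integer genotype becomes that hardcall, anything in between becomes a missing call):

```python
def hardcall_from_dosage(dosage, threshold=0.1):
    # Dosages within `threshold` of an integer genotype (0/1/2) become
    # that hardcall; anything farther away becomes a missing call (None).
    # Sketch only; boundary handling in plink2 may differ.
    nearest = round(dosage)
    if abs(dosage - nearest) <= threshold:
        return nearest
    return None   # missing

assert hardcall_from_dosage(1.97) == 2
assert hardcall_from_dosage(0.08) == 0
assert hardcall_from_dosage(1.5) is None   # too ambiguous to call
assert hardcall_from_dosage(0.9) == 1
```

This is why the bug mattered: filters like --geno/--mind operate on the hardcalls, so mis-thresholded hardcalls silently change which samples and variants pass.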

19 Jul: --update-alleles implemented.

16 Jul: Added more multithreaded-VCF-parse debug logging code.

13 Jul: Fixed chrX/Y/MT autoremoval bug in --make-king/--make-grm/--pca.

12 Jul: Unbroke --mach-r2-filter.

3 Jul: .fam/.psam files now load properly when only the IID column is requested or present.

29 Jun: .bim/.pvar files with more than 134 million variants load properly again (given sufficient memory).

25 Jun: Fixed a few odd-sample-count export cases which were broken around 30 May.

22 Jun: Fixed a few log messages which were broken in the 19-20 Jun builds. Added debug-print code to support an ongoing multithread-VCF-dosage-import bug investigation (if you are encountering mysterious "File read failure" errors during VCF import or "Malformed .pgen" errors when reading the result, adding "--threads 1" to your VCF-import command will probably solve your immediate problem, but if you can also send me a .log file from the failing multithreaded run (or even better, test data) that would be very helpful).

20 Jun: Fix GRM/PCA/score-computation bug introduced on 30 May. If you used the 30 May or an early June build for GRM/--pca/--score, you should repeat the operation(s) with this build; apologies for the error.

19 Jun: Fixed rare --ref-allele/--alt1-allele corner case which could occur when a missing allele was replaced with a very long allele.

5 Jun: VCF import uninitialized-variable bugfix. --score 'ignore-dup-ids' modifier added.

30 May: "--export haps[legend]" bugfixes and bgzip support. "--export vcf vcf-dosage=DS" no longer exports undeclared HDS values when phase information is present. Unbreak --import-dosage + --map, for real this time.

21 May: --pgen-info command added (displays basic information about a .pgen file, such as whether it has any phase or dosage data).

17 May: --import-dosage and .gen import were broken for the last several weeks; this should be fixed now. A1 column added to --adjust output in preparation for multiallelic variants. --glm 'a0-ref' modifier renamed to 'omit-ref'.

15 May: Fixed chrX allele frequency computation bug when dosages are present. --ld modified to be based on major instead of reference alleles, to play better with multiallelic variants. --hardy header line and allele columns changed in preparation for multiallelic variant support.

8 May: --vcf dosage=HDS should now handle files with no DS field properly.

7 May: Fixed rare I/O deadlock. Improved VCF-import parallelism.

4 May: Fixed --bgen import/export when dosage precision bits isn't a multiple of 8 (previously misinterpreted the spec for those cases, sorry about that).

3 May: --bgen can now import variant records with up to 28 bits of dosage precision (though only 15 bits will survive). "--export vcf-dosage=HDS-force" bugfix.

2 May: --vcf dosage= import no longer requires GT field to be present. Fixed potential --vcf dosage=HDS buffer overflow.

28 Apr: Fixed a --glm bug which occurred when autosomes and sex chromosome(s) were both present, or both chrX and chrY were present. If you performed a whole-genome --glm run with the 9 Feb 2018 build or later, you should rerun with the latest build. However, single-chromosome and autosome-only --glm runs were unaffected by the bug.

24 Apr: VCF phased-dosage import ("--vcf dosage=HDS") and export ("--export vcf vcf-dosage=HDS"). --pca and GRM computation now use correct variance for all-haploid genomes.

22 Apr: --export bgen-1.2/bgen-1.3 should now work for chrX/chrY/chrM; also fixed import bugs for those chromosomes.

16 Apr: --ref-from-fa contig line parsing bugfix.

14 Apr: --export bgen-1.2/bgen-1.3 implemented for autosomal diploid data. Operations like --pca which require decent allele frequencies now error out when frequencies are being estimated from fewer than 50 samples, unless you add the --bad-freqs flag. Phased dosage support implemented. Sample missingness rate in exported .sample files is now based on dosages rather than hardcalls. Non-AVX2 phase subsetting bugfix. --vcf + --psam bugfix. --vcf dosage= now ignores the hardcall when a dosage is present; instead, it's regenerated under --hard-call-threshold 0.1 (unless you specified a different threshold). --bgen 'ref-second' modifier renamed to 'ref-last', to generalize properly to multiallelic variants.

31 Mar: --export haps[legend] should now work properly when --ref-allele/--ref-from-fa/etc. flips some alleles in the same run.

29 Mar: --set-all-var-ids non-AVX2 bugfix. --pheno/--covar autonaming bugfix.

28 Mar: --bgen 1-bit phased haplotype import implemented.

26 Mar: --make-bed + --indiv-sort bugfix.

23 Mar: Windows builds should work properly again (the 20-21 Mar Windows builds were badly broken). --glm now supports log-pvalue output (add the 'log10' modifier), and these remain accurate below the double-precision floating point limit of p=5e-324.
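To see why log-scale output matters: p-values below about 5e-324 (the smallest positive double) underflow to exactly zero, while their log10 representation stays finite. A minimal illustration in plain Python (not plink2 code):

```python
import math

# The smallest positive double is ~5e-324; anything smaller underflows to 0.
assert 5e-324 > 0.0
assert 1e-324 == 0.0

# A probability accumulated as a direct product of small factors underflows...
p = 1.0
log10_p = 0.0
for _ in range(200):
    p *= 1e-3                      # falls below ~5e-324 and becomes 0.0
    log10_p += math.log10(1e-3)    # ...but the running log10 stays finite

assert p == 0.0
assert abs(log10_p - (-600.0)) < 1e-9   # i.e. p "=" 1e-600 on the log scale
```

Reporting -log10(p) directly therefore preserves the ordering of extremely significant associations that would otherwise all collapse to p=0.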

21 Mar: 3-column .sample file loading works properly again. Fixed a file-reading race condition.

20 Mar: Fix possible deadlock in recent builds when loading very long lines.

19 Mar: Fix --sample segfault in recent builds. .bgen import/export speed improvement. --oxford-single-chr wasn't extended correctly in the 4 Mar build; this should be fixed now.

11 Mar: Fix --pheno segfault in last week's builds that could occur when the file didn't have a header line.

9 Mar: Fix "File write failure" bug that occurred when a single write operation was larger than 2 GB (this could occur when running --make-bed with more than 128k samples). Reduced --make-bed memory requirement.

7 Mar: Fixed potential file-reading deadlock in recent builds (23 Feb or later).

5 Mar: --glm local-covar= should work properly again.

4 Mar: --oxford-single-chr can now be used on .bgen files. --make-pgen partially-phased data handling bugfix.

26 Feb: --keep/--remove/etc. should work properly now on IID-only files with no header line.

23 Feb: Fixed alpha 2 --vcf + --id-delim bug. Improved parsing speed for compressed VCF and .pvar files.

20 Feb: "--xchr-model 1" should work properly now.

16 Feb 2018 (alpha 2): This makes the following potentially compatibility-breaking changes:

  • FID is now an optional field: if it isn't in the input .psam file, it's omitted from several output files by default (these now have 'maybefid' and 'fid' column sets, where the default set includes 'maybefid'), and treated as always-'0' by any operation which requires FID values (such as --make-bed). When exporting genomic data files, 'maybefid' also treats the column as missing if all remaining values are '0'.
  • Relatedly, when importing sample IDs from a VCF or .bgen file, the default mode is now "--const-fid 0", and no FID column will be written to disk at all. --keep, --remove, and similar commands also now have "--const-fid 0" semantics when an input line contains only one token. You can now act as if IID is the only sample ID component, if that's what makes the most sense for your workflow. Conversely, it is now necessary to explicitly use --id-delim when you want to split the VCF/.bgen sample IDs into multiple components.
  • MT is treated as a haploid chromosome again. In PLINK 1.9 and earlier plink2 builds, MT was treated as diploid-ish to avoid throwing away information about heteroplasmic mutations; as a consequence, the --glm(/--linear/--logistic) genotype column and commands like "--freq counts" used a 0..2 scale. Now that plink2 has proper support for dosages, this kludge is no longer necessary.
  • --glm's 't' column set has been renamed to 'tz', to reflect it being a T-statistic for linear regression but a Wald Z-score for logistic/Firth. The corresponding column in .glm.logistic[.hybrid] and .glm.firth files now has 'Z_STAT' in the header line.

Also, --glm now defaults to regressing on minor instead of ALT allele dosages (this can be overridden with 'a0-ref').

The final alpha 1 build has been tagged in GitHub, and will remain downloadable from here for the next few months.

11 Feb: files now end in .id, for consistency with other output files with sample IDs and no other information. Similarly, --mind's output file now has the extension and defaults to having a header line. You can now use --no-id-header to suppress the header line (and force the columns to be FID/IID) in all .id output files.

10 Feb: --update-sex 'male0' option added, and custom column selection interface changed (now 'col-num='). --glm 'gcountcc' column names updated (now 'CASE_NON_A1_CT', 'CASE_HET_A1_CT', etc.) in preparation for switch to A1=major allele. --make-just-pvar + --ref-allele/--ref-from-fa no longer treats all initial reference alleles as provisional when the input .pvar has a header line.

9 Feb: Forcing .pvar QUAL/FILTER output when no such values are loaded no longer causes a segfault.

5 Feb: AVX2 phase-subsetting bugfix.

3 Feb: --score 'dominant' and 'recessive' modifiers added.

30 Jan: Fix .pgen writing bug which occurred when the number of variants was a multiple of 64 and the number of samples was large.

24 Jan: "--export oxford" now supports bgzipped output.

21 Jan: --glm now always reports an additional 'A1' column, indicating which allele(s) correspond to positive genotype column values. --glm column sets have been changed to revolve around A1 instead of ALT, so minor script modifications may be necessary when switching to this build.
In this build, A1 and ALT are still synonymous. This will change in alpha 2: A1 will default to the minor allele(s) to reduce multicollinearity (imitating PLINK 1.x's behavior in the absence of --keep-allele-order), though you will still have the option of forcing A1=ALT.

12 Jan: Fixed "--glm interaction" bug that occurred when multiple consecutive variants had no missing calls. We recommend redoing all --glm runs with the 'interaction' modifier which were performed with a build produced between 27 Nov 2017 and 10 Jan 2018 inclusive.

10 Jan: --adjust-file implemented (performs --adjust's multiple-testing correction on any association analysis file).

9 Jan: Added 'no-idheader' modifiers to a few commands, and made that the default for --make-grm-bin/--make-grm-list to avoid breaking interoperability.

7 Jan: --vcf can now be given a sites-only VCF when the run doesn't require genotype data. Sample ID files, such as those produced by --write-samples, now include a header line by default; this will be necessary to distinguish between FID-IID and IID-SID output in the future. (With --write-samples, you can suppress the header line by adding the 'noheader' modifier.)

5 Jan: --pheno-col-nums/--covar-col-nums implemented.

2 Jan 2018: --keep-fcol (equivalent to PLINK 1.x --filter) implemented.

19 Dec 2017: --adjust implemented. --zst-level implemented (lets you control Zstd compression level). Un-broke --rerun.

18 Dec: --extract/--exclude can now be used directly on UCSC interval-BED files (ok for coordinates to be 0-based or for no 4th column to be present). "--output-chr 26" now causes PAR1/PAR2 to be rendered as '25' (for humans), to restore interoperability with programs like ADMIXTURE which can't handle alphabetic chromosome codes. --merge-x implemented (usually needs to be combined with --sort-vars now). --pvar can usually handle 'sites-only' VCF files (e.g. those released by the gnomAD project) now. --thin, --thin-count, --thin-indiv, and --thin-indiv-count implemented.

16 Dec: Multithreaded zstd compression implemented (on Linux and macOS). --make-grm-gz renamed to --make-grm-list, and gzip mode removed.

15 Dec: Fixed --extract-if-info and --exclude-if-info's behavior for non-numeric values which start with a number. Existence-checking flags renamed to --require-info and --require-no-info for naming consistency.

13 Dec: --extract-if-info and --exclude-if-info flags added, for simple filtering on INFO key/value pairs or key existence.

11 Dec: --king-table-subset flag added. This makes it straightforward to perform two-stage relationship/duplicate detection: start with --make-king-table on a small number of higher-MAF variants scattered across the genome, and then rerun it with --king-table-subset on an appropriate subset of candidate sample pairs from the first stage. --bp-space implemented (useful for the first stage above).
The two-stage workflow was first implemented by Wei-Min Chen in a recent version of KING; contact him for citation information.

7 Dec: Fixed bug which could occur when filtering samples from a phased dataset. Windows AVX2 build now available.

28 Nov: --import-dosage 'format=infer' (this is now the default) and 'id-delim=' (needed for reimport of "--export A-transpose" data) options added. Fixed --import-dosage bug that caused it to error out on missing genotypes under format=1. --no-psam-pheno (or --no-pheno/--no-fam-pheno) can now be used to ignore all phenotypes in the sample file, while keeping the phenotype(s) in the --pheno file if one was specified.

27 Nov: Implemented fast path for --glm no-missing-genotype case (mainly affects linear regression). --make-king[-table] can now automatically handle matrices too large to fit in memory without explicit use of --parallel. AVX2 sample filtering performance improvement. --validate bugfix.

19 Nov: Fix VCF FORMAT/GT header line parsing bug introduced in 14 Nov build.

18 Nov: --make-king[-table] performance improvements.

16 Nov: Fixed bug in 14 Nov build that broke ##chrSet header line parsing.

14 Nov: Fixed bug that caused --export to hang when the number of variants was between 65 and about a thousand.

4 Nov: Linux and macOS prebuilt AVX2 binaries now available; these should work well on most machines built within the last 4 years. Fixed another Firth regression spurious NA bug. Fixed --score bug that occurred when sample filter(s) were applied simultaneously. Fixed a --ld phased-hardcall handling bug. Array-popcount upgrade in progress (thanks to recent work by Wojciech Muła, Nathan Kurz, Daniel Lemire, and Kim Walisch).

3 Nov: Fixed multipass --export bug. --dummy dosage-freq= now fills in hardcalls with the default --hard-call-threshold cutoff of 0.1 when --hard-call-threshold is not explicitly specified.

2 Nov: --export implemented (with dosage support). --dummy dosage-freq= modifier now works properly for dosage frequencies above 0.75.

16 Oct: --ref-from-fa flag implemented, to set reference alleles from a FASTA file. (Note that this may be unable to determine which allele is reference when length changes are involved, but it should always work for SNPs and multi-nucleotide polymorphisms.) --update-name implemented. Fixed column-set parsing bug in 13 Oct build.

13 Oct: Fixed --glm logistic/Firth regression bug which could produce spurious NA results.

9 Oct: Fixed --ld's handling of some dosage and haploid cases. Fixed bug which could cause --make-pgen to discard phase/dosage information when extracting a small variant subset. --geno-counts no longer double-reports chrY counts.

8 Oct: --ld implemented, with support for phased genotypes and dosages (try "--ld <var1> <var2> dosage"). Fixed tiny bgen-1.1 import bug that triggered when the number of threads exceeded the number of variants. Allele frequency computation no longer crashes on chrX when dosages are present but only hardcalls are needed.

1 Oct: Fixed GRM computation bug which sometimes caused segfaults when both dosages and missing values were present. --glm is now a bit faster when many covariates are present.

20 Sep: Firth regression Hessian matrix inversion step raised to double-precision, after last week's builds revealed that single-precision inversion could be unreliable.

15 Sep: --vif/--max-corr per-variant checks are now working. These are no longer skipped during logistic regression.

8 Sep: Alternative VCF INFO/PR fields are now tolerated. Removed debug code that slowed down yesterday's --make-pgen.

7 Sep: --score uninitialized memory bugfix. Partially-phased data handling bugfix.

6 Sep: Fix macOS stack size issue (could cause --pca and some other commands to crash in recent builds; the 1 Sep build had an incomplete workaround).

4 Sep: --[covar-]variance-standardize missing value handling bugfix. --ref-allele/--alt1-allele implemented (--a2-allele and --a1-allele are treated as aliases).

1 Sep: --quantile-normalize missing-phenotype handling bugfix.

29 Aug: --glm 'gcountcc' column set option added (reports genotype hardcall counts, stratified by case/control status). --write-samples command added (analogous to --write-snplist).

2 Aug: --sort-vars implemented.

25 Jul: --loop-cats now works properly with genotype-based variant filters.

24 Jul: Fixed "--pca approx" allele frequency handling bug introduced in the 4 Jun build; we recommend redoing any "--pca approx" runs performed with an affected build. (Regular --pca was not affected.) --loop-cats implemented (similar to PLINK 1.x --loop-assoc, except it's not restricted to association tests). VCF export now supports 'vcf-dosage=DS-force' mode. --dummy multithread + dosage bugfix.

17 Jul: BGEN v1.2/1.3 importer memory allocation bugfix. Size of failed allocation is now logged on most out-of-memory errors.

2 Jul: Improved multithreading in BGEN v1.2/1.3 importer. Python writer can now be called with multiple variants at a time.

25 Jun: Basic BGEN v1.2/1.3 import (unphased biallelic dosages; suffices for the main UK Biobank data release). --warning-errcode flag added (causes an error code to be returned to the OS on exit when at least one warning is printed).

20 Jun: --condition-list + variant filter bugfix.

5 Jun: --make-pgen memory requirement greatly reduced. End time now printed to console in most situations.

4 Jun: --hwe no longer causes a segfault when chrX is present and no gender information is available. Fixed --dummy bug.

29 May: --import-dosage format=1 bugfix.

26 May: --glm 'standard-beta' modifier replaced with --variance-standardize flag. --quantile-normalize function added. Fixed a missing-sex allele counting bug.

25 May: --hardy/--hwe works properly again when chrX is present but not at the beginning of the dataset.

22 May: Fixed major dosage data + sample-filter bug; we recommend rerunning any operations involving both dosage data and sample filtering performed with earlier plink2 builds. --score 'list-variants' modifier added.

19 May: Fixed a bug with allele frequency computation on dosage data when sample filter(s) are applied.

18 May: Many categorical phenotype-handling flags (--within, --keep-cats, --split-cat-pheno, etc.) implemented. Basic phenotype-based filtering implemented (e.g. "--remove-if PHENO1 '>' 2.5"; note that unnamed phenotypes are assigned the names 'PHENO1', 'PHENO2', etc., and that the '<' and '>' characters must be quoted in most shells). --write-covar implemented. --mach-r2-filter implemented, and raw MaCH r² values can be dumped with "--freq cols=+machr2".

11 May: --condition[-list] + --covar bugfix.

8 May: Fix quantitative phenotype/covariate loading bug introduced in 6 May build.

7 May: --import-dosage implemented.

6 May: Fixed bug which caused '0' to be treated as control instead of missing for binary phenotypes. Minor change to --glm's column headers, in preparation for multiallelic data.

2 May: --score bugfix. --maj-ref bugfix. --vcf-min-dp and "--export A-transpose" implemented.

1 May: VCF dosage import/export, --vcf-min-gq, and --read-freq implemented. --score can now work with standard errors. --autosome[-par] now works properly. SNPHWE2 and SNPHWEX functions relicensed as GPL-2+, to enable inclusion in the HardyWeinberg R package.

20 April: .sample export bugfix (didn't work if file was over 256 KB and no phenotypes were present). --dummy implemented (can now generate dosages).

19 April: --hardy/--hwe chrX bugfix (thanks to Jan Graffelman for catching the problem and validating the fix). --new-id-max-allele-len now has three modes ('error', 'missing', and 'truncate'), and the default mode is now 'error' (i.e. --set-missing-var-ids and --set-all-var-ids now error out when an allele code longer than 23 characters is encountered, instead of silently truncating). --score implemented, and extended to support variance-normalization and multiple score columns (these two features provide a simple way to project new samples onto previously computed principal components).

11 April: --pca var-wts bugfix, and --pca eigenvalue ordering bugfix. --glm linear regression and --condition[-list] support added. --geno/--mind/--missing/--genotyping-rate can now refer to missing dosages instead of just missing hardcalls (note that, when importing dosage data, dosages in (0.1, 0.9) and (1.1, 1.9) are saved but there usually won't be associated hardcalls).


These filtering expressions are accepted by most of the commands.

Valid expressions may contain:

numerical constants, string constants, file names (this is currently supported only to filter by the ID column)

regular expressions, matched with the "~" operator. The expressions are case sensitive unless "/i" is added.

logical operators. See also the examples below and the filtering tutorial about the distinction between "&&" vs "&" and "||" vs "|".

INFO tags, FORMAT tags, column names

starting with 1.11, the FILTER column can be queried as follows:

1 (or 0) to test the presence (or absence) of a flag

missing genotypes can be matched regardless of phase and ploidy (".|.", "./.", ".", "0|.") using these expressions

missing genotypes can be matched including the phase and ploidy (".|.", "./.", ".") using these expressions

sample genotype: reference (haploid or diploid), alternate (hom or het, haploid or diploid), missing genotype, homozygous, heterozygous, haploid, ref-ref hom, alt-alt hom, ref-alt het, alt-alt het, haploid ref, haploid alt (case-insensitive)
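The genotype categories above can be illustrated with a small Python sketch (classify_gt is a hypothetical helper, not part of bcftools; it mimics the loose matching described above, where any '.' allele counts as missing):

```python
import re

def classify_gt(gt):
    # Alleles are separated by '/' (unphased) or '|' (phased);
    # the pipe is simply the phased-genotype separator.
    alleles = re.split(r"[/|]", gt)
    if any(a == "." for a in alleles):
        return "missing"   # matched regardless of phase and ploidy
    if len(alleles) == 1:
        return "haploid"
    return "hom" if len(set(alleles)) == 1 else "het"

classify_gt("0|1")  # 'het'
```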

TYPE for variant type in REF,ALT columns (indel,snp,mnp,ref,bnd,other,overlap). Use the regex operator "~" to require at least one allele of the given type, or the equal sign "=" to require that all alleles are of the given type, e.g. TYPE="snp" versus TYPE~"snp"

array subscripts (0-based), "*" for any element, "-" to indicate a range. Note that for querying FORMAT vectors, the colon ":" can be used to select a sample and an element of the vector, as shown in the examples below

with many samples it can be more practical to provide a file with sample names, one sample name per line

function on FORMAT tags (over samples) and INFO tags (over vector fields): maximum (MAX), minimum (MIN), arithmetic mean (AVG is synonymous with MEAN), median (MEDIAN), standard deviation from mean (STDEV), sum (SUM), string length (STRLEN), absolute value (ABS), number of elements (COUNT):

Note that functions above evaluate to a single value across all samples and are intended to select sites, not samples, even when applied on FORMAT tags. However, when prefixed with SMPL_ (or "s" for brevity, e.g. SMPL_MAX or sMAX), they will evaluate to a vector of per-sample values when applied on FORMAT tags:
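The difference between the site-level and SMPL_-prefixed forms can be sketched in Python (hypothetical helpers; bcftools implements this internally):

```python
def fmt_max(per_sample_values):
    # MAX: a single value across all samples (used for site selection)
    return max(v for sample in per_sample_values for v in sample)

def smpl_max(per_sample_values):
    # SMPL_MAX / sMAX: one value per sample
    return [max(sample) for sample in per_sample_values]

dp = [[3, 7], [10, 2], [5, 5]]   # a vector-valued FORMAT tag, one list per sample
fmt_max(dp)    # 10
smpl_max(dp)   # [7, 10, 5]
```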

two-tailed binomial test. Note that for N=0 the test evaluates to a missing value and when FORMAT/GT is used to determine the vector indices, it evaluates to 1 for homozygous genotypes.
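A self-contained sketch of a two-tailed binomial test with the N=0 convention described above (the function name, p=0.5 default, and tolerance are illustrative, not the bcftools implementation):

```python
from math import comb

def binom_two_tailed(k, n, p=0.5):
    if n == 0:
        return None   # evaluates to a missing value, as described above
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    obs = pmf[k]
    # Two-tailed: sum the probabilities of all outcomes at least as
    # extreme (i.e. no more likely) than the observed count.
    return sum(q for q in pmf if q <= obs + 1e-12)

binom_two_tailed(2, 10)   # small p-value for a skewed 2:8 split
```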

variables calculated on the fly if not present: number of alternate alleles (N_ALT), number of samples (N_SAMPLES), count of alternate alleles (AC), minor allele count (MAC), frequency of alternate alleles (AF=AC/AN), frequency of minor alleles (MAF=MAC/AN, which is always at most 0.5), number of alleles in called genotypes (AN), number of samples with missing genotype (N_MISSING), fraction of samples with missing genotype (F_MISSING), indel length (ILEN; deletions negative, insertions positive)
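As a sketch of how these on-the-fly values relate to each other (site_stats is a hypothetical helper; diploid GTs and a biallelic site are assumed):

```python
def site_stats(genotypes):
    alleles = [a for gt in genotypes
                 for a in gt.replace("|", "/").split("/")]
    called = [a for a in alleles if a != "."]
    an = len(called)                              # AN
    ac = sum(1 for a in called if a != "0")       # AC (alternate allele count)
    mac = min(ac, an - ac)                        # MAC
    missing = sum(all(x == "." for x in gt.replace("|", "/").split("/"))
                  for gt in genotypes)
    return {"AN": an, "AC": ac, "AF": ac / an, "MAC": mac,
            "MAF": mac / an, "N_MISSING": missing,
            "F_MISSING": missing / len(genotypes)}

site_stats(["0/1", "1|1", "./.", "0/0"])
# {'AN': 6, 'AC': 3, 'AF': 0.5, 'MAC': 3, 'MAF': 0.5,
#  'N_MISSING': 1, 'F_MISSING': 0.25}
```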

the number (N_PASS) or fraction (F_PASS) of samples which pass the expression

custom perl filtering. Note that this command is not compiled in by default, see the section Optional Compilation with Perl in the INSTALL file for help and misc/ for a working example. The demo defined the perl subroutine "severity" which can be invoked from the command line as follows:

Comma in strings is interpreted as a separator and when multiple values are compared, the OR logic is used. Consequently, the following two expressions are equivalent but not the third:

When querying multiple values, all elements are tested and the OR logic is used on the result. For example, when querying "TAG=1,2,3,4", it will be evaluated as follows:
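The OR semantics can be sketched as follows (match_multi is a hypothetical helper, not a bcftools function):

```python
def match_multi(tag_values, query):
    # TAG=1,2,3,4 is true if ANY element of the (possibly vector-valued)
    # tag equals ANY of the comma-separated query values.
    wanted = query.split(",")
    return any(str(v) in wanted for v in tag_values)

match_multi([2], "1,2,3,4")     # True
match_multi([5, 3], "1,2,3,4")  # True  (3 matches)
match_multi([5, 6], "1,2,3,4")  # False
```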

Shell expansion:

Note that expressions must often be quoted because some characters have special meaning in the shell. Here is an example of an expression enclosed in single quotes, which causes the whole expression to be passed to the program as intended:

Please refer to the documentation of your shell for details.

Code Multiple Records After You Collect the Data in a Batch

The NIOSH Industry and Occupation Computerized Coding System (NIOCCS) is a free, web-based software application that translates industry and occupation text to standardized industry and occupation codes. NIOCCS codes large batches of industry and occupation data you have already collected.

1. Go to the NIOCCS page. If you have a large number of records to code, you’ll need to register for a Secure Access Management Service (SAMS) account. To register, send your first and last name and email address to [email protected]. If you have only a few records to code, you can use the single record coder, which does not require an account.

Provide the industry and occupation text you need coded. If you have only a few records, you can enter these without uploading a file. If you have a lot of records, it’s fastest to upload the information in a file format. Files uploaded into NIOCCS must be in a standard .txt file format delimited by a Tab or Pipe character (|) and must contain at least:

Each record submitted must have a value in the ID field and must have at least one value in either the Industry Title or Occupation Title – an example of what the file could look like is shown here:
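A minimal sketch of building such an upload file in Python (the record values are made up; per the requirements above, each row needs an ID plus at least one of the two title columns):

```python
import csv, io

records = [
    ("1", "Grocery store", "Cashier"),
    ("2", "Hospital", "Registered nurse"),
    ("3", "", "Truck driver"),          # occupation title only is acceptable
]

buf = io.StringIO()
# Pipe-delimited plain text, as NIOCCS expects (Tab would also work)
writer = csv.writer(buf, delimiter="|", lineterminator="\n")
writer.writerow(["ID", "Industry Title", "Occupation Title"])
writer.writerows(records)
print(buf.getvalue())
```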

2. Code using NIOCCS. NIOCCS automatically codes all of the records entered, though it is important that you enter good information. If there are problems with the job descriptions you enter, like misspellings or incomplete descriptions, your output won’t be as good.

3. Download your results. Once the coding process is complete, you can download your coded output file, which includes the original uploaded data or input data fields, plus standardized Census, NAICS, and SOC industry and occupation codes.



Generates a random company name, comprised of a lorem ipsum word and an appropriate suffix, like Dolor Inc., or Convallis Limited.

This Data Type generates a random SIRET/SIREN French business identification number.



More info:

Generates a personal number, used in some countries for social security insurance. At the present time only Swedish ones are supported. The personal numbers are generated according to the format you specify:



Generates organisation numbers, used in some countries for registration of companies, associations etc. At the present time only Swedish ones are supported. The organisation numbers are generated according to the format you specify:



Generates random Canadian provinces, states, territories or counties, based on the options you select. The Full Name and Abbreviation sub-options determine whether the output will contain the full string (e.g. "British Columbia") or its abbreviation (e.g. "BC"). For UK counties, the abbreviation is the standard 3-character Chapman code.

This data type generates a random latitude and/or longitude. If both are selected, it displays both separated by a comma.

This data type generates random, valid credit card numbers according to the format you specify. It is currently capable of generating numbers for the following brands: Mastercard, Visa, Visa Electron, American Express, Discover, American Diner's, Carte Blanche, Diner's Club International, JCB, Maestro, Solo, Switch, Laser.

Generates a random credit card PIN number from 1111 to 9999.

Generates a random credit card CVV number from 111 to 999.

This option generates a fixed number of random words, pulled from the standard lorem ipsum latin text.

This option generates a random number of words - the total number within the range that you specify (inclusive). As with the Fixed number option, the words are pulled from the standard lorem ipsum Latin text.

This Data Type lets you generate random alpha-numeric strings. The following table contains the character legend for this field. Any other characters you enter into this field will appear unescaped.

Generates a Boolean value in the format you need. You can specify multiple formats by separating them with the pipe (|) character. The following strings will be converted to their Boolean equivalent:

  • Yes or No
  • False or True
  • 0 or 1
  • Y or N
  • F or T
  • false or true

true and false values are special. Depending on the export type, these may be output without double quotes.
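A rough re-implementation of that behavior (the pairing table is an assumption inferred from the list above; the real generator may differ):

```python
import random

# Each documented spelling paired with its opposite
PAIRS = {
    "Yes": ("Yes", "No"), "No": ("Yes", "No"),
    "True": ("True", "False"), "False": ("True", "False"),
    "true": ("true", "false"), "false": ("true", "false"),
    "1": ("1", "0"), "0": ("1", "0"),
    "Y": ("Y", "N"), "N": ("Y", "N"),
    "T": ("T", "F"), "F": ("T", "F"),
}

def random_bool(format_spec, rng=random):
    # Pick one of the pipe-separated formats, then emit its true or
    # false spelling at random.
    token = rng.choice(format_spec.split("|"))
    truthy, falsy = PAIRS[token]
    return truthy if rng.random() < 0.5 else falsy
```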

Generates a column that contains a unique number on each row, incrementing by whatever value you enter. This option can be helpful for inserting the data into a database field with an auto-increment primary key.

The optional placeholder string lets you embed the generated increment value within a string, via the placeholder. For example:
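The behavior might look like this sketch (the "{$INCR}" placeholder token and the helper name are assumptions for illustration):

```python
def auto_increment(rows, start=1, step=1, placeholder=None):
    value = start
    for _ in range(rows):
        if placeholder is None:
            yield value
        else:
            # Embed the current increment value in the placeholder string
            yield placeholder.replace("{$INCR}", str(value))
        value += step

list(auto_increment(3, start=5, step=10))           # [5, 15, 25]
list(auto_increment(2, placeholder="ID-{$INCR}"))   # ['ID-1', 'ID-2']
```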

This randomly generates a number between the values you specify. Both fields allow you to enter negative numbers.

This data type generates random currency values, in whatever format and range you want. The example dropdown contains several options so you can get a sense of how it works, but here's what each of the options means.


  • Range - From
  • Range - To
  • Currency Symbol


This data type lets you generate a column of data that has repeating values from row to row. Here's a couple of examples to give you an idea of how this works.

  • If you'd like to provide the value "1" for every row, you can enter "1" in the Value(s) field and any value (>0) in the Loop Count field.
  • If you'd like to have 100 rows of the string "Male" followed by 100 rows of the string "Female" and repeat, you can enter "100" in the Loop Count field and "Male|Female" in the Value(s) field.
  • If you'd like 5 rows of 1 through 10, enter "5" for the Loop Count field, and "1|2|3|4|5|6|7|8|9|10" in the Value(s) field.

Try tinkering around with it. You'll get the idea.
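The Male/Female example above can be sketched as (repeating_values is a hypothetical helper):

```python
from itertools import cycle, islice

def repeating_values(values, loop_count, total_rows):
    # Repeat each pipe-separated value loop_count times, then cycle
    # through the whole block until total_rows rows are produced.
    block = [v for v in values.split("|") for _ in range(loop_count)]
    return list(islice(cycle(block), total_rows))

repeating_values("Male|Female", 2, 6)
# ['Male', 'Male', 'Female', 'Female', 'Male', 'Male']
```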

The Composite data type lets you combine the data from any other row or rows, and manipulate it, change it, combine the information and more. The content should be entered in the Smarty templating language.

    To output the value from any row, just use the placeholders {$ROW1}, {$ROW2}, etc. You cannot refer to the current row - that would either melt the server and/or make the universe implode.

  • Display a value from row 6: {$ROW6}
  • Assuming row 1 and row 2 contain random numbers, the following are examples of some simple math:
    • {$ROW1-$ROW2} - subtraction
    • {$ROW1*$ROW2} - multiplication
    • {$ROW2/$ROW1} - division

    Please see the Smarty website for more information on the syntax.

    This data type lets you generate tree-like data in which every row is a child of another row - except the very first row, which is the trunk of the tree. This data type must be used in conjunction with the Auto-Increment data type: that ensures that every row has a unique numeric value, which this data type uses to reference the parent rows.

    The options let you specify which of your form fields is the appropriate auto-increment field and the maximum number of children a node may have.

    Enter a list of items, separated by a pipe | character. Then select whether you want Exactly X number of items, or At most X items from the list. Multiple items are returned in a comma-delimited list in the results. If you want your data set to include empty values, just add one or more pipe characters at the end - the more pipes you enter, the greater the probability of an empty value being generated.
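The trailing-pipe trick can be sketched like this (pick_items is a hypothetical helper showing the "Exactly X" case):

```python
import random

def pick_items(spec, n, rng=random):
    # 'red|green||' splits into ['red', 'green', '', ''] -- each trailing
    # pipe adds another empty entry, raising the odds of an empty value
    # (here, 2 empties out of 4 items = 50% chance per pick).
    items = spec.split("|")
    return [rng.choice(items) for _ in range(n)]

pick_items("red|green||", 5)  # five picks; roughly half will be empty
```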

    The Computed Data Type gives you access to the metadata about fields in the row to let you generate whatever output you want based on that information. If you just need to access the generated string value from another field (i.e. what you see in the output), see the Composite Data Type. This field type gives you much more access to each field.

    {$ROW1}, {$ROW2}, etc. contain everything available about that particular row. The content changes based on the row's Data Type and what has been generated, but high-level it contains the following properties:

    • - whatever options were entered in the interface/API call for the row
    • - any additional metadata returned for the Data Type
    • - the actual generated random content for this field (always in a "display" property) plus any other information about the generated content
    • - a handy JSON-serialization of everything in the row, so you can see what's available. Just run it through a JSON formatter.
    • - will output the gender ("male", "female" or "unknown") of the generated content of a Names Data Type field (be sure to replace "1" with the right row number!). If you used FemaleName as the placeholder string this variable will return "female" every time. If you entered "Name", the value returned will depend on the generated string. If you entered a placeholder string with multiple formats, it will return "unknown" if it contained both genders, or no genders (e.g. a surname without a first name).

    De-nied. In order to share this Data Set with other people, you need to save it first.

    I understand that to share this Data Set, I need to make it public.

    Email the user their login information

    Are you sure you want to delete this user account?

    First Name
    Last Name

    You have bundling/minification enabled. If you click the Reset Plugins button you will need to run grunt to recreate the bundles. For more information read this documentation page. If you have any problems, you may want to turn off bundling.


    Ever needed custom formatted sample / test data, like, bad? Well, that's the idea of this script. It's a free, open source tool written in JavaScript, PHP and MySQL that lets you quickly generate large volumes of custom data in a variety of formats for use in testing software, populating databases, and so on and so forth.

    This site offers an online demo where you're welcome to tinker around to get a sense of what the script does, what features it offers and how it works. Then, once you've whetted your appetite, there's a free, fully functional, GNU-licensed version available for download. Alternatively, if you want to avoid the hassle of setting it up on your own server, you can donate $20 or more to get an account on this site, letting you generate up to 5,000 records at a time (instead of the maximum 100) and save your data sets. Click on the Donate tab for more information.

    Extend it

    The out-the-box script contains the sort of functionality you generally need. But nothing's ever complete - maybe you need to generate random esoteric math equations, pull random tweets or display random images from Flickr with the word "Red-backed vole" in the title. Who knows. Everyone's use-case is different.

    With this in mind, the new version of the script (3.0.0+) was designed to be fully extensible: developers can write their own Data Types to generate new types of random data, and even customize the Export Types - i.e. the format in which the data is output. For people interested in generating more accurate localized geographical data, they can add new Country plugins that supply region names (states, provinces, territories etc), city names and postal/zip code formats for their country of choice. For more information on all this, visit the Developer Documentation.


    Click the button below to download the latest version of the script from github. For more information see the User Documentation.

    Project News

    User Accounts

    This section lets you create any number of user accounts to allow people access to the script. Only you are able to create or delete accounts.

    No user accounts added yet.

    Donate now!

    If this has helped you in your work, a donation is always appreciated! If a general sense of do-goodery isn't enough to persuade you to donate, here are a few more material incentives:

    • Supporting the project leads to great new features! Honest!
    • Donating $20 or more will get you a user account on this website. With a user account you can:
      • Generate up to 10,000 rows at a time instead of the maximum 100.
      • Save your form configurations so you don't have to re-create your data sets every time you return to the site.

      Every $20 you donate adds a year to your account. You may return at a later date to add more time to your account - it will be added to the end of your current time. Just be sure to donate with the same email address. If you have any trouble donating or with your user account, just drop me a line.

      After donating, you will be emailed with details about how to finish setting up your account (check your spam folder!). If you have any problems, please contact me.
