Quick Start¶
We first download the example.bgen
, haplotypes.bgen
, and complex.bgen
files:
>>> from bgen_reader import example_filepath
>>>
>>> filepath = {}
>>> filepath["example.bgen"] = example_filepath("example.bgen")
>>> filepath["haplotypes.bgen"] = example_filepath("haplotypes.bgen")
>>> filepath["complex.bgen"] = example_filepath("complex.bgen")
The above-mentioned files are read by this software in the same way but they present
different levels of genotype complexity.
The variant loci of the genotype stored in example.bgen
have the same ploidy equal
to two and are all unphased.
The haplotypes.bgen
file stores a phased, diploid genotype.
Both genotypes stored in example.bgen
and haplotypes.bgen
have the same number
of alleles for each loci.
The complex.bgen
file, on the other hand, have different ploidy levels and number of
alleles across different variant loci.
Unphased genotype¶
With unphased genotype we don’t have the information about which chromosome holds a
given allele.
Let’s read the example.bgen
file and print out some information.
>>> from bgen_reader import read_bgen
>>>
>>> bgen = read_bgen(filepath["example.bgen"], verbose=False)
>>>
>>> # Variants metadata.
>>> print(bgen["variants"].head())
id rsid chrom pos nalleles allele_ids vaddr
0 SNPID_2 RSID_2 01 2000 2 A,G 6069
1 SNPID_3 RSID_3 01 3000 2 A,G 9635
2 SNPID_4 RSID_4 01 4000 2 A,G 12956
3 SNPID_5 RSID_5 01 5000 2 A,G 16034
4 SNPID_6 RSID_6 01 6000 2 A,G 19377
>>> # Samples read from the bgen file.
>>> print(bgen["samples"].head())
0 sample_001
1 sample_002
2 sample_003
3 sample_004
4 sample_005
Name: id, dtype: object
>>> # There are 199 variants in total.
>>> print(len(bgen["genotype"]))
199
>>> # This library avoid as much as possible accessing the bgen file for performance
>>> # and memory reasons. The `compute` function actually tells the library to
>>> # access the file to retrieve some data.
>>> geno = bgen["genotype"][0].compute()
>>> print(geno.keys())
dict_keys(['probs', 'phased', 'ploidy', 'missing'])
>>> # Let's have a look at the probabilities regarding the first variant.
>>> print(geno["probs"])
[[ nan nan nan]
[0.02780236 0.00863674 0.9635609 ]
[0.01736504 0.04968414 0.93295083]
...
[0.01419069 0.02810669 0.95770262]
[0.91949463 0.05206298 0.02844239]
[0.00244141 0.98410029 0.0134583 ]]
>>> # The above matrix is of size samples-by-(combination-of-alleles).
>>> print(geno["probs"].shape)
(500, 3)
The columns dimension of the probability matrix of a given variant depend on the number of alleles, the ploidy, and whether the locus is phased or unphased. Please, refer to bgen specification⧉ for a detailed description.
Phased genotype¶
>>> bgen = read_bgen(filepath["haplotypes.bgen"], verbose=False)
>>>
>>> print(bgen["variants"].head(4))
id rsid chrom pos nalleles allele_ids vaddr
0 SNP1 RS1 1 1 2 A,G 102
1 SNP2 RS2 1 2 2 A,G 159
2 SNP3 RS3 1 3 2 A,G 216
3 SNP4 RS4 1 4 2 A,G 273
>>> print(bgen["samples"].head())
0 sample_0
1 sample_1
2 sample_2
3 sample_3
Name: id, dtype: object
>>> # Print the estimated probabilities for the first variant
>>> # and second individual.
>>> geno = bgen["genotype"][0].compute()
>>> print(geno["probs"][1])
[0. 1. 1. 0.]
>>> # Is it a phased one?
>>> print(geno["phased"])
1
>>> # How many haplotypes for each sample?
>>> print(geno["ploidy"])
[2 2 2 2]
>>> # And how many alleles?
>>> variant = bgen["variants"].compute()
>>> print(variant.loc[0, "nalleles"])
2
>>> # Therefore, the first haplotype has probability 100%
>>> # of having the allele
>>> alleles = variant.loc[0, "allele_ids"].split(",")
>>> print(alleles[1])
G
>>> # And the second haplotype has probability 100% of having
>>> # the first allele
>>> print(alleles[0])
A
Please, refer to bgen specification⧉ for a detailed description.
Complex file¶
The bgen file format allows the storage of very heterogeneous genetic data.
In the complex.bgen
file we have variants with different ploidy and number of
alleles, as well as phasedness.
>>> bgen = read_bgen(filepath["complex.bgen"], verbose=False)
>>>
>>> # Note how the number of alleles very widely across loci.
>>> print(bgen["variants"].compute())
id rsid chrom pos nalleles allele_ids vaddr
0 V1 01 1 2 A,G 98
1 V2.1 V2 01 2 2 A,G 175
2 V3 01 3 2 A,G 232
3 M4 01 4 3 A,G,T 305
.. ... ... ... ... ... ... ...
6 M7 01 7 6 A,G,GT,GTT,GTTT,GTTTT 557
7 M8 01 8 7 A,G,GT,GTT,GTTT,GTTTT,GTTTTT 663
8 M9 01 9 8 A,G,GT,GTT,GTTT,GTTTT,GTTTTT,GTTTTTT 783
9 M10 01 10 2 A,G 863
[10 rows x 7 columns]
>>> print(bgen["samples"])
0 sample_0
1 sample_1
2 sample_2
3 sample_3
Name: id, dtype: object
>>> # Print the estimated probabilities for the first variant
>>> # and second individual.
>>> geno = bgen["genotype"][0].compute()
>>> print(geno["probs"][1])
[1. 0. 0.]
>>> # The 9th variant for the 4th individual has ploidy
>>> geno = bgen["genotype"][8].compute()
>>> ploidy = geno["ploidy"][3]
>>> print(ploidy)
2
>>> # and number of alleles equal to
>>> nalleles = bgen["variants"].loc[8, "nalleles"].compute().values[0]
>>> print(nalleles)
8
>>> # Its probability distribution is given by the array
>>> p = geno["probs"][3]
>>> print(p)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> # Since the 9th variant for the 4th individual is unphased,
>>> print(geno["phased"])
0
>>> # we can pick an alternative allele and compute the dosage
>>> # from allele expectation.
>>> # If we select the third allele as being the alternative one, we have
>>> from bgen_reader import allele_expectation, compute_dosage
>>> e = allele_expectation(bgen, 8)
>>> print(compute_dosage(e, 2))
[0. 0. 0. 1.]
Please, refer to Dosage section for further details.