API
- class bgen_reader._bgen2.open_bgen(filepath: str | Path, samples_filepath: str | Path | None = None, metadata_filepath: str | Path | None = None, allow_complex: bool = False, verbose: bool = True)[source]
A NumPy-inspired class for fast opening and reading of BGEN files.
- Parameters:
filepath – BGEN file path.
samples_filepath – Path to a sample format file, or None to read samples from the BGEN file itself. Defaults to None.
metadata_filepath – Where to put the constructed metadata file. By default, it is placed in the same directory as the BGEN file and given the same name, but with the extension .metadata2.mmm. Use this option, for example, when the BGEN file’s directory is read-only.
allow_complex – False (default) to assume homogeneous data; True to allow complex data. The BGEN format allows every variant to differ in whether it is phased, in its allele count, and in its maximum ploidy. For files where these values may actually vary, set allow_complex to True.
verbose – True (default) to show progress; False otherwise.
- Returns:
an open_bgen object
- Return type:
open_bgen
The first time a file is opened, open_bgen creates a .metadata2.mmm file, a process that takes seconds to hours, depending on the size of the file and the allow_complex setting. Subsequent openings take just a fraction of a second. Changing samples_filepath or allow_complex results in a new default .metadata2.mmm file with a slightly different name.
Examples
With the with statement, list samples and variant ids, then read() the whole file.
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.ids)
...     print(bgen.samples)
...     print(bgen.read())
['SNP1' 'SNP2' 'SNP3' 'SNP4']
['sample_0' 'sample_1' 'sample_2' 'sample_3']
[[[1. 0. 1. 0.]
  [0. 1. 1. 0.]
  [1. 0. 0. 1.]
  [0. 1. 0. 1.]]

 [[0. 1. 1. 0.]
  [1. 0. 0. 1.]
  [0. 1. 0. 1.]
  [1. 0. 1. 0.]]

 [[1. 0. 0. 1.]
  [0. 1. 0. 1.]
  [1. 0. 1. 0.]
  [0. 1. 1. 0.]]

 [[0. 1. 0. 1.]
  [1. 0. 1. 0.]
  [0. 1. 1. 0.]
  [1. 0. 0. 1.]]]
Open the file (without with) and read probabilities for one variant.
>>> bgen = open_bgen(file, verbose=False)
>>> print(bgen.read(2))
[[[1. 0. 0. 1.]]

 [[0. 1. 0. 1.]]

 [[1. 0. 1. 0.]]

 [[0. 1. 1. 0.]]]
>>> del bgen                 # close and delete object
Open the file and then first read for a slice of samples and variants, and then for a single sample and variant.
>>> bgen = open_bgen(file, verbose=False)
>>> print(bgen.read((slice(1,3),slice(2,4))))
[[[0. 1. 0. 1.]
  [1. 0. 1. 0.]]

 [[1. 0. 1. 0.]
  [0. 1. 1. 0.]]]
>>> print(bgen.read((0,1)))
[[[0. 1. 1. 0.]]]
>>> del bgen                 # close and delete object
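When the BGEN file sits in a read-only directory, the metadata_filepath parameter described above can point the metadata file somewhere writable. The following is a minimal sketch and not part of bgen_reader: the helper name writable_metadata_path, the use of the system temporary directory, and the exact file name it builds are assumptions made for illustration. Only the metadata_filepath parameter and the .metadata2.mmm extension come from the documentation above.

```python
import tempfile
from pathlib import Path


def writable_metadata_path(bgen_path, cache_dir=None):
    """Hypothetical helper: choose a metadata path in a writable directory."""
    bgen_path = Path(bgen_path)
    # Assumption: fall back to the system temporary directory.
    if cache_dir is None:
        cache_dir = tempfile.gettempdir()
    # Keep the BGEN file's name and append the documented metadata extension.
    return Path(cache_dir) / (bgen_path.name + ".metadata2.mmm")


metadata_path = writable_metadata_path("/readonly/data/example.bgen")
print(metadata_path.name)  # example.bgen.metadata2.mmm
```

The resulting path could then be passed along as open_bgen(file, metadata_filepath=metadata_path).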
- allele_expectation(index: Any | None = None, assume_constant_ploidy: bool = True) ndarray | Tuple[ndarray, ndarray] [source]
Allele expectation.
- Parameters:
index – An expression specifying the samples and variants of interest. (See Examples in read() for details.) Defaults to None, meaning compute for all samples and variants.
assume_constant_ploidy (bool) – When ploidy count can be assumed to be constant, calculations are much faster. Defaults to True.
- Returns:
always in this order:
Samples-by-variants-by-alleles matrix of allele expectations,
Samples-by-variants-by-alleles matrix of frequencies, if return_frequencies is True
- Return type:
one or two numpy.ndarray
Note
This method supports unphased genotypes only.
Examples
>>> from bgen_reader import example_filepath, open_bgen
>>> from texttable import Texttable
>>>
>>> filepath = example_filepath("example.32bits.bgen")
>>>
>>> # Read the example.
>>> bgen = open_bgen(filepath, verbose=False)
>>> sample_index = bgen.samples=="sample_005" # will be only 1 sample
>>> variant_index = bgen.rsids=="RSID_6"      # will be only 1 variant
>>> p = bgen.read((sample_index,variant_index))
>>> # Allele expectation makes sense for unphased genotypes only,
>>> # which is the case here.
>>> e = bgen.allele_expectation((sample_index,variant_index))
>>> alleles_per_variant = [allele_ids.split(',') for allele_ids in bgen.allele_ids[variant_index]]
>>>
>>> # Print what we have got in a nice format.
>>> table = Texttable()
>>> table = table.add_rows(
...     [
...         ["", "AA", "AG", "GG", "E[.]"],
...         ["p"] + list(p[0,0,:]) + ["na"],
...         ["#" + alleles_per_variant[0][0], 2, 1, 0, e[0,0,0]],
...         ["#" + alleles_per_variant[0][1], 0, 1, 2, e[0,0,1]],
...     ]
... )
>>> print(table.draw())
+----+-------+-------+-------+-------+
|    |  AA   |  AG   |  GG   | E[.]  |
+====+=======+=======+=======+=======+
| p  | 0.012 | 0.987 | 0.001 | na    |
+----+-------+-------+-------+-------+
| #A | 2     | 1     | 0     | 1.011 |
+----+-------+-------+-------+-------+
| #G | 0     | 1     | 2     | 0.989 |
+----+-------+-------+-------+-------+
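The E[.] column in the table above can be reproduced by hand. The sketch below is plain NumPy arithmetic on the rounded probabilities shown in the table. It does not call bgen_reader, and the variable names are chosen here for illustration.

```python
import numpy as np

# Genotype probabilities for one sample at one variant, in the order AA, AG, GG
# (rounded values copied from the table above).
p = np.array([0.012, 0.987, 0.001])

# Rows: alleles A and G. Columns: copies of that allele in AA, AG, GG.
allele_counts = np.array([
    [2, 1, 0],  # copies of A
    [0, 1, 2],  # copies of G
])

# Expected copies of each allele: sum over genotypes of count * probability.
expectation = allele_counts @ p
print(expectation.round(3))  # [1.011 0.989]
```

For a diploid sample the two expectations sum to 2, the ploidy.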
The allele frequency can then be computed from the allele expectation with allele_frequency().
>>> from bgen_reader import open_bgen, example_filepath
>>>
>>> filepath = example_filepath("example.32bits.bgen")
>>> bgen = open_bgen(filepath, verbose=False)
>>>
>>> variant_index = (bgen.rsids=="RSID_6")      # will be only 1 variant
>>> e = bgen.allele_expectation(variant_index)
>>> f = bgen.allele_frequency(e)
>>> alleles_per_variant = [allele_ids.split(',') for allele_ids in bgen.allele_ids[variant_index]]
>>> print(alleles_per_variant[0][0] + ": {}".format(f[0,0]))
A: 229.23103218810434
>>> print(alleles_per_variant[0][1] + ": {}".format(f[0,1]))
G: 270.7689678118956
>>> print(bgen.ids[variant_index][0],bgen.rsids[variant_index][0])
SNPID_6 RSID_6
To find dosage, just select the column of interest from the expectation.
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> filepath = example_filepath("example.32bits.bgen")
>>>
>>> # Read the example.
>>> bgen = open_bgen(filepath, verbose=False)
>>>
>>> # Extract the allele expectations of the fourth variant.
>>> variant_index = 3
>>> e = bgen.allele_expectation(variant_index)
>>>
>>> # Compute the dosage when considering the allele
>>> # in position 1 as the reference/alternative one.
>>> alt_allele_index = 1
>>> dosage = e[...,alt_allele_index]
>>>
>>> # Print the dosage for only the first five samples
>>> # and the one (and only) variant
>>> print(dosage[:5,0])
[1.96185308 0.00982666 0.01745552 1.00347899 1.01153563]
>>> del bgen
>>>
>>> import numpy as np
>>> import pandas as pd
>>> from bgen_reader import example_filepath, open_bgen
>>> filepath = example_filepath("example.32bits.bgen")
>>> bgen = open_bgen(filepath, verbose=False)
>>>
>>> variant_index = [3]
>>> # Print the metadata of the fourth variant.
>>> print(bgen.ids[variant_index],bgen.rsids[variant_index])
['SNPID_5'] ['RSID_5']
>>> probs, missing, ploidy = bgen.read(variant_index,return_missings=True,return_ploidies=True)
>>> print(np.unique(missing),np.unique(ploidy))
[False] [2]
>>> df1 = pd.DataFrame({'sample':bgen.samples,'0':probs[:,0,0],'1':probs[:,0,1],'2':probs[:,0,2]})
>>> print(df1)
         sample        0        1        2
0    sample_001  0.00488  0.02838  0.96674
1    sample_002  0.99045  0.00928  0.00027
2    sample_003  0.98932  0.00391  0.00677
3    sample_004  0.00662  0.98328  0.01010
..          ...      ...      ...      ...
496  sample_497  0.00137  0.01312  0.98550
497  sample_498  0.00552  0.99423  0.00024
498  sample_499  0.01266  0.01154  0.97580
499  sample_500  0.00021  0.98431  0.01547

[500 rows x 4 columns]
>>> alleles_per_variant = [allele_ids.split(',') for allele_ids in bgen.allele_ids[variant_index]]
>>> e = bgen.allele_expectation(variant_index)
>>> f = bgen.allele_frequency(e)
>>> df2 = pd.DataFrame({'sample':bgen.samples,alleles_per_variant[0][0]:e[:,0,0],alleles_per_variant[0][1]:e[:,0,1]})
>>> print(df2)
         sample        A        G
0    sample_001  0.03815  1.96185
1    sample_002  1.99017  0.00983
2    sample_003  1.98254  0.01746
3    sample_004  0.99652  1.00348
..          ...      ...      ...
496  sample_497  0.01587  1.98413
497  sample_498  1.00528  0.99472
498  sample_499  0.03687  1.96313
499  sample_500  0.98474  1.01526

[500 rows x 3 columns]
>>> df3 = pd.DataFrame({'allele':alleles_per_variant[0],bgen.rsids[variant_index][0]:f[0,:]})
>>> print(df3)
  allele     RSID_5
0      A  305.97218
1      G  194.02782
>>> alt_index = f[0,:].argmin()
>>> alt = alleles_per_variant[0][alt_index]
>>> dosage = e[:,0,alt_index]
>>> df4 = pd.DataFrame({'sample':bgen.samples,f"alt={alt}":dosage})
>>> # Dosages when considering G as the alternative allele.
>>> print(df4)
         sample    alt=G
0    sample_001  1.96185
1    sample_002  0.00983
2    sample_003  0.01746
3    sample_004  1.00348
..          ...      ...
496  sample_497  1.98413
497  sample_498  0.99472
498  sample_499  1.96313
499  sample_500  1.01526

[500 rows x 2 columns]
- static allele_frequency(allele_expectation: ndarray) ndarray [source]
Allele expectation frequency.
You have to provide the allele expectations, as returned by allele_expectation().
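In the examples under allele_expectation(), the two values returned for a variant sum to 500, the number of samples (229.231 + 270.769, and 305.972 + 194.028). That is consistent with summing each allele's expectation over samples and dividing by the ploidy of 2. The sketch below works under that assumption only. The function name allele_frequency_sketch and the synthetic data are illustrative, and it is not presented as the library's implementation.

```python
import numpy as np


def allele_frequency_sketch(expectation, ploidy=2):
    """Illustrative only: sum expectations over samples, divide by ploidy.

    `expectation` has shape (nsamples, nvariants, nalleles).
    """
    return expectation.sum(axis=0) / ploidy


# Synthetic expectations for 3 samples, 1 variant, 2 alleles (each row sums to 2).
e = np.array([
    [[2.0, 0.0]],
    [[1.0, 1.0]],
    [[0.5, 1.5]],
])
f = allele_frequency_sketch(e)
print(f)  # [[1.75 1.25]]
```

As in the documented output, the per-allele values for one variant add up to the number of samples (here 3).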
- property allele_ids: List[str]
The comma-delimited list of alleles for each variant (a numpy.ndarray of str).
Note
To access after file closes, make a copy.
Example
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.allele_ids)
['A,G' 'A,G' 'A,G' 'A,G']
>>> # To access after file closes, make a copy
>>> with open_bgen(file, verbose=False) as bgen:
...     allele_ids = bgen.allele_ids.copy()
>>> print(allele_ids)
['A,G' 'A,G' 'A,G' 'A,G']
- property chromosomes: List[str]
The chromosome of each variant (a numpy.ndarray of str).
Note
To access after file closes, make a copy.
Example
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.chromosomes)
['1' '1' '1' '1']
>>> # To access after file closes, make a copy
>>> with open_bgen(file, verbose=False) as bgen:
...     chromosomes = bgen.chromosomes.copy()
>>> print(chromosomes)
['1' '1' '1' '1']
- close()[source]
Close an open_bgen object that was opened for reading.
Notes
Better alternatives to close() include the with statement (closes the file automatically) and the del statement (which closes the file and deletes the object). Doing nothing, while not better, is usually fine.
>>> from bgen_reader import example_filepath, open_bgen
>>> file = example_filepath("haplotypes.bgen")
>>> bgen = open_bgen(file, verbose=False)
>>> print(bgen.read(2))
[[[1. 0. 0. 1.]]

 [[0. 1. 0. 1.]]

 [[1. 0. 1. 0.]]

 [[0. 1. 1. 0.]]]
>>> bgen.close()     #'del bgen' is better.
- property ids: List[str]
The variant identifiers (a numpy.ndarray of str).
Note
To access after file closes, make a copy.
Example
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.ids)
['SNP1' 'SNP2' 'SNP3' 'SNP4']
>>> # To access after file closes, make a copy
>>> with open_bgen(file, verbose=False) as bgen:
...     ids = bgen.ids.copy()
>>> print(ids)
['SNP1' 'SNP2' 'SNP3' 'SNP4']
- property max_combinations: int
The maximum number of values in any variant’s probability distribution (int).
For unphased, diploid, biallelic data, it will be 3. For phased, diploid, biallelic data, it will be 4. In general, it is the maximum value in ncombinations.
Example
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.max_combinations)
4
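The values 3 and 4 quoted above follow from a simple counting rule. The sketch below assumes the usual counting for BGEN probability data: one probability per allele per haplotype when phased, and one per unordered genotype when unphased. The function name count_combinations is hypothetical and is not part of bgen_reader.

```python
from math import comb


def count_combinations(nalleles, ploidy, phased):
    """Illustrative count of probabilities per sample for one variant."""
    if phased:
        # One probability per allele for each haplotype.
        return ploidy * nalleles
    # Unphased: number of multisets of size `ploidy` drawn from `nalleles` alleles.
    return comb(nalleles + ploidy - 1, nalleles - 1)


print(count_combinations(2, 2, phased=False))  # 3
print(count_combinations(2, 2, phased=True))   # 4
```

Under the same rule, an unphased, diploid, triallelic variant would need 6 values.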
- property nalleles: List[int]
The number of alleles for each variant (a numpy.ndarray of int).
Note
To access after file closes, make a copy.
Example
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.nalleles)
[2 2 2 2]
>>> # To access after file closes, make a copy
>>> with open_bgen(file, verbose=False) as bgen:
...     nalleles = bgen.nalleles.copy()
>>> print(nalleles)
[2 2 2 2]
- property ncombinations: List[int]
The number of values needed for each variant’s probability distribution (a numpy.ndarray of int).
Note
To access after file closes, make a copy.
Example
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.ncombinations)
[4 4 4 4]
>>> # To access after file closes, make a copy
>>> with open_bgen(file, verbose=False) as bgen:
...     ncombinations = bgen.ncombinations.copy()
>>> print(ncombinations)
[4 4 4 4]
- property nsamples: int
The number of samples in the data (int).
Example
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.nsamples)
4
- property nvariants: int
The number of variants in the data (int).
Example
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.nvariants)
4
- property phased: List[bool]
For each variant, True if and only if the variant is phased (a numpy.ndarray of bool).
Note
To access after file closes, make a copy.
Example
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.phased)
[ True  True  True  True]
>>> # To access after file closes, make a copy
>>> with open_bgen(file, verbose=False) as bgen:
...     phased = bgen.phased.copy()
>>> print(phased)
[ True  True  True  True]
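For phased, diploid, biallelic data such as haplotypes.bgen, each sample has four probabilities per variant. A minimal sketch of how they might be regrouped per haplotype follows. It assumes the four values are ordered as the allele probabilities for the first haplotype followed by those for the second. That ordering is an assumption, not something stated on this page. The array is copied from the read(2) output shown earlier, and no bgen_reader call is made.

```python
import numpy as np

# Probabilities for the variant at index 2 of haplotypes.bgen,
# copied from the read(2) output shown earlier.
probs = np.array([
    [[1.0, 0.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0, 1.0]],
    [[1.0, 0.0, 1.0, 0.0]],
    [[0.0, 1.0, 1.0, 0.0]],
])

nsamples, nvariants, _ = probs.shape
ploidy, nalleles = 2, 2  # assumption: diploid, biallelic

# Assumed layout: allele probabilities for haplotype 1, then for haplotype 2.
per_haplotype = probs.reshape(nsamples, nvariants, ploidy, nalleles)

# Expected copies of each allele per sample: sum over the haplotype axis.
allele_copies = per_haplotype.sum(axis=2)
print(allele_copies[:, 0, :])
```

Under this assumed layout, the first and last samples carry one copy of each allele, while the second and third are homozygous.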
- property positions: List[int]
The genetic position of each variant (a numpy.ndarray of int).
Note
To access after file closes, make a copy.
Example
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.positions)
[1 2 3 4]
>>> # To access after file closes, make a copy
>>> with open_bgen(file, verbose=False) as bgen:
...     positions = bgen.positions.copy()
>>> print(positions)
[1 2 3 4]
- read(index: Any | None = None, dtype: type | str | None = <class 'numpy.float64'>, order: str | None = 'F', max_combinations: int | None = None, return_probabilities: bool | None = True, return_missings: bool | None = False, return_ploidies: bool | None = False, num_threads: int | None = None) None | ndarray | Tuple[ndarray, ndarray] | Tuple[ndarray, ndarray, ndarray] [source]
Read genotype information from an open_bgen object.
- Parameters:
index – An expression specifying the samples and variants to read. (See Examples, below.) Defaults to None, meaning read all.
dtype (data-type) – The desired data-type for the returned probability array. Defaults to numpy.float64. Use numpy.float32 or numpy.float16, when appropriate, to save 50% or 75% of memory. (See Notes, below.)
order ({'F','C'}) – The desired memory layout for the returned probability array. Defaults to F (Fortran order, which is variant-major).
max_combinations (int or None) – The number of values to allocate for each probability distribution. Defaults to a number just large enough for any data in the file. For unphased, diploid, biallelic data, it will default to 3. For phased, diploid, biallelic data, it will default to 4. Any overallocated space is filled with numpy.nan.
return_probabilities (bool) – Read and return the probabilities for samples and variants specified. Defaults to True.
return_missings (bool) – Return a boolean array telling which probabilities are missing. Defaults to False.
return_ploidies (bool) – Read and return the ploidy for the samples and variants specified. Defaults to False.
num_threads (int or None) – The number of threads with which to read data. Defaults to all available processors or the number of variants being read, whichever is less. Can also be set with the ‘MKL_NUM_THREADS’ environment variable.
- Returns:
always in this order:
a numpy.ndarray of probabilities with dtype and shape (nsamples_out,nvariants_out,max_combinations), if return_probabilities is True (the default). Missing data is filled with numpy.nan.
a numpy.ndarray of bool of shape (nsamples_out,nvariants_out), if return_missings is True
a numpy.ndarray of int of shape (nsamples_out,nvariants_out), if return_ploidies is True
- Return type:
zero to three numpy.ndarray
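The effect of the order parameter on memory layout can be seen with plain NumPy arrays. The sketch below uses a small illustrative shape and does not call bgen_reader. It only shows what Fortran order and C order mean for an array shaped (samples, variants, combinations).

```python
import numpy as np

nsamples, nvariants, ncombinations = 4, 3, 2  # small illustrative shape
shape = (nsamples, nvariants, ncombinations)

f_order = np.zeros(shape, dtype=np.float64, order="F")
c_order = np.zeros(shape, dtype=np.float64, order="C")

# Strides are the byte steps along (sample, variant, combination).
print(f_order.strides)  # (8, 32, 96)
print(c_order.strides)  # (48, 16, 8)

# In Fortran order, all samples for one variant and combination are adjacent.
print(f_order[:, 1, 0].flags["C_CONTIGUOUS"])  # True
print(c_order[:, 1, 0].flags["C_CONTIGUOUS"])  # False
```

Fortran order therefore suits computations that sweep across samples one variant at a time, while C order keeps each sample's values for a variant together.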
Notes
About dtype
If you know the compression level of your BGEN file, you can sometimes save 50% or 75% on memory with dtype. (Test with your data to confirm you are not losing any precision.) The approximate relationship is:
BGEN compression 1 to 10 bits: dtype='float16'
BGEN compression 11 to 23 bits: dtype='float32'
BGEN compression 24 to 32 bits: dtype='float64' (default)
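The memory savings and the precision caveat above can both be checked with NumPy alone. The sketch below assumes that a file compressed to b bits stores probabilities as evenly spaced levels k / (2**b - 1). The helper name distinct_after_cast is hypothetical, and the array shape is borrowed from the example.bgen outputs on this page.

```python
import numpy as np

shape = (500, 199, 3)  # samples x variants x combinations

# Bytes needed for the probability array at each dtype.
sizes = {
    np.dtype(dtype).name: np.zeros(shape, dtype=dtype).nbytes
    for dtype in (np.float64, np.float32, np.float16)
}
print(sizes)


def distinct_after_cast(bits, dtype):
    """Count how many of the 2**bits evenly spaced levels stay distinct."""
    levels = np.arange(2**bits, dtype=np.float64) / (2**bits - 1)
    return np.unique(levels.astype(dtype)).size


print(distinct_after_cast(8, np.float16) == 2**8)    # True: no levels collapse
print(distinct_after_cast(16, np.float16) == 2**16)  # False: float16 is too coarse
```

A check of this kind on real data is one way to confirm that a smaller dtype is not discarding information.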
Examples
Index Examples
To read all data in a BGEN file, set index to None. This is the default.
>>> import numpy as np
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> with open_bgen(example_filepath("haplotypes.bgen"), verbose=False) as bgen_h:
...     print(bgen_h.read()) # read all
[[[1. 0. 1. 0.]
  [0. 1. 1. 0.]
  [1. 0. 0. 1.]
  [0. 1. 0. 1.]]

 [[0. 1. 1. 0.]
  [1. 0. 0. 1.]
  [0. 1. 0. 1.]
  [1. 0. 1. 0.]]

 [[1. 0. 0. 1.]
  [0. 1. 0. 1.]
  [1. 0. 1. 0.]
  [0. 1. 1. 0.]]

 [[0. 1. 0. 1.]
  [1. 0. 1. 0.]
  [0. 1. 1. 0.]
  [1. 0. 0. 1.]]]
To read selected variants, set index to an int, a list of int, a slice, or a list of bool. Negative integers count from the end of the data.
>>> bgen_e = open_bgen(example_filepath("example.bgen"), verbose=False)
>>> probs = bgen_e.read(5)  # read the variant indexed by 5.
>>> print(probs.shape)      # print the dimensions of the returned numpy array.
(500, 1, 3)
>>> probs = bgen_e.read([5,6,1])  # read the variants indexed by 5, 6, and 1
>>> print(probs.shape)
(500, 3, 3)
>>> probs = bgen_e.read(slice(5)) # read the first 5 variants
>>> print(probs.shape)
(500, 5, 3)
>>> probs = bgen_e.read(slice(2,5)) # read variants from 2 (inclusive) to 5 (exclusive)
>>> print(probs.shape)
(500, 3, 3)
>>> probs = bgen_e.read(slice(2,None)) # read variants starting at index 2.
>>> print(probs.shape)
(500, 197, 3)
>>> probs = bgen_e.read(slice(None,None,10)) # read every 10th variant
>>> print(probs.shape)
(500, 20, 3)
>>> print(np.unique(bgen_e.chromosomes)) # print unique chrom values
['01']
>>> probs = bgen_e.read(bgen_e.chromosomes=='01') # read all variants in chrom 1
>>> print(probs.shape)
(500, 199, 3)
>>> probs = bgen_e.read(-1) # read the last variant
>>> print(probs.shape)
(500, 1, 3)
To read selected samples, set index to a tuple of the form (sample_index,None), where sample_index follows the form of variant_index, above.
>>> probs = bgen_e.read((0,None)) # Read 1st sample (across all variants)
>>> print(probs.shape)
(1, 199, 3)
>>> probs = bgen_e.read((slice(None,None,10),None)) # Read every 10th sample
>>> print(probs.shape)
(50, 199, 3)
To read selected samples and selected variants, set index to a tuple of the form (sample_index,variant_index), where sample_index and variant_index follow the forms above.
>>> # Read samples 10 (inclusive) to 20 (exclusive) and the first 15 variants.
>>> probs = bgen_e.read((slice(10,20),slice(15)))
>>> print(probs.shape)
(10, 15, 3)
>>> # Read the last and 2nd-to-last sample and the last variant
>>> probs = bgen_e.read(([-1,-2],-1))
>>> print(probs.shape)
(2, 1, 3)
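The index forms above can be mimicked on an ordinary NumPy array, which may help when reasoning about the shape a read will produce. This is a sketch only: resolve_index and read_sketch are hypothetical helpers written for this illustration, and the zero-filled array merely stands in for the probabilities in example.bgen.

```python
import numpy as np


def resolve_index(index, nsamples, nvariants):
    """Hypothetical helper mimicking the index forms described above."""
    if not isinstance(index, tuple):
        # A non-tuple index selects variants only; all samples are kept.
        index = (None, index)
    sample_index, variant_index = index

    def to_positions(part, length):
        if part is None:
            part = slice(None)
        # NumPy indexing handles ints, lists, slices, and boolean masks alike.
        return np.atleast_1d(np.arange(length)[part])

    return to_positions(sample_index, nsamples), to_positions(variant_index, nvariants)


def read_sketch(probs, index=None):
    """Select samples and variants from an in-memory array, keeping 3 dimensions."""
    samples, variants = resolve_index(index, probs.shape[0], probs.shape[1])
    return probs[np.ix_(samples, variants)]


probs = np.zeros((500, 199, 3))  # stand-in for the data in example.bgen
print(read_sketch(probs, 5).shape)                           # (500, 1, 3)
print(read_sketch(probs, (slice(10, 20), slice(15))).shape)  # (10, 15, 3)
print(read_sketch(probs, ([-1, -2], -1)).shape)              # (2, 1, 3)
```

A single integer is promoted to a one-element selection, which is why the result always keeps three dimensions.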
Multiple Return Example
Read probabilities, missingness, and ploidy. Print all unique ploidy values.
>>> probs,missing,ploidy = bgen_e.read(return_missings=True,return_ploidies=True)
>>> print(np.unique(ploidy))
[2]
- property rsids: List[str]
The variant RS numbers (a numpy.ndarray of str).
Note
To access after file closes, make a copy.
Example
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.rsids)
['RS1' 'RS2' 'RS3' 'RS4']
>>> # To access after file closes, make a copy
>>> with open_bgen(file, verbose=False) as bgen:
...     rsids = bgen.rsids.copy()
>>> print(rsids)
['RS1' 'RS2' 'RS3' 'RS4']
- property samples: List[str]
The sample identifiers (a numpy.ndarray of str).
Note
To access after file closes, make a copy.
Example
>>> from bgen_reader import example_filepath, open_bgen
>>>
>>> file = example_filepath("haplotypes.bgen")
>>> with open_bgen(file, verbose=False) as bgen:
...     print(bgen.samples)
['sample_0' 'sample_1' 'sample_2' 'sample_3']
>>> # To access after file closes, make a copy
>>> with open_bgen(file, verbose=False) as bgen:
...     samples = bgen.samples.copy()
>>> print(samples)
['sample_0' 'sample_1' 'sample_2' 'sample_3']