Dosage

bgen_reader.compute_dosage(expec, alt=None)[source]

Compute dosage from allele expectation.

Parameters
  • expec (array_like) – Allele expectations encoded as a samples-by-alleles matrix.

  • alt (array_like, optional) – Alternative allele index. If None, the allele having the minor allele frequency for the provided expec is used as the alternative. Defaults to None.

Returns

Dosage encoded as an array of size equal to the number of samples.

Return type

numpy.ndarray

Examples

First a quick-start example.
>>> from bgen_reader import allele_expectation, compute_dosage
>>> from bgen_reader import example_filepath, read_bgen
>>>
>>> filepath = example_filepath("example.32bits.bgen")
>>>
>>> # Read the example.
>>> bgen = read_bgen(filepath, verbose=False)
>>>
>>> # Extract the allele expectations of the fourth variant.
>>> variant_idx = 3
>>> e = allele_expectation(bgen, variant_idx)
>>>
>>> # Compute the dosage when considering the first allele
>>> # as the reference/alternative one.
>>> alt_allele_idx = 1
>>> d = compute_dosage(e, alt=alt_allele_idx)
>>>
>>> # Print the dosage of the first five samples only.
>>> print(d[:5])
[1.96185308 0.00982666 0.01745552 1.00347899 1.01153563]
Genotype probabilities, allele expectations and frequencies.
>>> from bgen_reader import (
...     allele_expectation,
...     allele_frequency,
...     compute_dosage,
...     example_filepath,
...     read_bgen,
... )
>>> from pandas import DataFrame
>>> from xarray import DataArray
>>>
>>> # Download an example
>>> filepath = example_filepath("example.32bits.bgen")
>>>
>>> # Open the bgen file.
>>> bgen = read_bgen(filepath, verbose=False)
>>> variants = bgen["variants"]
>>> genotype = bgen["genotype"]
>>> samples = bgen["samples"]
>>>
>>> variant_idx = 3
>>> variant = variants.loc[variant_idx].compute()
>>> # Print the metadata of the fourth variant.
>>> print(variant)
        id    rsid chrom   pos  nalleles allele_ids  vaddr
3  SNPID_5  RSID_5    01  5000         2        A,G  16034

>>> geno = bgen["genotype"][variant_idx].compute()
>>> metageno = DataFrame({k: geno[k] for k in ["ploidy", "missing"]},
...                      index=samples)
>>> metageno.index.name = "sample"
>>> print(metageno)  
            ploidy  missing
sample
sample_001       2    False
sample_002       2    False
sample_003       2    False
sample_004       2    False
...            ...      ...
sample_497       2    False
sample_498       2    False
sample_499       2    False
sample_500       2    False

[500 rows x 2 columns]
>>> p = DataArray(
...     geno["probs"],
...     name="probability",
...     coords={"sample": samples},
...     dims=["sample", "genotype"],
... )
>>> # Print the genotype probabilities.
>>> print(p.to_series().unstack(level=-1))  
genotype          0        1        2
sample
sample_001  0.00488  0.02838  0.96674
sample_002  0.99045  0.00928  0.00027
sample_003  0.98932  0.00391  0.00677
sample_004  0.00662  0.98328  0.01010
...             ...      ...      ...
sample_497  0.00137  0.01312  0.98550
sample_498  0.00552  0.99423  0.00024
sample_499  0.01266  0.01154  0.97580
sample_500  0.00021  0.98431  0.01547

[500 rows x 3 columns]
>>> alleles = variant["allele_ids"].values[0].split(",")
>>> e = DataArray(
...     allele_expectation(bgen, variant_idx),
...     name="expectation",
...     coords={"sample": samples, "allele": alleles},
...     dims=["sample", "allele"],
... )
>>> # Print the allele expectations.
>>> print(e.to_series().unstack(level=-1))  
allele            A        G
sample
sample_001  0.03815  1.96185
sample_002  1.99017  0.00983
sample_003  1.98254  0.01746
sample_004  0.99652  1.00348
...             ...      ...
sample_497  0.01587  1.98413
sample_498  1.00528  0.99472
sample_499  0.03687  1.96313
sample_500  0.98474  1.01526

[500 rows x 2 columns]
>>> rsid = variant["rsid"].values[0]
>>> chrom = variant["chrom"].values[0]
>>> variant_name = f"{chrom}:{rsid}"
>>> f = DataFrame(allele_frequency(e), columns=[variant_name], index=alleles)
>>> f.index.name = "allele"
>>> # Allele frequencies.
>>> print(f) 
        01:RSID_5
allele
A       305.97218
G       194.02782
>>> alt = f.idxmin().values[0]
>>> alt_idx = alleles.index(alt)
>>> d = compute_dosage(e, alt=alt_idx).to_series()
>>> d = DataFrame(d.values, columns=[f"alt={alt}"], index=d.index)
>>> # Dosages when considering G as the alternative allele.
>>> print(d) 
              alt=G
sample
sample_001  1.96185
sample_002  0.00983
sample_003  0.01746
sample_004  1.00348
...             ...
sample_497  1.98413
sample_498  0.99472
sample_499  1.96313
sample_500  1.01526

[500 rows x 1 columns]