# Genomic Interval Operations ## Overview polars-bio provides 8 core operations for genomic interval arithmetic. All operations work on Polars DataFrames or LazyFrames containing genomic intervals (columns: `chrom`, `start`, `end` by default) and return a **LazyFrame** by default. Pass `output_type="polars.DataFrame"` for eager results. ## Operations Summary | Operation | Inputs | Description | |-----------|--------|-------------| | `overlap` | two DataFrames | Find pairs of overlapping intervals | | `count_overlaps` | two DataFrames | Count overlaps per interval in the first set | | `nearest` | two DataFrames | Find nearest intervals between two sets | | `merge` | one DataFrame | Merge overlapping/bookended intervals | | `cluster` | one DataFrame | Assign cluster IDs to overlapping intervals | | `coverage` | two DataFrames | Compute per-interval coverage counts | | `complement` | one DataFrame + genome | Find gaps between intervals | | `subtract` | two DataFrames | Remove overlapping portions | ## overlap Find pairs of overlapping intervals between two DataFrames. ### Functional API ```python import polars as pl import polars_bio as pb df1 = pl.DataFrame({ "chrom": ["chr1", "chr1", "chr1"], "start": [1, 5, 22], "end": [6, 9, 30], }) df2 = pl.DataFrame({ "chrom": ["chr1", "chr1"], "start": [3, 25], "end": [8, 28], }) # Returns LazyFrame by default result_lf = pb.overlap(df1, df2, suffixes=("_1", "_2")) result_df = result_lf.collect() # Or get DataFrame directly result_df = pb.overlap(df1, df2, suffixes=("_1", "_2"), output_type="polars.DataFrame") ``` ### Method-Chaining API (LazyFrame only) ```python result = df1.lazy().pb.overlap(df2, suffixes=("_1", "_2")).collect() ``` ### Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `df1` | DataFrame/LazyFrame/str | required | First (probe) interval set | | `df2` | DataFrame/LazyFrame/str | required | Second (build) interval set | | `suffixes` | tuple[str, str] | `("_1", "_2")` | Suffixes for overlapping column names | | `on_cols` | list[str] | `None` | Additional columns to join on (beyond genomic coords) | | `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 | | `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 | | `algorithm` | str | `"Coitrees"` | Interval algorithm | | `low_memory` | bool | `False` | Low memory mode | | `output_type` | str | `"polars.LazyFrame"` | Output format: `"polars.LazyFrame"`, `"polars.DataFrame"`, `"pandas.DataFrame"` | | `projection_pushdown` | bool | `True` | Enable projection pushdown optimization | ### Output Schema Returns columns from both inputs with suffixes applied: - `chrom_1`, `start_1`, `end_1` (from df1) - `chrom_2`, `start_2`, `end_2` (from df2) - Any additional columns from df1 and df2 Column dtypes are `String` for chrom and `Int64` for start/end. ## count_overlaps Count the number of overlapping intervals from df2 for each interval in df1. ```python # Functional counts = pb.count_overlaps(df1, df2) # Method-chaining (LazyFrame) counts = df1.lazy().pb.count_overlaps(df2) ``` ### Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `df1` | DataFrame/LazyFrame/str | required | Query interval set | | `df2` | DataFrame/LazyFrame/str | required | Target interval set | | `suffixes` | tuple[str, str] | `("", "_")` | Suffixes for column names | | `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 | | `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 | | `on_cols` | list[str] | `None` | Additional join columns | | `output_type` | str | `"polars.LazyFrame"` | Output format | | `naive_query` | bool | `True` | Use naive query strategy | | `projection_pushdown` | bool | `True` | Enable projection pushdown | ### Output Schema Returns df1 columns with an additional `count` column (Int64). ## nearest Find the nearest interval in df2 for each interval in df1. ```python # Find nearest (default: k=1, any direction) nearest = pb.nearest(df1, df2, output_type="polars.DataFrame") # Find k nearest nearest = pb.nearest(df1, df2, k=3) # Exclude overlapping intervals from results nearest = pb.nearest(df1, df2, overlap=False) # Without distance column nearest = pb.nearest(df1, df2, distance=False) ``` ### Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `df1` | DataFrame/LazyFrame/str | required | Query interval set | | `df2` | DataFrame/LazyFrame/str | required | Target interval set | | `suffixes` | tuple[str, str] | `("_1", "_2")` | Suffixes for column names | | `on_cols` | list[str] | `None` | Additional join columns | | `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 | | `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 | | `k` | int | `1` | Number of nearest neighbors to find | | `overlap` | bool | `True` | Include overlapping intervals in results | | `distance` | bool | `True` | Include distance column in output | | `output_type` | str | `"polars.LazyFrame"` | Output format | | `projection_pushdown` | bool | `True` | Enable projection pushdown | ### Output Schema Returns columns from both DataFrames (with suffixes) plus a `distance` column (Int64) with the distance to the nearest interval (0 if overlapping). Distance column is omitted if `distance=False`. ## merge Merge overlapping and bookended intervals within a single DataFrame. ```python import polars as pl import polars_bio as pb df = pl.DataFrame({ "chrom": ["chr1", "chr1", "chr1", "chr2"], "start": [1, 4, 20, 1], "end": [6, 9, 30, 10], }) # Functional merged = pb.merge(df, output_type="polars.DataFrame") # Method-chaining (LazyFrame) merged = df.lazy().pb.merge().collect() # Merge intervals within a minimum distance merged = pb.merge(df, min_dist=10) ``` ### Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `df` | DataFrame/LazyFrame/str | required | Interval set to merge | | `min_dist` | int | `0` | Minimum distance between intervals to merge (0 = must overlap or be bookended) | | `cols` | list[str] | `["chrom", "start", "end"]` | Column names | | `on_cols` | list[str] | `None` | Additional grouping columns | | `output_type` | str | `"polars.LazyFrame"` | Output format | | `projection_pushdown` | bool | `True` | Enable projection pushdown | ### Output Schema | Column | Type | Description | |--------|------|-------------| | `chrom` | String | Chromosome | | `start` | Int64 | Merged interval start | | `end` | Int64 | Merged interval end | | `n_intervals` | Int64 | Number of intervals merged | ## cluster Assign cluster IDs to overlapping intervals. Intervals that overlap are assigned the same cluster ID. ```python # Functional clustered = pb.cluster(df, output_type="polars.DataFrame") # Method-chaining (LazyFrame) clustered = df.lazy().pb.cluster().collect() # With minimum distance clustered = pb.cluster(df, min_dist=5) ``` ### Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `df` | DataFrame/LazyFrame/str | required | Interval set | | `min_dist` | int | `0` | Minimum distance for clustering | | `cols` | list[str] | `["chrom", "start", "end"]` | Column names | | `output_type` | str | `"polars.LazyFrame"` | Output format | | `projection_pushdown` | bool | `True` | Enable projection pushdown | ### Output Schema Returns the original columns plus: | Column | Type | Description | |--------|------|-------------| | `cluster` | Int64 | Cluster ID (intervals in the same cluster overlap) | | `cluster_start` | Int64 | Start of the cluster extent | | `cluster_end` | Int64 | End of the cluster extent | ## coverage Compute per-interval coverage counts. This is a **two-input** operation: for each interval in df1, count the coverage from df2. ```python # Functional cov = pb.coverage(df1, df2, output_type="polars.DataFrame") # Method-chaining (LazyFrame) cov = df1.lazy().pb.coverage(df2).collect() ``` ### Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `df1` | DataFrame/LazyFrame/str | required | Query intervals | | `df2` | DataFrame/LazyFrame/str | required | Coverage source intervals | | `suffixes` | tuple[str, str] | `("_1", "_2")` | Suffixes for column names | | `on_cols` | list[str] | `None` | Additional join columns | | `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 | | `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 | | `output_type` | str | `"polars.LazyFrame"` | Output format | | `projection_pushdown` | bool | `True` | Enable projection pushdown | ### Output Schema Returns columns from df1 plus a `coverage` column (Int64). ## complement Find gaps between intervals within a genome. Requires a genome definition specifying chromosome sizes. ```python import polars as pl import polars_bio as pb df = pl.DataFrame({ "chrom": ["chr1", "chr1"], "start": [100, 500], "end": [200, 600], }) genome = pl.DataFrame({ "chrom": ["chr1"], "start": [0], "end": [1000], }) # Functional gaps = pb.complement(df, view_df=genome, output_type="polars.DataFrame") # Method-chaining (LazyFrame) gaps = df.lazy().pb.complement(genome).collect() ``` ### Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `df` | DataFrame/LazyFrame/str | required | Interval set | | `view_df` | DataFrame/LazyFrame | `None` | Genome with chrom, start, end defining chromosome extents | | `cols` | list[str] | `["chrom", "start", "end"]` | Column names in df | | `view_cols` | list[str] | `None` | Column names in view_df | | `output_type` | str | `"polars.LazyFrame"` | Output format | | `projection_pushdown` | bool | `True` | Enable projection pushdown | ### Output Schema Returns a DataFrame with `chrom` (String), `start` (Int64), `end` (Int64) columns representing gaps between intervals. ## subtract Remove portions of intervals in df1 that overlap with intervals in df2. ```python # Functional result = pb.subtract(df1, df2, output_type="polars.DataFrame") # Method-chaining (LazyFrame) result = df1.lazy().pb.subtract(df2).collect() ``` ### Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `df1` | DataFrame/LazyFrame/str | required | Intervals to subtract from | | `df2` | DataFrame/LazyFrame/str | required | Intervals to subtract | | `cols1` | list[str] | `["chrom", "start", "end"]` | Column names in df1 | | `cols2` | list[str] | `["chrom", "start", "end"]` | Column names in df2 | | `output_type` | str | `"polars.LazyFrame"` | Output format | | `projection_pushdown` | bool | `True` | Enable projection pushdown | ### Output Schema Returns `chrom` (String), `start` (Int64), `end` (Int64) representing the remaining portions of df1 intervals after subtraction. ## Performance Considerations ### Probe-Build Architecture Two-input operations (`overlap`, `nearest`, `count_overlaps`, `coverage`, `subtract`) use a probe-build join: - **Probe** (first DataFrame): Iterated over, row by row - **Build** (second DataFrame): Indexed into an interval tree for fast lookup For best performance, pass the **larger** DataFrame as the probe (first argument) and the **smaller** one as the build (second argument). ### Parallelism By default, polars-bio uses a single execution partition. For large datasets, enable parallel execution: ```python import os import polars_bio as pb pb.set_option("datafusion.execution.target_partitions", os.cpu_count()) ``` ### Streaming Execution DataFusion streaming is enabled by default for interval operations. Data is processed in batches, enabling out-of-core computation for datasets larger than available RAM. ### When to Use Lazy Evaluation Use `scan_*` functions and lazy DataFrames for: - Files larger than available RAM - When only a subset of results is needed - Pipeline operations where intermediate results can be optimized away ```python # Lazy pipeline lf1 = pb.scan_bed("large1.bed") lf2 = pb.scan_bed("large2.bed") result = pb.overlap(lf1, lf2).collect() ```