OxiBLAS

Pure Rust BLAS/LAPACK implementation for the SciRS2 ecosystem


OxiBLAS is a production-grade, pure Rust implementation of BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage), designed as the foundational linear algebra library for the SciRS2 scientific computing ecosystem.

Features

  • Pure Rust - No C dependencies, fully portable, safe by default
  • SIMD Optimized - Custom SIMD layer using core::arch intrinsics
    • x86_64: AVX2/FMA (256-bit), AVX512F (512-bit)
    • AArch64: NEON (128-bit)
    • Fallback: Scalar operations
  • Cache Aware - BLIS-style blocked algorithms for optimal cache usage
  • Parallel - Optional rayon-based parallelization for multi-core systems
  • Complete BLAS - Full Level 1, 2, 3 operations (including packed/banded)
  • Extensive LAPACK - LU, Cholesky, QR, SVD, EVD, Schur, Hessenberg
  • Extended Precision - f16, f128 (quad precision), Kahan/pairwise summation
  • Tensor Operations - Einstein summation (24 patterns), batched matmul, batched BLAS
  • Sparse Support - 9 formats (CSR, CSC, COO, ELL, DIA, BSR, BSC, HYB, SELL) with advanced solvers
  • Advanced Preconditioners - AMG, SPAI, AINV, Schwarz, polynomial
  • Recursive Factorizations - Cache-oblivious Cholesky, LU, QR (compute_recursive())
  • Parallel Factorizations - Parallel blocked Cholesky and LU (compute_blocked_par())
  • Batched BLAS - gemm_batched, gemm_strided_batched, axpy_batched, gemv_batched
  • Mixed-Precision Refinement - f32 factorization + f64 residual for LU, Cholesky, QR
  • Runtime Auto-Tuning - RuntimeAutoTuner selects optimal block sizes at runtime
  • Multifrontal Sparse - MultifrontalCholesky, MultifrontalLU for large sparse systems
  • no_std Support - oxiblas-core and oxiblas-matrix support #![no_std] with alloc
  • Comprehensive Benchmarks - Direct performance comparison with OpenBLAS

Supported Types

| Type | Description | Feature Flag |
|------|-------------|--------------|
| f32 | Single precision floating point | (always) |
| f64 | Double precision floating point | (always) |
| f16 | Half precision (16-bit) floating point | f16 |
| f128 (QuadFloat) | Quad precision (~31 decimal digits) | f128 |
| Complex32 | Single precision complex | (always) |
| Complex64 | Double precision complex | (always) |

Quick Start

Add to your Cargo.toml:

[dependencies]
oxiblas = "0.2"

# With parallelization
oxiblas = { version = "0.2", features = ["parallel"] }

# With extended precision
oxiblas = { version = "0.2", features = ["f16", "f128"] }

# All features
oxiblas = { version = "0.2", features = ["full"] }

Basic BLAS Example

use oxiblas_blas::level3::gemm;
use oxiblas_matrix::Mat;

// Create matrices
let a = Mat::from_rows(&[
    &[1.0, 2.0, 3.0],
    &[4.0, 5.0, 6.0],
]);
let b = Mat::from_rows(&[
    &[7.0, 8.0],
    &[9.0, 10.0],
    &[11.0, 12.0],
]);
let mut c = Mat::zeros(2, 2);

// GEMM: C = A * B
gemm(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut());
// Result: [[58, 64], [139, 154]]

LU Decomposition

use oxiblas_lapack::lu::Lu;
use oxiblas_matrix::Mat;

let a = Mat::from_rows(&[
    &[2.0, 1.0],
    &[1.0, 3.0],
]);

let lu = Lu::compute(a.as_ref()).expect("Matrix is not singular");
let det = lu.determinant();
assert!((det - 5.0_f64).abs() < 1e-10); // det = 2*3 - 1*1 = 5

SVD

use oxiblas_lapack::svd::Svd;
use oxiblas_matrix::Mat;

let a = Mat::from_rows(&[
    &[3.0, 1.0],
    &[1.0, 3.0],
    &[1.0, 1.0],
]);

let svd = Svd::compute(a.as_ref()).expect("SVD failed");
let singular_values = svd.singular_values();  // [4.24, 2.00]
let u = svd.u();   // Left singular vectors
let vt = svd.vt(); // Right singular vectors (transposed)

Examples

OxiBLAS includes comprehensive examples demonstrating all major features:

# Basic BLAS operations (Level 1/2/3)
cargo run --example basic_blas

# LAPACK decompositions (LU, QR, Cholesky, SVD, EVD)
cargo run --example lapack_decompositions

# Extended precision (f128, Kahan, pairwise summation)
cargo run --example extended_precision --features f128

# Tensor operations & Einstein summation (24 patterns)
cargo run --example tensor_operations

# Sparse matrices (CSR, CSC, COO, iterative solvers, preconditioners)
cargo run --example sparse_matrices --features parallel

See crates/oxiblas/examples/ for complete source code.

Crate Structure

OxiBLAS is organized as a workspace with specialized sub-crates:

Crate Description
oxiblas Unified re-exports and prelude
oxiblas-core Core traits, SIMD abstractions, memory management, scalar types (f16, f128); no_std
oxiblas-matrix Matrix types (Mat, MatRef, MatMut, DiagRef); no_std
oxiblas-blas BLAS Level 1/2/3 operations, tensor operations, einsum, batched BLAS
oxiblas-lapack LAPACK decompositions and solvers (blocked, recursive, parallel, auto variants)
oxiblas-sparse Sparse matrix formats (9 types), multifrontal factorizations, and algorithms
oxiblas-ndarray Integration with ndarray (parallel GEMM, sparse integration)
oxiblas-benchmarks Performance benchmarks with OpenBLAS comparison (not in meta crate)

BLAS Operations

Level 1 (Vector-Vector)

Function Description
dot Dot product
axpy y = alpha * x + y
scal x = alpha * x
copy y = x
swap Swap x and y
nrm2 Euclidean norm
asum Sum of absolute values
iamax Index of max absolute value
rot Givens rotation

Level 2 (Matrix-Vector)

Function Description
gemv General matrix-vector multiply
symv Symmetric matrix-vector multiply
hemv Hermitian matrix-vector multiply
trmv Triangular matrix-vector multiply
trsv Triangular solve
ger Rank-1 update
syr Symmetric rank-1 update
her Hermitian rank-1 update
syr2 Symmetric rank-2 update
her2 Hermitian rank-2 update
gbmv General banded matrix-vector
sbmv Symmetric banded matrix-vector
hbmv Hermitian banded matrix-vector
spmv Symmetric packed matrix-vector
hpmv Hermitian packed matrix-vector
tbmv Triangular banded matrix-vector
tpmv Triangular packed matrix-vector
tbsv Triangular banded solve
tpsv Triangular packed solve

Level 3 (Matrix-Matrix)

Function Description
gemm General matrix-matrix multiply
symm Symmetric matrix-matrix multiply
hemm Hermitian matrix-matrix multiply
trmm Triangular matrix-matrix multiply
trsm Triangular solve with multiple RHS
syrk Symmetric rank-k update
herk Hermitian rank-k update
syr2k Symmetric rank-2k update
her2k Hermitian rank-2k update

LAPACK Operations

Decompositions

Decomposition Description
Lu LU with partial pivoting (compute, compute_blocked, compute_recursive, compute_auto, compute_blocked_par)
LuFullPiv LU with full pivoting
BandLu Banded LU factorization
Cholesky LL^T for positive definite matrices (compute, compute_blocked, compute_recursive, compute_auto, compute_blocked_par)
Ldlt LDL^T decomposition
BandCholesky Banded Cholesky
Qr QR decomposition (compute, compute_blocked, compute_recursive, compute_auto)
QrPivot QR with column pivoting
Svd Singular value decomposition (Jacobi)
SvdDc SVD with divide-and-conquer
SymmetricEvd Symmetric eigenvalue decomposition
SymmetricEvdDc Symmetric EVD with divide-and-conquer
GeneralEvd General eigenvalue decomposition
Schur Schur decomposition
Hessenberg Hessenberg reduction

Solvers

Solver Description
solve General linear system Ax = b
solve_multiple Multiple right-hand sides
solve_triangular Triangular system
lstsq Least squares solution
tridiag_solve Tridiagonal system

Utilities

Function Description
det Determinant
inv Matrix inverse
pinv Pseudoinverse
cond Condition number
rcond Reciprocal condition number
rank Matrix rank
nullity Nullity (dimension of null space)
null_space Null space basis
col_space Column space basis
norm_1, norm_2, norm_inf, norm_frobenius Matrix norms

Extended Precision & Tensor Operations

Extended Precision

OxiBLAS supports multiple precision levels:

use oxiblas::prelude::*;

// Quad-precision (f128) - ~31 decimal digits
#[cfg(feature = "f128")]
{
    use oxiblas_core::scalar::QuadFloat;
    let x = QuadFloat::from(2.0);
    let sqrt_x = x.sqrt();
    // Extremely high precision calculations
}

// Extended precision dot products
let x: Vec<f64> = vec![/* ... */];
let y: Vec<f64> = vec![/* ... */];

// Kahan summation (compensated)
let result = dot_kahan(&x, &y);

// Pairwise summation (divide-and-conquer)
let result = dot_pairwise(&x, &y);

// Mixed precision (f32 computation, f64 accumulation)
let x_f32: Vec<f32> = vec![/* ... */];
let y_f32: Vec<f32> = vec![/* ... */];
let result_f64 = dsdot(&x_f32, &y_f32);

Tensor Operations & Einstein Summation

OxiBLAS includes comprehensive tensor contraction support via einsum:

use oxiblas_blas::tensor::einsum;

// Matrix multiplication
let c = einsum("ij,jk->ik", &a, &[m, n], Some((&b, &[n, k])))?;

// Outer product
let c = einsum("i,j->ij", &x, &[m], Some((&y, &[n])))?;

// Transpose
let at = einsum("ij->ji", &a, &[m, n], None)?;

// Matrix trace
let trace = einsum("ii->", &a, &[n, n], None)?;

// Hadamard (element-wise) product
let c = einsum("ij,ij->ij", &a, &[m, n], Some((&b, &[m, n])))?;

// Tensor contractions (24 patterns supported)
let result = einsum("ijk,kl->ijl", &tensor3d, &[i, j, k], Some((&mat, &[k, l])))?;

Supported Patterns:

  • Matrix operations: matmul, transpose, trace, diagonal
  • Tensor transposes: 3 variants (ijk→ikj, ijk→jik, ijk→kji)
  • Reductions: row/column sums, total sum, axis sums
  • Products: outer, Hadamard, dot, Frobenius
  • Advanced: tensor-matrix contraction, batched operations

Batched Operations

use oxiblas_blas::tensor::{batched_matmul, Tensor3};

// Batch of matrix multiplications
let a = Tensor3::from_data(data_a, batch_size, m, k);
let b = Tensor3::from_data(data_b, batch_size, k, n);
let c = batched_matmul(&a, &b)?;
// c[i] = a[i] * b[i] for each batch i

Sparse Matrix Support

Formats

  • CSR - Compressed Sparse Row
  • CSC - Compressed Sparse Column
  • COO - Coordinate format
  • ELL - ELLPACK format
  • DIA - Diagonal format
  • BSR - Block Sparse Row
  • BSC - Block Sparse Column
  • HYB - Hybrid ELL+COO
  • SELL - Sliced ELLPACK (GPU-optimized)

Iterative Solvers

  • GMRES with restart (includes PGMRES, FGMRES)
  • Conjugate Gradient (CG, PCG, Block-CG)
  • BiCGStab
  • MINRES (includes PMINRES)
  • QMR (Quasi-Minimal Residual)
  • TFQMR (Transpose-Free QMR)
  • IDR(s) (Induced Dimension Reduction)
  • Block-GMRES (multiple RHS)
  • Lanczos iteration (symmetric)
  • Arnoldi iteration (general)
  • IRAM (Implicitly Restarted Arnoldi Method)

Preconditioners

  • ILU(0) / ILUT (Incomplete LU with threshold)
  • ILUTP (ILUT with pivoting)
  • Incomplete Cholesky (IC0, ICT)
  • Jacobi / Block Jacobi
  • Gauss-Seidel / SOR / SSOR
  • AMG (Algebraic Multigrid - classical Ruge-Stüben)
  • SA-AMG (Smoothed Aggregation AMG)
  • SPAI (Sparse Approximate Inverse)
  • AINV (Approximate Inverse)
  • Additive Schwarz (domain decomposition)
  • Polynomial preconditioners (Neumann, Chebyshev)

Sparse Eigenvalue & SVD

  • Lanczos method (symmetric matrices)
  • Block Lanczos
  • Arnoldi iteration (general matrices)
  • Block Arnoldi
  • Shift-invert spectral transformation
  • Polynomial filtering (Chebyshev)
  • Interval eigenvalue computation (Sturm sequence)
  • Truncated SVD
  • Randomized SVD
  • Incremental SVD (Brand algorithm)

Reordering

  • RCM (Reverse Cuthill-McKee)
  • AMD (Approximate Minimum Degree)
  • MMD (Multiple Minimum Degree)
  • COLAMD (Column AMD for unsymmetric/rectangular)
  • Nested Dissection (level-set based)

Performance

OxiBLAS implements BLIS-style blocked algorithms with architecture-specific SIMD kernels and platform-aware cache tuning.

Linux x86_64 Performance (Intel Xeon E5-2623 v4 @ 2.60GHz)

Benchmarked: 2025-12-27
CPU: Intel Xeon E5-2623 v4 @ 2.60GHz (Broadwell-EP)
Cache: L1D=32KB, L2=256KB, L3=10MB
SIMD: AVX2, FMA, SSE4.1, SSE4.2

DGEMM Performance (f64)

| Matrix Size | OxiBLAS Time | OxiBLAS Throughput | OpenBLAS Throughput | vs OpenBLAS | Status |
|-------------|--------------|--------------------|---------------------|-------------|--------|
| 64×64 | 30.84 µs | 8.50 Gelem/s (136 Gf/s) | 8.49 Gelem/s (136 Gf/s) | 100% | 🟢 Excellent |
| 128×128 | 184.92 µs | 11.34 Gelem/s (181 Gf/s) | 12.62 Gelem/s (202 Gf/s) | 90% | 🟢 Very Good |
| 256×256 | 1.220 ms | 13.75 Gelem/s (220 Gf/s) | 14.48 Gelem/s (232 Gf/s) | 95% | 🟢 Very Good |
| 512×512 | 11.05 ms | 12.15 Gelem/s (194 Gf/s) | 15.24 Gelem/s (244 Gf/s) | 80% | 🟡 Good |
| 1024×1024 | 80.68 ms | 13.31 Gelem/s (213 Gf/s) | 13.01 Gelem/s (208 Gf/s) | 102% | 🟢 Excellent |

Peak Performance: 13.75 Gelem/s = 220 GFLOPS (256×256 f64)
Key Achievement: OxiBLAS outperforms OpenBLAS by 2% on very large matrices (1024×1024)

SGEMM Performance (f32)

| Matrix Size | OxiBLAS Throughput | OpenBLAS Throughput | vs OpenBLAS | Status |
|-------------|--------------------|---------------------|-------------|--------|
| 64×64 | 15.79 Gelem/s (253 Gf/s) | 14.06 Gelem/s (225 Gf/s) | 112% | 🟢 Excellent |
| 128×128 | 20.50 Gelem/s (328 Gf/s) | 20.50 Gelem/s (328 Gf/s) | 100% | 🟢 Excellent |
| 256×256 | 23.65 Gelem/s (378 Gf/s) | 25.02 Gelem/s (400 Gf/s) | 95% | 🟢 Very Good |
| 512×512 | 25.16 Gelem/s (403 Gf/s) | ~26.8 Gelem/s (~429 Gf/s) | ~94% | 🟢 Very Good |

Summary: OxiBLAS achieves 80-112% of OpenBLAS performance on Linux x86_64, with superior performance on small f32 and very large f64 matrices.

Linux Optimization Highlights

  • 13-20% performance improvement after cache tuning for 256KB L2 systems
  • Fine-tuned blocking parameters: KC=192, MC=128 (optimized for 256KB L2 cache)
  • Increased prefetch distance: 12 iterations (optimized for Intel Xeon E5-2600 memory latency)
  • Platform-aware cache detection: Linux sysfs, macOS sysctl, x86_64 CPUID fallback
  • All 2,922 tests passing with zero warnings

macOS AArch64 Performance (Apple M3)

Benchmarked: 2025-12-28
CPU: Apple M3 (ARM64 NEON)
Cache: P-cores: L1D=128KB, L2=16MB
SIMD: NEON (128-bit)

DGEMM Performance (f64)

| Matrix Size | OxiBLAS Time | OxiBLAS Throughput | OpenBLAS Throughput | vs OpenBLAS | Status |
|-------------|--------------|--------------------|---------------------|-------------|--------|
| 64×64 | 10.14 µs | 25.84 Gelem/s (414 Gf/s) | 25.59 Gelem/s (409 Gf/s) | 101% | 🟢 Excellent |
| 128×128 | 83.63 µs | 25.08 Gelem/s (401 Gf/s) | 25.88 Gelem/s (414 Gf/s) | 97% | 🟢 Very Good |
| 256×256 | 670.89 µs | 24.99 Gelem/s (400 Gf/s) | 25.47 Gelem/s (408 Gf/s) | 98% | 🟢 Very Good |
| 512×512 | 5.21 ms | 25.76 Gelem/s (412 Gf/s) | 25.44 Gelem/s (407 Gf/s) | 101% | 🟢 Excellent |
| 1024×1024 | 40.25 ms | 26.68 Gelem/s (427 Gf/s) | 26.48 Gelem/s (424 Gf/s) | 101% | 🟢 Excellent |

Peak Performance: 26.68 Gelem/s = 427 GFLOPS (1024×1024 f64)
Key Achievement: OxiBLAS matches or exceeds OpenBLAS on Apple M3 for f64 GEMM

SGEMM Performance (f32)

| Matrix Size | OxiBLAS Throughput | OpenBLAS Throughput | vs OpenBLAS | Status |
|-------------|--------------------|---------------------|-------------|--------|
| 64×64 | 44.74 Gelem/s (716 Gf/s) | 44.39 Gelem/s (710 Gf/s) | 101% | 🟢 Excellent |
| 128×128 | 28.34 Gelem/s (453 Gf/s) | 23.49 Gelem/s (376 Gf/s) | 121% | 🟢 Outstanding |
| 256×256 | 49.08 Gelem/s (785 Gf/s) | 48.78 Gelem/s (781 Gf/s) | 101% | 🟢 Excellent |
| 512×512 | 56.28 Gelem/s (901 Gf/s) | 54.36 Gelem/s (870 Gf/s) | 104% | 🟢 Excellent |
| 1024×1024 | 55.99 Gelem/s (896 Gf/s) | 32.59 Gelem/s (521 Gf/s) | 172% | 🟢 Outstanding |

Breakthrough: OxiBLAS is 72% faster than OpenBLAS for large f32 matrices (1024×1024) on Apple M3!

Summary: OxiBLAS achieves 97-172% of OpenBLAS performance on Apple M3, with superior performance on f32 operations and competitive or better f64 performance.

macOS Optimization Highlights

  • NEON micro-kernel excellence: 4×6 kernel optimized for f32, achieving 121-172% of OpenBLAS
  • Large cache utilization: Optimal blocking for 16MB L2 cache on Apple M3 performance cores
  • DOT product optimization: 65% faster than OpenBLAS (6.0 vs 3.6 Gelem/s)
  • Production-ready: Competitive or superior performance across all major operations
  • Platform-aware tuning: macOS sysctl cache detection working perfectly

BLAS Level 3 Performance (via GEMM Optimization)

| Operation | Size | Performance | Speedup |
|-----------|------|-------------|---------|
| SYRK (f64) | 1024×1024 | 40.24 Gf/s | 12.53× vs naive |
| SYR2K (f64) | 1024×1024 | 40.99 Gf/s | 14.76× vs naive |
| HERK (f64) | 1024×1024 | - | 6-12× vs naive |
| HER2K (f64) | 1024×1024 | - | 6-15× vs naive |
| SYMM (f64) | 512×512 | - | 1.1-1.8× vs naive |
| TRMM (f64) | 1024×1024 | 40.60 Gf/s | 10.79× vs naive |
| TRSM (f64) | 1024×1024 | 19.96 Gf/s | 10.32× vs naive |
| Complex GEMM (3M) | 1024×1024 | 40.72 Gf/s | - |
| Parallel GEMM | 1024×1024 | 130.81 Gf/s | - |

Operation Performance Targets

| Operation | Linux x86_64 | macOS M3 | Notes |
|-----------|--------------|----------|-------|
| BLAS Level 1 (DOT) | 2-3 Gf/s | 165% OpenBLAS | M3: OxiBLAS dominates |
| BLAS Level 1 (AXPY) | TBD | 64% OpenBLAS | Optimization opportunity |
| BLAS Level 2 (matvec) | TBD | TBD | Cache-aware blocking |
| BLAS Level 3 (f64 GEMM) | 80-102% OpenBLAS | 97-101% OpenBLAS | Production-ready |
| BLAS Level 3 (f32 GEMM) | 94-112% OpenBLAS | 101-172% OpenBLAS | M3: Outstanding |
| SYRK/SYR2K | 6-15× speedup | - | GEMM-based optimization ✓ |
| HERK/HER2K | 6-15× speedup | - | GEMM-based optimization ✓ |
| TRMM | 7-11× speedup | - | GEMM-based optimization ✓ |
| TRSM | 2.5-10× speedup | - | Blocked algorithm with GEMM ✓ |
| QR Factorization | 3-7× vs unblocked ✓ | - | WY representation (fixed v0.2.0) |
| SVD | TBD | TBD | Divide-and-conquer |

Achievement: OxiBLAS matches or exceeds OpenBLAS on Apple M3 (97-172%), and is competitive on Linux x86_64 (80-112%). Pure Rust with NEON/AVX2 intrinsics.

LAPACK Factorization Speedups (Blocked vs Unblocked)

Blocked algorithms using GEMM/TRSM as building blocks deliver dramatic speedups over the unblocked Level 2 baseline. All results measured on Apple M3 (f64).

LU Factorization

| Matrix Size | Unblocked | Blocked | Speedup |
|-------------|-----------|---------|---------|
| 256×256 | 0.62 Gf/s | 9.05 Gf/s | 14.5× |
| 512×512 | 0.45 Gf/s | 6.65 Gf/s | 14.9× |
| 768×768 | 0.60 Gf/s | 13.84 Gf/s | 23.0× |
| 1024×1024 | 0.35 Gf/s | 6.99 Gf/s | 19.7× |

Cholesky Factorization

| Matrix Size | Unblocked | Blocked | Speedup |
|-------------|-----------|---------|---------|
| 200×200 | 2.32 Gelem/s | 10.47 Gelem/s | 4.5× |
| 256×256 | 2.16 Gf/s | 13.01 Gf/s | 6.0× |
| 500×500 | 1.65 Gelem/s | 15.60 Gelem/s | 9.5× |
| 512×512 | 1.65 Gf/s | 14.97 Gf/s | 9.1× |
| 1024×1024 | 1.44 Gf/s | 14.73 Gf/s | 10.2× |

Parallelization

Enable multi-core parallelization:

oxiblas = { version = "0.2", features = ["parallel"] }

Parallel operations automatically use all available cores via Rayon for large workloads.

Benchmarking

OxiBLAS includes a comprehensive benchmarking suite with direct comparisons to OpenBLAS:

# Run all benchmarks
cargo bench --package oxiblas-benchmarks

# Compare with OpenBLAS (requires OpenBLAS installed)
cargo bench --package oxiblas-benchmarks --bench comparison --features compare-openblas

# View HTML reports
open target/criterion/report/index.html

See crates/oxiblas-benchmarks/README.md for detailed benchmarking guide.

Platform Support and Performance

Tested Platforms

| Platform | CPU | SIMD | DGEMM | vs OpenBLAS |
|----------|-----|------|-------|-------------|
| macOS (Apple M3) | ARM64 | NEON 128-bit | 25.6 GFLOPS | ~100% |
| Linux x86_64 | Intel Xeon E5-2623 v4 | AVX2 256-bit | 220 GFLOPS | ~102% (1024x1024) |
| Any | Scalar fallback | None | varies | ~30-50% |

SIMD Feature Detection

OxiBLAS automatically detects and uses the best available SIMD:

  • x86_64: AVX-512 > AVX2 > SSE4.2 > scalar
  • aarch64: NEON (native 128-bit) + emulated 256-bit
  • wasm32: SIMD128 (if enabled) + emulated 256-bit
  • others: Scalar fallback

Runtime detection via SimdCapabilities::detect() with zero overhead after first call.

Performance Regression CLI

OxiBLAS ships a regress binary for tracking performance over time:

# Build the CLI
cargo build --release -p oxiblas-benchmarks

# Capture current performance as a baseline
./target/release/regress capture --output baseline.json

# On a subsequent commit/build, check for regressions (fail if >5% slower)
./target/release/regress check --baseline baseline.json --threshold 5.0

# Print a formatted performance report
./target/release/regress report --baseline baseline.json

# List all available benchmark names
./target/release/regress list

The check subcommand exits with a non-zero status when any operation regresses beyond the threshold, making it suitable for CI pipelines.

Algorithm Selection Guide

OxiBLAS provides multiple computation variants for dense factorizations, each suited to different use cases. The compute_auto() method is the recommended default: it selects between blocked and unblocked algorithms based on matrix size, with negligible dispatch overhead.

Dense Factorizations (Cholesky, LU, QR)

| Method | When to Use | Notes |
|--------|-------------|-------|
| compute() | Small matrices (n < 64-128), minimal overhead | Unblocked Level 2 BLAS |
| compute_blocked(nb) | Large matrices, explicit block size control | Level 3 BLAS; 6-23× over unblocked |
| compute_recursive() | Very large matrices, cache-oblivious | No explicit block size; auto-adapts to cache |
| compute_auto() | Recommended default | Auto-selects blocked (n >= 128) or unblocked (n < 128) |
| compute_blocked_par() | Very large matrices on multi-core systems | Requires parallel feature; Rayon-based |

use oxiblas_lapack::{cholesky::Cholesky, lu::Lu, qr::Qr};

// Recommended: automatic algorithm selection
let chol = Cholesky::compute_auto(a.as_ref())?;  // up to 9.75× faster than compute()
let lu   = Lu::compute_auto(a.as_ref())?;         // up to 23× faster than compute()
let qr   = Qr::compute_auto(a.as_ref())?;         // 3-7× faster for large matrices

// Explicit blocked with custom block size
let chol = Cholesky::compute_blocked(a.as_ref(), 64)?;

// Cache-oblivious recursive (no tuning required)
let lu = Lu::compute_recursive(a.as_ref())?;

// Parallel blocked (requires "parallel" feature)
let chol = Cholesky::compute_blocked_par(a.as_ref())?;

Mixed-Precision Iterative Refinement

For applications requiring double-precision accuracy with single-precision speed:

use oxiblas_lapack::refine::{mixed_precision_solve, mixed_precision_solve_cholesky};

// LU-based: factorize in f32, refine residuals in f64
let x = mixed_precision_solve(a.as_ref(), b.as_ref())?;

// Cholesky-based (symmetric positive definite systems)
let x = mixed_precision_solve_cholesky(a.as_ref(), b.as_ref())?;

Batched BLAS Operations

For batches of independent small matrix operations (e.g., neural network layers):

use oxiblas_blas::batched::{gemm_batched, gemm_strided_batched, axpy_batched};

// Batch of independent GEMMs: C[i] = A[i] * B[i]
gemm_batched(1.0, &a_slices, &b_slices, 0.0, &mut c_slices)?;

// Strided batched GEMM (contiguous memory layout)
gemm_strided_batched(1.0, &a, stride_a, &b, stride_b, 0.0, &mut c, stride_c, batch)?;

Feature Flags

Feature Description
default Core functionality (f32, f64, Complex32, Complex64)
parallel Enable rayon-based parallelization for multi-core
f16 Half-precision (16-bit) floating point support
f128 Quad-precision (~31 digits) via QuadFloat
full All features enabled
nightly Nightly-only optimizations

Benchmarks-specific features:

Feature Description
compare-openblas Enable OpenBLAS comparison benchmarks

Requirements

  • Rust 1.85+ (Edition 2024)
  • No external C dependencies

Project Status

Version: 0.2.1 (2026-03-16)

  • Lines of Code: ~223,935 Rust (371 files)
  • Documentation: ~16,163 lines of comments, 12 comprehensive examples
  • Tests: 2,922 passing lib tests + 287 doctests (100% success rate)
  • Coverage:
    • ✅ Full BLAS Level 1/2/3 (including packed/banded variants)
    • ✅ Extensive LAPACK (LU, Cholesky, QR, SVD, EVD, Schur, Hessenberg)
    • ✅ Recursive cache-oblivious factorizations (Cholesky, LU, QR)
    • ✅ Parallel blocked factorizations (Cholesky, LU)
    • ✅ Fixed blocked QR with WY representation (3-7x speedup realized)
    • ✅ Complex bidiagonal reduction (ComplexBidiagFactors)
    • ✅ Batched BLAS operations (gemm_batched, axpy_batched, gemv_batched)
    • ✅ Mixed-precision iterative refinement (LU, Cholesky, symmetric, QR)
    • ✅ Runtime auto-tuning infrastructure (RuntimeAutoTuner)
    • ✅ Multifrontal sparse factorizations (MultifrontalCholesky, MultifrontalLU)
    • ✅ ndarray parallel GEMM and sparse integration
    • ✅ no_std support for oxiblas-core and oxiblas-matrix
    • ✅ Advanced sparse LU pivoting (threshold, static, Bunch-Kaufman LDL^T)
    • ✅ Sparse matrices (9 formats: CSR, CSC, COO, ELL, DIA, BSR, BSC, HYB, SELL)
    • ✅ Iterative solvers (GMRES, FGMRES, CG, BiCGStab, MINRES, QMR, TFQMR, IDR(s))
    • ✅ Advanced preconditioners (ILUT, IC, AMG, SA-AMG, SPAI, AINV, Schwarz)
    • ✅ Sparse eigenvalue/SVD (Lanczos, Arnoldi, IRAM, polynomial filtering, truncated/randomized SVD)
    • ✅ Tensor operations (einsum with 24 patterns, batched matmul)
    • ✅ Extended precision (f16, f128, Kahan/pairwise summation)
    • ✅ Comprehensive benchmarks with OpenBLAS comparison
    • ✅ Complete API documentation with examples

See TODO.md for the development roadmap.

Related Projects

OxiBLAS is part of the SciRS2 scientific computing ecosystem:

  • SciRS2 - Scientific computing library
  • NumRS2 - Numerical computing
  • SkleaRS - Machine learning
  • ToRSh - Tensor operations
  • TrustformeRS - Transformers
  • QuantRS2 - Quantum computing framework
  • OxiRS - Semantic Web platform (SPARQL 1.2, GraphQL, Industrial Digital Twin, AI-augmented reasoning)

Sponsorship

OxiBLAS is developed and maintained by COOLJAPAN OU (Team Kitasan).

If you find OxiBLAS useful, please consider sponsoring the project to support continued development of the Pure Rust ecosystem.

Sponsor

https://github.com/sponsors/cool-japan

Your sponsorship helps us:

  • Maintain and improve the COOLJAPAN ecosystem
  • Keep the entire ecosystem (OxiBLAS, OxiFFT, SciRS2, etc.) 100% Pure Rust
  • Provide long-term support and security updates

License

Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).

See LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.
