# BigGrep Rust Implementation Architecture (Cargo Workspace) — Design Blueprint and Implementation Roadmap

## Executive Summary and Objectives

BigGrep targets ultra-fast pattern discovery over extremely large corpora by combining carefully engineered data structures with an I/O strategy that minimizes overhead and a workspace design that scales across multiple tools and teams. This document defines a complete Rust implementation architecture organized as a Cargo workspace centered on a shared library crate—biggrep-core—exposing stable, well-typed APIs for the core capabilities: corpus parsing, N-gram tokenization and indexing, search and verification, and file-level extraction. The workspace also defines five command-line interface (CLI) binaries with rs- prefixes—rs-bgindex, rs-bgsearch, rs-bgparse, rs-bgverify, and rs-bgextractfile—that compose those capabilities into task-oriented workflows.

Performance objectives emphasize predictable, high-throughput search backed by compact indexes and verifiable correctness. The architecture pursues zero-copy reads through memory-mapped I/O, parallel search via thread pools, and vectorized routines where beneficial. It treats index building as a batch pipeline: parse tokens, count N-grams, sort, and construct compressed indexes, primarily Elias–Fano-encoded tries, which provide strong compression with competitive lookup speed on massive N-gram datasets.[^2][^5][^6] The design supports large files through memory mapping while guarding against its known pitfalls, notably page cache eviction and read-ahead behavior that vary across operating systems.[^3][^8]

Non-goals include a full-text search engine with scoring and relevancy models beyond what BigGrep requires; generic tokenization libraries; and tight integration with specific operating systems beyond standard Rust practices. Where requirements are not yet specified, this blueprint articulates gaps and proposes practical defaults to unblock initial development.

## Cargo Workspace Architecture

The workspace adopts a minimal yet scalable structure: one shared library crate (biggrep-core) that holds core data structures, parsers, indexers, and search engines; and five binary crates (rs-bgindex, rs-bgsearch, rs-bgparse, rs-bgverify, rs-bgextractfile) that implement user-facing CLIs. This separation ensures a stable programmatic API for library consumers while allowing CLI crates to iterate on argument parsing, output formats, and operational ergonomics. Workspace-wide dependency versions and tool settings are centralized using workspace inheritance and shared configuration, enabling consistent builds and simplifying upgrades.[^1][^9]

Workspace dependencies and features are centralized in the root Cargo.toml under [workspace.dependencies], with edition and rust-version enforced at the workspace level. Binary crates import biggrep-core as a dependency, avoiding duplication of index formats, verification logic, and tokenizer code. A shared lints configuration and rustfmt settings enforce uniformity across the codebase.

To make responsibilities and interfaces concrete, the following table enumerates each workspace member, its type, primary responsibilities, and its key dependency on biggrep-core.

Table 1. Workspace Members and Responsibilities

| Member              | Type     | Primary Responsibilities                                                                 | Key Dependency on biggrep-core                                   |
|---------------------|----------|-------------------------------------------------------------------------------------------|-------------------------------------------------------------------|
| biggrep-core        | Library  | Tokenization, N-gram processing, index build/search, verification, metadata store API     | N/A                                                               |
| rs-bgparse          | Binary   | Corpus ingestion and parsing; optional chunking; token emission to counts or index build  | Tokenization API, N-gram counting, writer utilities               |
| rs-bgindex          | Binary   | Index builder from parsed N-gram counts (sorting, EF-trie construction, serialization)    | Index build pipeline, EF-trie implementation, writer API          |
| rs-bgsearch         | Binary   | Search interface over built indexes; optional regex filtering and file filtering          | Search engine, index access API, filter combinators               |
| rs-bgverify         | Binary   | Consistency checks for indexes and metadata; optional spot-check with ground-truth scans  | Verification APIs (index integrity, count checks), reporter       |
| rs-bgextractfile    | Binary   | Extract file paths or byte ranges from the index or metadata for post-processing          | File metadata API, extract utilities                              |

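Dependency and feature consistency follows established workspace practices: dependencies are defined once in the root manifest and referenced via .workspace = true across crates; features are declared per crate (Cargo has no workspace-wide feature table) and forwarded from the binaries to biggrep-core to control conditional compilation; and patch sections allow local forks to be tested when necessary.[^1][^9]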

Table 2. Workspace-level Features and Propagation

| Feature              | Default | Description                                                                                  | Member Crates Enabling It             |
|---------------------|---------|----------------------------------------------------------------------------------------------|--------------------------------------|
| simd                | off     | Enables SIMD-accelerated routines for tokenization and scanning where applicable            | biggrep-core                         |
| mmapped_io          | on      | Enables memory-mapped file reads via memmap2 for zero-copy access                           | biggrep-core                         |
| verification_checks | on      | Enables expensive index verification routines and extra integrity tests                      | biggrep-core, rs-bgverify            |
| metrics             | off     | Enables metrics emission (timers, counters) for benchmarking and runtime observability      | biggrep-core, all binaries           |
| parallel_search     | on      | Enables multi-threaded search using scoped thread pools                                      | biggrep-core, rs-bgsearch            |

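Each binary crate forwards the features it exposes to biggrep-core, so enabling a feature on a binary also enables it in the library. Because Cargo unifies the features of a shared dependency within a single build, binaries built together see the same biggrep-core feature set, which prevents partial feature activation.[^9]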

### Dependency and Feature Strategy

Centralized dependency management underpins reliability and build speed. The root Cargo.toml declares workspace dependencies (e.g., clap, regex, byteorder, memmap2), and member crates reference them using .workspace = true to avoid version skew and keep a single source of truth. Cargo has no workspace-wide feature table, so features such as simd and mmapped_io are declared in biggrep-core and forwarded explicitly by each binary crate that exposes them; enabling a feature on a binary then activates the corresponding feature in biggrep-core. When developing unreleased internal crates or forked crates, patch sections route crates-io or git dependencies to local paths or forks, allowing end-to-end testing without publishing.[^1][^9]
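As a rough illustration of this strategy, the fragment below sketches a root manifest followed by one member manifest. The crates/ directory layout, the version numbers, the rust-version value, and the lint choice are assumptions made for the example, not decisions recorded in this blueprint. The two manifests live in separate files and are shown together only for brevity.

```toml
# Root Cargo.toml (illustrative; paths and versions are assumptions)
[workspace]
resolver = "2"
members = [
    "crates/biggrep-core",
    "crates/rs-bgparse",
    "crates/rs-bgindex",
    "crates/rs-bgsearch",
    "crates/rs-bgverify",
    "crates/rs-bgextractfile",
]

[workspace.package]
edition = "2021"
rust-version = "1.75"

[workspace.dependencies]
biggrep-core = { path = "crates/biggrep-core" }
clap = { version = "4", features = ["derive"] }
regex = "1"
byteorder = "1"
memmap2 = "0.9"

[workspace.lints.rust]
unsafe_op_in_unsafe_fn = "warn"

# crates/rs-bgsearch/Cargo.toml (a separate file)
[package]
name = "rs-bgsearch"
version = "0.1.0"
edition.workspace = true
rust-version.workspace = true

[dependencies]
biggrep-core = { workspace = true }
clap = { workspace = true }

[features]
# Features are forwarded to the library crate explicitly.
default = ["parallel_search"]
parallel_search = ["biggrep-core/parallel_search"]
metrics = ["biggrep-core/metrics"]

[lints]
workspace = true
```

Forwarding features by name keeps the binary manifests thin: each binary only lists the biggrep-core features it wants to expose to its users.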

### Module Structure of biggrep-core

biggrep-core exposes six primary modules:

- Tokenization and parsing: streaming tokenization with byte-oriented normalization and optional Unicode support; chunked readers to mitigate mmap caveats and enable parallel parsing.
- N-gram processing: counting, aggregation, and sorting for fixed N (unigrams through N-grams); buffer management for large volumes.
- Indexing: compressed N-gram index construction, primarily Elias–Fano trie (EF-trie), and serialization with controlled memory usage.[^2][^5][^6]
- Search: prefix/range enumeration, regex post-filtering over candidate matches, and file-scope filtering for multi-file corpora.
- Verification: structural and count-based integrity checks; optional spot-checks comparing against scans of raw files to detect edge cases in parsing or encoding.[^4]
- File metadata: lightweight APIs to map byte ranges to file paths and vice versa; optional persistent metadata index for extraction.

This modularity yields a clean public API: index builder orchestrations, search engines with filter combinators, and verification runners that accept reporters and configuration. It also allows alternative index structures to be swapped in for experimentation without changing the CLI contracts.

## Data Model and Storage Format

BigGrep’s data model follows pragmatic defaults for large-scale corpora. A corpus consists of one or more text files. A token is the atomic unit of search after normalization and optional Unicode handling. An N-gram is a sequence of N tokens; counts record occurrences. The canonical storage includes per-order files—one file per N—using a line-oriented modified Google format: each line contains a gram (tokens separated by spaces) and a count separated by a tab. Files can be compressed with gzip to reduce storage, and the index builder reads gzip when appropriate.[^2]
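A minimal sketch of parsing one data line of the count format described above (gram tokens separated by spaces, then a tab, then the count). The names GramCount and parse_count_line are hypothetical, and the use of String as the error type is a simplification for the example.

```rust
/// One record from a per-order count file: the gram's tokens and its count.
/// (Hypothetical type; not part of a confirmed biggrep-core API.)
#[derive(Debug, PartialEq)]
struct GramCount {
    tokens: Vec<String>,
    count: u64,
}

/// Parses a data line of the form `token token ...<TAB>count`.
fn parse_count_line(line: &str) -> Result<GramCount, String> {
    // Split at the last tab so the count is always the final field.
    let (gram, count) = line
        .rsplit_once('\t')
        .ok_or_else(|| format!("missing tab separator: {line:?}"))?;

    // Tokens are separated by single spaces; empty tokens are rejected.
    let tokens: Vec<String> = gram.split(' ').map(str::to_owned).collect();
    if tokens.iter().any(|t| t.is_empty()) {
        return Err(format!("empty token in gram: {gram:?}"));
    }

    let count = count
        .trim_end()
        .parse::<u64>()
        .map_err(|e| format!("bad count in {line:?}: {e}"))?;

    Ok(GramCount { tokens, count })
}
```

Rejecting empty tokens at parse time keeps malformed grams (for example, doubled spaces) out of the sorting and indexing stages.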

Indexes are serialized into binary files. The preferred structure is an EF-trie compressed with Elias–Fano codes for efficient prefix/range enumeration at scale. Elias–Fano encoding offers competitive lookup performance with small footprints, making it well-suited for the interactive search workloads BigGrep targets.[^5][^6] Metadata stores maintain minimal per-file records (path, size, optionally checksum) and optional byte-range mappings for extraction.

To provide a concise reference, the index and metadata schema are summarized below.

Table 3. Index File Schema

| Section                  | Contents                                                                                   | Notes                                                                                         |
|--------------------------|--------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| Header                   | Magic number, version, N (order), counts of N-grams, flags (e.g., compression options)    | Versioning enables forward compatibility; flags indicate encoding options                    |
| Vocabulary (optional)    | Token map: token string ↔ integer ID                                                       | Remapping can reduce index size; may be embedded or separate                                 |
| EF-trie Nodes            | Pointers, token IDs, count ranks encoded with Elias–Fano                                   | EF encoding provides compressed trie with fast lookups[^5][^6]                               |
| Offset Index             | Byte offsets for per-order index segments                                                  | Supports mmap-friendly random access                                                          |
| Footer                   | Checksums (optional), index statistics (e.g., bytes/gram), alignment padding               | Alignment improves mmap performance on some platforms                                         |
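To make the header row of Table 3 more concrete, here is a sketch of a fixed-size header that round-trips through bytes. The magic value, the field widths, the 16-byte layout, and the little-endian choice are all assumptions for illustration; the blueprint does not fix them.

```rust
/// Hypothetical fixed-size index header (layout is an assumption).
#[derive(Debug, Clone, Copy, PartialEq)]
struct IndexHeader {
    version: u16,
    order: u8, // N, the N-gram order
    flags: u8, // encoding options, e.g. compression bits
    gram_count: u64,
}

const MAGIC: [u8; 4] = *b"BGIX"; // assumed magic number
const HEADER_LEN: usize = 16;

impl IndexHeader {
    /// Serializes the header as: magic(4) version(2) order(1) flags(1) count(8).
    fn to_bytes(&self) -> [u8; HEADER_LEN] {
        let mut out = [0u8; HEADER_LEN];
        out[0..4].copy_from_slice(&MAGIC);
        out[4..6].copy_from_slice(&self.version.to_le_bytes());
        out[6] = self.order;
        out[7] = self.flags;
        out[8..16].copy_from_slice(&self.gram_count.to_le_bytes());
        out
    }

    /// Validates length and magic before decoding the remaining fields.
    fn from_bytes(bytes: &[u8]) -> Result<Self, String> {
        if bytes.len() < HEADER_LEN {
            return Err(format!("header too short: {} bytes", bytes.len()));
        }
        if bytes[0..4] != MAGIC {
            return Err("bad magic number".to_string());
        }
        let mut count = [0u8; 8];
        count.copy_from_slice(&bytes[8..16]);
        Ok(IndexHeader {
            version: u16::from_le_bytes([bytes[4], bytes[5]]),
            order: bytes[6],
            flags: bytes[7],
            gram_count: u64::from_le_bytes(count),
        })
    }
}
```

A fixed-width, explicitly little-endian header can be read directly from a memory-mapped slice without depending on host byte order.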

Table 4. Metadata Store Schema

| Field             | Type                 | Purpose                                           | Optional |
|-------------------|----------------------|---------------------------------------------------|---------|
| file_id           | u64                  | Unique identifier for the file                    | No      |
| path              | String               | Canonical file path                               | No      |
| size_bytes        | u64                  | File size                                         | No      |
| checksum          | `Option<String>`     | Checksum (e.g., SHA-256)                          | Yes     |
| byte_range_map    | `Option<Vec<(u64, u64)>>` | Optional mapping of token ranges to byte spans    | Yes     |

### N-gram Index Structures

The EF-trie is the primary N-gram index. It encodes a trie of N-grams using Elias–Fano codes to compress the representation while preserving fast enumeration of prefixes and ranges. This balances memory footprint and lookup speed: empirical results on large N-gram datasets demonstrate compact storage and competitive per-gram lookup times.[^2][^5][^6] Alternative structures can be explored later:

- Rusty-DAWG (CDAWG): efficient automata supporting unbounded-length N-gram searches; valuable for certain pattern matching workloads.[^4]
- Partitioned Elias–Fano: variant that can accelerate range enumerations on specific access patterns.[^5][^6]
- Minimal perfect hashing: improves exact-match lookup at the cost of range enumeration and dynamic updates.

We propose EF-trie as the default baseline to anchor the initial implementation and benchmark plan.
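The Elias–Fano code underlying the EF-trie can be sketched on a plain sorted integer sequence. The example below is a teaching sketch only: the EliasFano type is hypothetical, the low parts are stored one per u64 rather than bit-packed, and lookup scans the high-bit vector linearly where a real implementation would use a constant-time select structure. It is not the trie layout described in the cited papers.

```rust
/// Minimal Elias-Fano encoding of a non-decreasing sequence (illustrative only).
struct EliasFano {
    low_bits: u32,
    /// Low `low_bits` bits of each value (unpacked here for readability).
    lows: Vec<u64>,
    /// Unary-coded high parts: element i sets bit (value >> low_bits) + i.
    highs: Vec<bool>,
}

impl EliasFano {
    fn new(values: &[u64]) -> Self {
        assert!(values.windows(2).all(|w| w[0] <= w[1]), "input must be sorted");
        let n = values.len() as u64;
        let universe = values.last().map_or(0, |&max| max.saturating_add(1));
        // Standard choice: floor(log2(universe / n)) low bits per element.
        let low_bits = if n > 0 && universe > n { (universe / n).ilog2() } else { 0 };
        let mask = (1u64 << low_bits) - 1;

        let mut highs = vec![false; (n + (universe >> low_bits) + 1) as usize];
        let mut lows = Vec::with_capacity(values.len());
        for (i, &v) in values.iter().enumerate() {
            lows.push(v & mask);
            highs[(v >> low_bits) as usize + i] = true;
        }
        EliasFano { low_bits, lows, highs }
    }

    /// Returns the i-th value by locating the i-th set bit in `highs`.
    fn get(&self, i: usize) -> Option<u64> {
        if i >= self.lows.len() {
            return None;
        }
        // Linear-time select; real implementations index this for O(1) access.
        let pos = self
            .highs
            .iter()
            .enumerate()
            .filter_map(|(pos, &set)| set.then_some(pos))
            .nth(i)?;
        let high = (pos - i) as u64;
        Some((high << self.low_bits) | self.lows[i])
    }
}
```

Splitting each value into a fixed-width low part and a unary-coded high part is what gives the encoding its small footprint while keeping random access possible.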

### Metadata Store

File-level metadata accompanies the index for operational workflows—listing, filtering, and extraction. For memory efficiency, persistent metadata stores may store only file paths, sizes, and optional checksums, with byte-range maps computed lazily or only for files where extraction is common. These choices can be tuned per corpus characteristics, preserving a thin, predictable API while accommodating scale.
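A sketch of the metadata record from Table 4 and one possible lookup. The FileRecord struct mirrors the table's fields; the locate function and its assumption that files are laid out back to back in record order are hypothetical and serve only to illustrate how a corpus-wide byte offset could be resolved to a file.

```rust
/// Mirrors the metadata fields listed in Table 4.
#[derive(Debug, Clone)]
struct FileRecord {
    file_id: u64,
    path: String,
    size_bytes: u64,
    checksum: Option<String>,
    byte_range_map: Option<Vec<(u64, u64)>>,
}

/// Resolves a corpus-wide byte offset to (file, offset within that file).
/// Assumption: files are concatenated logically in the order of `records`.
fn locate(records: &[FileRecord], corpus_offset: u64) -> Option<(&FileRecord, u64)> {
    let mut start = 0u64;
    for record in records {
        let end = start + record.size_bytes;
        if corpus_offset < end {
            return Some((record, corpus_offset - start));
        }
        start = end;
    }
    None // offset lies past the end of the corpus
}
```

For corpora with many files, a precomputed prefix-sum array with binary search would replace the linear scan shown here.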

## Core Capabilities and Library API (biggrep-core)

biggrep-core exposes stable APIs for its core workflows:

- Parse: streaming tokenization and N-gram counting; chunked readers to decouple I/O from parsing and enable parallel counting.
- Index: sort counts by gram (optionally using a vocabulary) and build an EF-trie; serialize with headers, offset indices, and optional checksums.[^2][^5][^6]
- Search: execute prefix queries, enumerate candidates by range, and apply regex post-filtering for complex patterns; constrain search by file scope.
- Verify: run structural and count-based integrity checks; perform spot-checks comparing against ground truth scans to detect edge conditions in parsing or encoding.[^4]
- Extract: map candidates to file paths or byte ranges and emit structured results.

Error handling prefers typed results (Result<T, E>) with clear error variants (e.g., ParseError, IoError, IndexError, VerifyError). Public APIs avoid internal panics; errors are reported with context (file, line, byte offset) where feasible.
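One way the typed, context-carrying errors described above might look, using only the standard library. The field names, the Display formats, and the parse_count helper are assumptions for illustration; ParseError and IoError are the names the blueprint itself uses.

```rust
use std::error::Error;
use std::fmt;
use std::io;
use std::path::{Path, PathBuf};

/// Parse failure with the context the blueprint asks for (file, line, byte offset).
#[derive(Debug)]
struct ParseError {
    path: PathBuf,
    line: u64,
    byte_offset: u64,
    message: String,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{}:{}: parse error at byte {}: {}",
            self.path.display(),
            self.line,
            self.byte_offset,
            self.message
        )
    }
}

impl Error for ParseError {}

/// I/O failure annotated with the path that was being accessed.
#[derive(Debug)]
struct IoError {
    path: PathBuf,
    source: io::Error,
}

impl fmt::Display for IoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.path.display(), self.source)
    }
}

impl Error for IoError {
    // Expose the underlying io::Error so callers can walk the error chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Hypothetical helper: parses a count field, returning a typed error, not a panic.
fn parse_count(field: &str, path: &Path, line: u64, byte_offset: u64) -> Result<u64, ParseError> {
    field.parse::<u64>().map_err(|e| ParseError {
        path: path.to_path_buf(),
        line,
        byte_offset,
        message: format!("invalid count {field:?}: {e}"),
    })
}
```

Keeping the context in struct fields rather than in a preformatted string lets the CLI crates choose their own output format.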

To orient CLI integration, the following table maps high-level APIs to typical CLI usage.

Table 5. Core API Mapping

| API (biggrep-core)             | Functionality                                              | Used By          | Inputs                                                | Outputs                                           |
|--------------------------------|------------------------------------------------------------|------------------|-------------------------------------------------------|---------------------------------------------------|
| Tokenizer::stream              | Stream tokens from a reader/file with normalization        | rs-bgparse       | Reader or file path, normalization config             | Token stream                                       |
| NgramCounter::count_chunks     | Count N-grams over chunked input                           | rs-bgparse       | Token stream, chunk size, N                           | Chunk counts                                      |
| Sorter::sort_counts            | Sort counts by gram using vocabulary order                 | rs-bgindex       | Chunk counts, vocabulary map                          | Sorted counts                                     |
| IndexBuilder::build_ef_trie    | Build EF-trie from sorted counts                           | rs-bgindex       | Sorted counts, options                                | In-memory EF-trie                                 |
| IndexSerializer::persist       | Serialize EF-trie with header, offset index, footer        | rs-bgindex       | EF-trie, path                                         | Binary index file                                 |
| SearchEngine::prefix_query     | Enumerate candidates by prefix range                       | rs-bgsearch      | Index access, prefix token(s)                         | Candidate N-gram stream                           |
| FilterChain::regex_filter      | Post-filter candidates with regex over token strings       | rs-bgsearch      | Candidate stream, regex engine                        | Filtered candidate stream                         |
| Verifier::run                  | Integrity checks over index and optional ground truth      | rs-bgverify      | Index, metadata, ground truth config                  | Report (errors, warnings, statistics)             |
| Extractor::file_ranges         | Map candidates to file paths or byte ranges                | rs-bgextractfile | Candidates, metadata                                  | Structured results (file, offsets, spans)         |

### Parser/Tokenizer

The tokenizer is byte-oriented to minimize overhead and supports optional Unicode-aware normalization. It reads files in chunks to mitigate mmap caveats and enable controlled parallelism in counting. Chunked readers emit tokens to the counting stage, where counters aggregate per-chunk N-gram occurrences and then merge results. This design aligns with large-file handling guidance and provides predictable memory behavior.[^3]
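A minimal sketch of a byte-oriented tokenizer over an in-memory slice (for example, a mapped file region). The ByteTokens type, splitting on ASCII whitespace, and ASCII lowercasing as the normalization step are assumptions; the blueprint leaves the exact tokenization rules open.

```rust
/// Iterator over whitespace-delimited tokens in a byte slice.
/// Tokens borrow from the input, so no bytes are copied while scanning.
struct ByteTokens<'a> {
    rest: &'a [u8],
}

impl<'a> ByteTokens<'a> {
    fn new(input: &'a [u8]) -> Self {
        ByteTokens { rest: input }
    }
}

impl<'a> Iterator for ByteTokens<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        let rest: &'a [u8] = self.rest;
        // Skip leading whitespace; stop if only whitespace remains.
        let start = rest.iter().position(|b| !b.is_ascii_whitespace())?;
        let after = &rest[start..];
        // The token runs until the next whitespace byte or the end of input.
        let len = after
            .iter()
            .position(|b| b.is_ascii_whitespace())
            .unwrap_or(after.len());
        let (token, remaining) = after.split_at(len);
        self.rest = remaining;
        Some(token)
    }
}

/// Assumed normalization: ASCII lowercasing only (non-ASCII bytes are untouched).
fn normalize(token: &[u8]) -> Vec<u8> {
    token.to_ascii_lowercase()
}
```

Because tokens are borrowed slices, allocation happens only in the normalization step, which a caller can skip when raw tokens suffice.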

### Index Builder

Index building proceeds in a deterministic pipeline:

1. Parse tokens and count N-grams, emitting per-order counts with consistent token ordering.
2. Sort counts by gram, optionally remapping tokens using a vocabulary to reduce index size; tools that sort N-gram files before indexing are well-established in this domain.[^2]
3. Construct the EF-trie using sorted counts and serialize it with a header, per-order segments, offset indices, and optional checksums and statistics.[^5][^6]

The builder reports resource usage (e.g., peak memory, bytes per gram) and supports reproducible index versions with a stable schema and flags.
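The first two pipeline steps (count per chunk, merge, then sort by gram) can be sketched with standard collections. The function names are hypothetical, grams are kept as token strings without vocabulary remapping, and everything is held in memory; a production pipeline would likely need external sorting for large corpora.

```rust
use std::collections::HashMap;

type Counts = HashMap<Vec<String>, u64>;

/// Counts every N-gram in one chunk of tokens.
fn count_ngrams(tokens: &[&str], n: usize) -> Counts {
    let mut counts = Counts::new();
    if n == 0 {
        return counts; // `windows(0)` would panic, so treat N = 0 as empty
    }
    for window in tokens.windows(n) {
        let gram: Vec<String> = window.iter().map(|t| t.to_string()).collect();
        *counts.entry(gram).or_insert(0) += 1;
    }
    counts
}

/// Merges per-chunk counts into a single table by summing counts per gram.
fn merge_counts(chunks: impl IntoIterator<Item = Counts>) -> Counts {
    let mut merged = Counts::new();
    for chunk in chunks {
        for (gram, count) in chunk {
            *merged.entry(gram).or_insert(0) += count;
        }
    }
    merged
}

/// Sorts grams lexicographically by token sequence, ready for trie construction.
fn sort_counts(counts: Counts) -> Vec<(Vec<String>, u64)> {
    let mut sorted: Vec<_> = counts.into_iter().collect();
    sorted.sort_by(|a, b| a.0.cmp(&b.0));
    sorted
}
```

Note that chunks must overlap by N-1 tokens, or grams spanning a chunk boundary are lost; with such an overlap, merging the per-chunk tables reproduces the counts of a single pass for inputs like the one tested here, though longer chunks would need care to avoid counting grams inside the overlap twice.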

### Search and Verification

Search composes prefix queries, range enumerations, and regex post-filtering. A file-scope filter constrains candidates to specific files or patterns, enabling targeted retrieval across multi-file corpora. Verification offers structural checks (e.g., offset consistency, sortedness, required headers) and count-based validations against ground truth scans; this dual approach catches encoding issues and corpus-specific edge cases.[^4]
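To illustrate how a prefix query reduces to a range over sorted grams, followed by a post-filter, the sketch below uses binary search on a plain sorted slice rather than an EF-trie. The function names are hypothetical, and the keep closure stands in for the regex post-filter so that the example needs no external crate.

```rust
type Entry = (Vec<String>, u64);

/// True if `gram` starts with all tokens of `prefix`.
fn has_prefix(gram: &[String], prefix: &[&str]) -> bool {
    gram.len() >= prefix.len() && gram.iter().zip(prefix).all(|(g, p)| g.as_str() == *p)
}

/// Returns the contiguous range of entries whose gram starts with `prefix`.
/// `sorted` must be ordered lexicographically by token sequence.
fn prefix_range<'a>(sorted: &'a [Entry], prefix: &[&str]) -> &'a [Entry] {
    // Lower bound: skip every gram that compares strictly below the prefix.
    let start = sorted.partition_point(|(gram, _)| {
        gram.iter().map(String::as_str).lt(prefix.iter().copied())
    });
    // Matching grams are contiguous from `start` because the slice is sorted.
    let len = sorted[start..].partition_point(|(gram, _)| has_prefix(gram, prefix));
    &sorted[start..start + len]
}

/// Prefix query followed by a post-filter (a stand-in for regex filtering).
fn search<'a>(
    sorted: &'a [Entry],
    prefix: &[&str],
    keep: impl Fn(&[String]) -> bool,
) -> Vec<&'a Entry> {
    prefix_range(sorted, prefix)
        .iter()
        .filter(|(gram, _)| keep(gram))
        .collect()
}
```

Running the cheap range lookup first means the more expensive filter only sees candidates that already share the prefix.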

## CLI Tools (rs-* Binaries)

The five CLI tools compose biggrep-core capabilities with ergonomic argument handling using clap, which supports subcommands, long options, and validation. Commands share common flags for index paths, concurrency control, and reporting formats, making multi-step workflows predictable and scriptable. Outputs are line-oriented and parseable for downstream integration.

Table 6. CLI Command Matrix

| Tool             | Subcommands/Flags (Illustrative)                                                                                         | Inputs                                 | Outputs                                         | Typical Use Cases                                                 |
|------------------|---------------------------------------------------------------------------------------------------------------------------|----------------------------------------|-------------------------------------------------|-------------------------------------------------------------------|
| rs-bgparse       | --input, --order, --chunk-size, --output (counts), --normalize, --unicode                                                | Text files or directories               | Per-order count files (modified Google format)  | Ingest corpus; generate N-gram counts for index building          |
| rs-bgindex       | --counts, --order, --output, --vocab (optional), --compression, --checksum                                              | Count files, vocabulary (optional)      | Binary index file (EF-trie)                     | Build and persist EF-trie index with headers and offset index     |
| rs-bgsearch      | --index, --prefix, --regex, --file-scope, --concurrency, --format                                                       | Index file, search patterns             | Matches (gram, count, file references)          | Fast prefix search with optional regex post-filtering             |
| rs-bgverify      | --index, --metadata, --ground-truth (optional), --mode (structural/count-based), --report                               | Index and metadata; optional raw scans  | Verification report                             | Validate index integrity and detect encoding/parsing edge cases   |
| rs-bgextractfile | --index, --metadata, --candidates or --pattern, --output (file/byte ranges), --format                                   | Index and metadata; candidates          | Extracted file references or byte spans         | Post-process results; map to files and ranges for pipelines       |

Integration flows are straightforward: rs-bgparse emits counts; rs-bgindex builds the index; rs-bgsearch queries the index; rs-bgverify audits; rs-bgextractfile translates results into actionable file-level artifacts. This decoupled pipeline allows parallel index building and targeted verification without coupling CLI behavior to internal data structures.[^1]

## I/O and Performance Strategy

BigGrep’s performance strategy is anchored in zero-copy reads where safe and efficient, parallel search to saturate CPU cores, and vectorization for hotspots. The system uses memory-mapped I/O to avoid copy overhead and minimize kernel-user space transitions, especially for random access patterns. However, it treats mmap with operational discipline: chunked readers and controlled access mitigate page cache eviction and read-ahead pitfalls across operating systems.[^3][^8]

Parallel search employs scoped thread pools that maintain cache locality and predictable memory behavior. Data-oriented design favors structures of arrays (SoA) for hot loops (e.g., token scanning and candidate filtering), minimizing indirections and improving vectorization opportunities. Vectorized scanning, particularly for ASCII or UTF-8 byte patterns, can significantly accelerate tokenization and regex prefilters, provided that robust scalar fallbacks exist for general Unicode handling.[^11]

### Memory-mapped I/O

Memory-mapped I/O (mmap) maps file contents directly into the process address space, enabling read operations that resemble in-memory access. This reduces I/O overhead and can deliver substantial performance gains for random access workloads.[^3][^8] Risks include:

- Read-ahead amplification on large files.
- Page cache eviction under memory pressure.
- Partial writes and synchronization concerns for multi-writer scenarios.

Controls include advisory flags, chunked readers, and alignment to page boundaries to improve TLB behavior and avoid spurious faults. The design avoids shared writable mmaps in initial versions to reduce complexity.

Table 7. mmap Risk and Mitigation Matrix

| Risk                          | Symptom                                      | Mitigation Strategy                                                                 |
|-------------------------------|-----------------------------------------------|-------------------------------------------------------------------------------------|
| Read-ahead amplification      | High memory usage, slow random access         | Use chunked readers; disable read-ahead for large files; prefetch selectively       |
| Page cache eviction           | Performance collapses under memory pressure   | Use conservative mmap windows; fallback to streaming reads; monitor memory pressure |
| Cross-platform variability    | Behavior differs by OS                        | Conditional compilation per OS; test suite across platforms                         |
| Alignment and TLB misses      | Faults and slower access                      | Align buffers to page boundaries; use huge pages where available                    |
| Write synchronization         | Data races or corruption                      | Avoid shared writable mmaps; use single-writer, multi-reader protocols carefully    |
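The streaming fallback mentioned in the mitigation column can be sketched as a chunked reader that never holds more than one buffer plus one partial token in memory. The for_each_token function is hypothetical and splits on ASCII whitespace as a simplifying assumption; it works with any std::io::Read source.

```rust
use std::io::{self, Read};

/// Reads `reader` in fixed-size chunks and invokes `on_token` for each
/// whitespace-delimited token, carrying partial tokens across chunk boundaries.
fn for_each_token<R: Read>(
    mut reader: R,
    chunk_size: usize,
    mut on_token: impl FnMut(&[u8]),
) -> io::Result<()> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut buf = vec![0u8; chunk_size];
    let mut carry: Vec<u8> = Vec::new(); // token bytes seen so far

    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        for &byte in &buf[..n] {
            if byte.is_ascii_whitespace() {
                if !carry.is_empty() {
                    on_token(&carry);
                    carry.clear();
                }
            } else {
                carry.push(byte);
            }
        }
    }
    // Flush a final token that was not followed by whitespace.
    if !carry.is_empty() {
        on_token(&carry);
    }
    Ok(())
}
```

Memory use is bounded by the chunk size and the longest token, independent of file size, which is the property that makes this a useful fallback when mapping a large file is undesirable.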

### Parallel Search and Data Layout

Parallel search decomposes work into batches sized for balanced throughput and minimal contention. Thread pools use scoped threads to avoid dynamic allocation overhead in the hot path and to ensure lifetime safety. Data-oriented layouts favor contiguous storage for candidate lists and SoA representations for token attributes, enabling better compiler auto-vectorization and hand-written SIMD routines when enabled.[^11]

Table 8. Threading and Memory Layout Considerations

| Dimension            | Consideration                                        | Guidance                                                           |
|---------------------|-------------------------------------------------------|--------------------------------------------------------------------|
| Batch size          | Balance overhead vs. cache locality                   | Choose sizes that avoid cache thrashing and keep data hot          |
| Work decomposition  | Prefix/range-based splits                             | Split by token ranges to reduce cross-thread synchronization       |
| Data layout         | SoA vs. AoS                                           | Prefer SoA for hot loops to improve vectorization                  |
| Synchronization     | Minimize locks and atomic operations                  | Use lock-free queues or batched result aggregation                 |
| Fallbacks           | Ensure correctness on non-SIMD hardware               | Provide scalar fallbacks; detect feature availability at runtime   |
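A small sketch of batched, scoped-thread filtering with per-batch result aggregation, using std::thread::scope from the standard library. The function name, the substring test as the filter, and the even batch split are assumptions; the blueprint does not prescribe a specific decomposition.

```rust
use std::thread;

/// Counts candidates containing `needle`, splitting the work into one
/// contiguous batch per worker. Workers borrow the slice; nothing is cloned.
fn parallel_count_matches(candidates: &[String], workers: usize, needle: &str) -> usize {
    if candidates.is_empty() {
        return 0;
    }
    let workers = workers.max(1);
    // Ceiling division so every candidate lands in exactly one batch.
    let batch = (candidates.len() + workers - 1) / workers;

    thread::scope(|scope| {
        let handles: Vec<_> = candidates
            .chunks(batch)
            .map(|chunk| {
                // Each worker produces one partial count: batched aggregation
                // with no locks or atomics in the hot loop.
                scope.spawn(move || chunk.iter().filter(|c| c.contains(needle)).count())
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<usize>()
    })
}
```

Scoped threads let the workers borrow the candidate slice directly, since the scope guarantees all threads finish before the borrowed data goes out of scope.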

## Build, Test, and Benchmark Plan

The workspace standardizes rust-version, edition, and formatting at the root to ensure reproducible builds and uniform developer experience. Selective testing in CI focuses on changed crates, while critical shared components are covered comprehensively. Benchmarks quantify index build times, lookup latency (prefix and range queries), verification coverage, and end-to-end throughput under controlled conditions.

Table 9. Benchmark Plan

| Metric                      | Methodology                                               | Data Sets                              | Target Thresholds                            |
|----------------------------|-----------------------------------------------------------|----------------------------------------|----------------------------------------------|
| Index build time           | Time pipeline from counts to serialized index             | Large corpora, N=1..5                  | Linear scaling; predictable peak memory      |
| Lookup latency (prefix)    | Measure microseconds per query                           | Random and structured prefixes         | Near-microsecond per gram for cached queries[^2] |
| Range enumeration throughput | Queries per second over candidate ranges                 | Mixed workloads                        | High QPS with stable latency                 |
| Verification coverage      | % of index structure validated; false positive rate       | Synthetic and real corpora             | Close to 100% structural coverage            |
| Bytes per gram             | Index size / number of grams                             | N=1..5 word N-gram datasets            | Competitive with EF-trie baseline[^2][^5][^6] |
| End-to-end throughput      | Tokens/sec and N-grams/sec during parsing and search      | Representative corpora                 | High throughput with consistent CPU usage    |

CI strategies prioritize fast feedback: selective tests using -p flags, coverage for biggrep-core, and targeted performance tests on critical paths.[^9] Benchmark integration ensures regressions are detected early and performance claims are grounded in measurable outcomes.

## Security, Safety, and Reliability

BigGrep adopts defensive practices around mmap: avoiding shared writable mappings, carefully handling page alignment and access patterns, and preferring streaming or chunked reads when conditions warrant. Input validation covers token normalization, gram encoding, and count parsing to prevent malformed index entries. Error reporting preserves context for operational troubleshooting: file paths, byte offsets, and structural details are included where feasible.

Concurrency safety rests on scoped threads and careful data sharing. The design avoids interior mutability in hot paths and prefers batch aggregation to minimize synchronization overhead. Data-oriented structures reduce aliasing and improve predictability.

Multi-process access patterns assume single-writer semantics for index construction and read-only behavior for search and verification. If concurrent access is needed, the architecture can adopt single-writer/multi-reader protocols inspired by existing crates where appropriate, while preferring file-level locking or process-level coordination when portability is paramount.[^8]

## Implementation Roadmap

A phased roadmap aligns with the technical risk profile and enables iterative validation:

Table 10. Milestones and Deliverables

| Phase | Deliverables                                                                                 | Acceptance Criteria                                                | Risks                                         | Mitigation                                           |
|------:|-----------------------------------------------------------------------------------------------|--------------------------------------------------------------------|-----------------------------------------------|------------------------------------------------------|
| 1     | Workspace skeleton; biggrep-core tokenization and N-gram counting; rs-bgparse                | Correct tokenization; accurate per-chunk counts                    | Unicode edge cases                            | Configurable normalization; robust tests             |
| 2     | Sorting pipeline and EF-trie index builder; rs-bgindex                                       | Functional index build; baseline compression metrics               | Memory pressure during build                  | Chunked processing; memory monitoring                |
| 3     | Serialization; rs-bgsearch prefix queries and regex post-filter                              | Fast prefix search; correct regex filtering                        | Regex performance variability                 | Vectorized prefilter; fallback strategies            |
| 4     | rs-bgverify structural and count-based checks; metadata extraction (rs-bgextractfile)        | High coverage verification reports; accurate extraction            | False positives/negatives                     | Spot-checks; ground truth integration                |
| 5     | Performance hardening: mmap tuning, SIMD, thread pool calibration; comprehensive benchmarks | Measurable speedups; stable throughput; cross-OS validation        | Platform variability                          | Conditional compilation; platform test matrix        |

## Information Gaps

Several key parameters remain unspecified and will influence implementation choices:

- Corpus characteristics: average and maximum file sizes, tokenization rules (Unicode normalization, whitespace, and punctuation), and whether to support streaming ingestion versus full-file parsing.
- Index constraints: required N-gram orders, persistent versus in-memory index preference, and target memory footprint.
- Operational environment: supported operating systems, whether multi-process access patterns are required, and whether data can reside on network-backed storage.
- Performance targets: acceptable index build times, query latency thresholds, and expected throughput.
- Functional scope: whether regex search is required, the exact semantics of verification, and the precise meaning of “file extraction” in BigGrep’s context.

The roadmap includes decision points to resolve these gaps with measured trade-offs once real datasets and workloads are available.

## Appendices

Glossary:

- Token: Atomic text unit after normalization.
- N-gram: Sequence of N tokens.
- EF-trie (Elias–Fano trie): Trie of N-grams encoded with Elias–Fano codes for compact storage and fast lookups.
- Prefix query: Retrieve all N-grams beginning with a given token sequence.
- Range enumeration: Enumerate N-grams within an ordered range for efficient candidate generation.

Example: Modified Google N-gram Format (per-order file):

- One file per order N.
- Header indicates the number of N-grams.
- Each line: gram (tokens separated by spaces), a horizontal tab, and the count.
- Example line: “the parent” followed by tab and “1”.[^2]

Proposed Public APIs (sketched pseudo-signatures):

- Tokenizer::stream(reader, config) -> impl Iterator<Item=Result<Token, ParseError>>
- NgramCounter::count_chunks(tokens, N, config) -> impl Iterator<Item=ChunkCounts>
- Sorter::sort_counts(chunks, vocab) -> impl Iterator<Item=SortedCounts>
- IndexBuilder::build_ef_trie(sorted_counts, options) -> EFIndex
- IndexSerializer::persist(index, path) -> Result<(), IoError>
- SearchEngine::prefix_query(index, prefix) -> impl Iterator<Item=Candidate>
- FilterChain::regex_filter(candidates, regex) -> impl Iterator<Item=Candidate>
- Verifier::run(index, metadata, config) -> Report
- Extractor::file_ranges(candidates, metadata) -> impl Iterator<Item=ExtractResult>

These signatures illustrate the stable contract between biggrep-core and CLI tools while leaving room for internal evolution.

## References

[^1]: Workspaces — The Cargo Book. https://doc.rust-lang.org/cargo/reference/workspaces.html  
[^2]: tongrams-rs: Tons of N-grams in Rust. https://github.com/kampersanda/tongrams-rs  
[^3]: Memory-mapped files for efficient data processing. https://www.blopig.com/blog/2024/08/memory-mapped-files-for-efficient-data-processing/  
[^4]: Evaluating n-Gram Novelty of Language Models Using Rusty-DAWG. https://aclanthology.org/2024.emnlp-main.800.pdf  
[^5]: Efficient Data Structures for Massive N-gram Datasets (SIGIR 2017). https://doi.org/10.1145/3077136.3080798  
[^6]: Handling Massive N-gram Datasets Efficiently (TOIS 2019). https://doi.org/10.1145/3302913  
[^7]: Text processing — list of Rust libraries/crates // Lib.rs. https://lib.rs/text-processing  
[^8]: Advanced Memory Mapping in Rust: The Hidden Superpower for High-Performance Systems. https://medium.com/@FAANG/advanced-memory-mapping-in-rust-the-hidden-superpower-for-high-performance-systems-a47679aa205e  
[^9]: Advanced Cargo Workspace Patterns: 5 Proven Strategies for Scaling Large Rust Production Codebases. https://medium.com/techkoala-insights/advanced-cargo-workspace-patterns-5-proven-strategies-for-scaling-large-rust-production-codebases-c10862a4e1f4  
[^10]: tongrams — Docs.rs. https://docs.rs/tongrams  
[^11]: Beyond multi-core parallelism: faster Mandelbrot with SIMD. https://pythonspeed.com/articles/optimizing-with-simd/