# rs-bgindex - BigGrep N-gram Index Builder

`rs-bgindex` is a Rust implementation of the BigGrep index builder. It builds N-gram indexes over large binary corpora so that byte patterns can be located quickly.

## Features

### Core Functionality
- **N-gram Indexing**: Supports 3-gram and 4-gram indexing; files that exceed the unique N-gram limit can be routed to a separate 4-gram index
- **Memory-Mapped I/O**: Zero-copy file processing for maximum throughput
- **Producer-Consumer Threading**: Parallel shingling and compression with configurable thread counts
- **Advanced Compression**: PFOR (Patched Frame of Reference) and VarByte encoding with automatic fallback
- **LoserTree Merging**: Efficient N-way merge for sorted N-gram lists
- **Hint-Based Indexing**: Fast seeking with configurable hint granularity

### Command-Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `-n, --ngram` | N-gram size (3 or 4) | 3 |
| `-H, --hint-type` | Hint type (0-2) | 0 |
| `-b, --blocksize` | PFOR blocksize | 32 |
| `-e, --exceptions` | PFOR max exceptions per block | 2 |
| `-m, --minimum` | PFOR minimum entries to consider PFOR | 4 |
| `-M, --max-unique-ngrams` | Maximum unique N-grams per file | None |
| `-p, --prefix` | Index file prefix | "index" |
| `-O, --overflow` | Write overflow filenames to FILE | None |
| `-S, --sthreads` | Number of shingling threads | 4 |
| `-C, --cthreads` | Number of compression threads | 5 |
| `-v, --verbose` | Show additional info | false |
| `-L, --lockfree` | Use lockfree queues | false |
| `-l, --log` | Log file | stderr |
| `-d, --debug` | Show diagnostic information | false |

### Input Format

The tool reads a list of files from stdin, one per line, in the format:
```
file_id:path/to/file
```

Or simply:
```
path/to/file
```

If no `file_id` is given, IDs are assigned sequentially, starting from 0.
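
A minimal sketch of how one input line could be parsed under these rules. The names `FileEntry` and `parse_line`, the `u32` ID width, and the choice to treat a prefix as an ID only when it is numeric are assumptions for illustration, not the tool's actual code:

```rust
use std::path::PathBuf;

/// One entry from the stdin file list (hypothetical type).
#[derive(Debug, PartialEq)]
struct FileEntry {
    id: u32,
    path: PathBuf,
}

/// Parse `file_id:path` or a bare `path`.
/// `next_id` is the auto-assignment counter; it advances only when the
/// line carries no explicit ID.
fn parse_line(line: &str, next_id: &mut u32) -> Option<FileEntry> {
    let line = line.trim();
    if line.is_empty() {
        return None;
    }
    // Assumption: the text before the first ':' counts as an ID only if it
    // parses as a number, so bare paths containing ':' are kept whole.
    if let Some((head, rest)) = line.split_once(':') {
        if let Ok(id) = head.parse::<u32>() {
            return Some(FileEntry { id, path: PathBuf::from(rest) });
        }
    }
    let id = *next_id;
    *next_id += 1;
    Some(FileEntry { id, path: PathBuf::from(line) })
}
```

With a counter starting at 0, `7:/data/a.bin` keeps ID 7, while a following bare `/data/b.bin` receives ID 0.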

## Architecture

### Pipeline Stages

1. **Shingling**: Worker threads memory-map files and extract N-grams
   - Fast 3-gram extraction on little-endian systems (read 4 bytes, keep the lower 24 bits)
   - 4-gram extraction for complete byte sequences
   - Overflow detection for dense files

2. **Sorting & Dedup**: N-grams are sorted and deduplicated per file

3. **Merging**: LoserTree-based N-way merge combines sorted lists globally

4. **Compression**: Delta-encoding followed by PFOR or VarByte encoding
   - PFOR for well-behaved lists (uniform bit width, few exceptions)
   - VarByte for small lists or lists with too many exceptions

5. **Writing**: Ordered write with hint generation and index structure
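
The first two stages can be sketched as follows. This is an illustration only: `unique_trigrams` is a hypothetical name, the input is a plain byte slice rather than a memory map, and packing each 3-gram into the low 24 bits of a `u32` is an assumption. `u32::from_le_bytes` stands in for the native 4-byte read so the sketch behaves the same on any platform.

```rust
/// Return the sorted, deduplicated 3-grams of `data`, each packed into the
/// low 24 bits of a u32 (first byte in the lowest 8 bits).
fn unique_trigrams(data: &[u8]) -> Vec<u32> {
    let mut grams = Vec::with_capacity(data.len().saturating_sub(2));
    // Fast path: read 4 bytes as a little-endian word, keep the low 24 bits.
    for w in data.windows(4) {
        let word = u32::from_le_bytes([w[0], w[1], w[2], w[3]]);
        grams.push(word & 0x00FF_FFFF);
    }
    // The final 3-gram has no fourth byte to read, so build it byte by byte.
    if data.len() >= 3 {
        let t = &data[data.len() - 3..];
        grams.push(u32::from(t[0]) | u32::from(t[1]) << 8 | u32::from(t[2]) << 16);
    }
    // Stage 2: sort and deduplicate per file.
    grams.sort_unstable();
    grams.dedup();
    grams
}
```

Overflow detection could then compare the length of the returned vector with the `-M` limit.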

### Data Structures

#### LoserTree
Efficient priority queue for N-way merging:
- `O(log N)` extraction per N-gram
- Maintains sorted order across N input streams
- Updates tree structure in `O(log N)` time
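
A compact loser tree over in-memory lists shows where the `O(log N)` cost comes from: after the winner is consumed, only the path from its leaf to the root is replayed. The names `LoserTree`, `replace_winner`, and `merge_sorted` are hypothetical, and the real merger works on N-gram streams rather than `Vec<u32>`:

```rust
/// tree[0] holds the overall winner; tree[1..k] holds the loser of the match
/// played at each internal node. Leaf `i` sits at implicit position `k + i`.
struct LoserTree {
    tree: Vec<usize>,
    keys: Vec<Option<u32>>, // current head of each stream; None = exhausted
}

impl LoserTree {
    fn new(keys: Vec<Option<u32>>) -> Self {
        let k = keys.len();
        let mut t = LoserTree { tree: vec![0; k.max(1)], keys };
        if k > 0 {
            let winner = t.build(1);
            t.tree[0] = winner;
        }
        t
    }

    /// True if stream `a` beats stream `b`; exhausted streams always lose.
    fn less(&self, a: usize, b: usize) -> bool {
        match (self.keys[a], self.keys[b]) {
            (Some(x), Some(y)) => (x, a) < (y, b),
            (Some(_), None) => true,
            (None, Some(_)) => false,
            (None, None) => a < b,
        }
    }

    /// Play all matches below `node`, store losers, return the winner.
    fn build(&mut self, node: usize) -> usize {
        let k = self.keys.len();
        if node >= k {
            return node - k; // leaf
        }
        let l = self.build(2 * node);
        let r = self.build(2 * node + 1);
        let (winner, loser) = if self.less(l, r) { (l, r) } else { (r, l) };
        self.tree[node] = loser;
        winner
    }

    /// Replace the winner's key and replay its leaf-to-root path.
    fn replace_winner(&mut self, new_key: Option<u32>) {
        let k = self.keys.len();
        let mut winner = self.tree[0];
        self.keys[winner] = new_key;
        let mut node = (winner + k) / 2;
        while node >= 1 {
            let loser = self.tree[node];
            if self.less(loser, winner) {
                self.tree[node] = winner;
                winner = loser;
            }
            node /= 2;
        }
        self.tree[0] = winner;
    }
}

/// Merge sorted lists into one sorted list (duplicates are kept).
fn merge_sorted(lists: &[Vec<u32>]) -> Vec<u32> {
    let mut out = Vec::new();
    if lists.is_empty() {
        return out;
    }
    let mut pos = vec![0usize; lists.len()];
    let mut lt = LoserTree::new(lists.iter().map(|l| l.first().copied()).collect());
    loop {
        let w = lt.tree[0];
        let Some(key) = lt.keys[w] else { break };
        out.push(key);
        pos[w] += 1;
        lt.replace_winner(lists[w].get(pos[w]).copied());
    }
    out
}
```

Unlike a binary heap, each replay compares the new key against one stored loser per level, so an extraction needs a single comparison per tree level.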

#### Compression Methods

**VarByte Encoding**:
- Variable-length integer encoding
- Most significant bit of each byte indicates continuation
- Fast encoding/decoding
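
A sketch of one common VarByte layout, assuming 7 payload bits per byte, least-significant group first, and a set high bit meaning "more bytes follow". The byte order and polarity used by the actual index format are not stated here, so treat both as assumptions; `varbyte_encode` and `varbyte_decode` are hypothetical names.

```rust
/// Append `v` to `out`, 7 bits per byte, low group first.
/// The high bit is set on every byte except the last.
fn varbyte_encode(mut v: u32, out: &mut Vec<u8>) {
    while v >= 0x80 {
        out.push((v as u8 & 0x7F) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Decode one value from the front of `buf`.
/// Returns the value and the number of bytes consumed, or None if the
/// input ends (or exceeds 5 bytes) before a terminating byte is seen.
fn varbyte_decode(buf: &[u8]) -> Option<(u32, usize)> {
    let mut v = 0u32;
    for (i, &b) in buf.iter().enumerate().take(5) {
        v |= u32::from(b & 0x7F) << (7 * i);
        if b & 0x80 == 0 {
            return Some((v, i + 1));
        }
    }
    None
}
```

Small deltas, which dominate dense posting lists, fit in a single byte under this scheme.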

**PFOR (Patched Frame of Reference)**:
- Block-based compression with uniform bit width
- Stores exceptions separately
- Optimal for lists with predictable value distribution
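
The sketch below illustrates the idea behind PFOR parameter selection: delta-encode a sorted file ID list, then pick the smallest bit width that leaves at most a given number of values as exceptions. `delta_encode`, `choose_encoding`, and `Choice` are hypothetical names, the whole list is treated as a single block, and only the minimum-entries fallback is modelled. The exact rule the tool uses to reject a block for having too many exceptions is not specified here.

```rust
#[derive(Debug, PartialEq)]
enum Choice {
    VarByte,
    Pfor { bit_width: u32, exceptions: usize },
}

/// Turn a sorted (non-decreasing) ID list into gaps; the first gap is the
/// first ID itself. Assumes the input is sorted.
fn delta_encode(ids: &[u32]) -> Vec<u32> {
    let mut prev = 0u32;
    ids.iter()
        .map(|&id| {
            let d = id - prev;
            prev = id;
            d
        })
        .collect()
}

/// Choose an encoding for one block of deltas.
/// `minimum` mirrors `-m` and `max_exceptions` mirrors `-e`.
fn choose_encoding(deltas: &[u32], minimum: usize, max_exceptions: usize) -> Choice {
    if deltas.len() < minimum {
        return Choice::VarByte; // too short for PFOR to pay off
    }
    // Bits needed per value (0 needs 0 bits), sorted ascending.
    let mut widths: Vec<u32> = deltas.iter().map(|&d| 32 - d.leading_zeros()).collect();
    widths.sort_unstable();
    // The widest `max_exceptions` values may be stored as exceptions.
    let keep = widths.len().saturating_sub(max_exceptions);
    let bit_width = if keep == 0 { 0 } else { widths[keep - 1] };
    let exceptions = widths.iter().filter(|&&w| w > bit_width).count();
    Choice::Pfor { bit_width, exceptions }
}
```

For the IDs `[3, 5, 6, 10, 1000, 1001]` the deltas are `[3, 2, 1, 4, 990, 1]`; allowing two exceptions, a 2-bit frame covers four values and the deltas 4 and 990 are stored separately.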

### Index File Structure

```
┌─────────────────────────┐
│      Index Header       │  (56 bytes)
│  - Magic: 0x424749...   │
│  - Version              │
│  - N-gram size          │
│  - Num N-grams          │
│  - Num files            │
│  - Compression params   │
│  - Section offsets      │
├─────────────────────────┤
│     Hints Section       │  (Variable)
│  - Prefix mappings      │
│  - Byte offsets         │
├─────────────────────────┤
│   N-gram Index Body     │  (Compressed)
│  - Size (VarByte)       │
│  - N-gram (3/4 bytes)   │
│  - Compressed file IDs  │
├─────────────────────────┤
│    fileid_map           │  (Variable)
│  - File ID → Path       │
│  - Optional metadata    │
└─────────────────────────┘
```

## Usage Examples

### Basic Index Building

Build a 3-gram index for all files listed in a file list:
```bash
cat file_list.txt | rs-bgindex -p /path/to/index
```

### High-Performance Indexing

Build with maximum parallelism and lockfree queues:
```bash
cat file_list.txt | rs-bgindex \
  -S 8 \
  -C 8 \
  -L \
  -v
```

### Mixed 3-gram/4-gram Indexing

Build a 3-gram index and record files that exceed the unique N-gram limit in an overflow list:
```bash
cat file_list.txt | rs-bgindex \
  -M 1000000 \
  -O overflow_files.txt \
  -p mixed_index
```

Then process overflow files with 4-gram indexing:
```bash
cat overflow_files.txt | rs-bgindex \
  -n 4 \
  -p overflow_index
```

### Verbose Build with Logging

```bash
cat file_list.txt | rs-bgindex \
  -v \
  -d \
  -l build.log \
  -p indexed_data
```

## Performance Characteristics

### Throughput
- **Shingling**: 100-500 MB/s per core (depends on file type)
- **Compression**: 50-200 MB/s per compression thread
- **Overall**: Scales approximately linearly with the number of available CPU cores

### Memory Usage
- Shingling threads: ~4-8 MB per thread (file buffers)
- Compression threads: ~16-32 MB per thread (compression buffers)
- LoserTree: Minimal overhead (`O(N)` nodes for `N` input streams)

### Compression Ratios
- **Sparse files**: 5-10x compression (most N-grams unique)
- **Dense files**: 2-5x compression (many repeated N-grams)
- **Mixed corpus**: 3-7x compression (typical workload)

## Error Handling

The tool handles various error conditions gracefully:

- **File not found**: Logs warning and continues
- **Permission denied**: Logs error and skips file
- **Overflow files**: Writes to overflow list if `-O` specified
- **Memory pressure**: Falls back to smaller buffers
- **Compression failures**: Automatic fallback to simpler encoding

## Implementation Notes

### Memory-Mapped I/O
- Uses `memmap2` for zero-copy file reading
- Handles partial reads gracefully
- Cross-platform compatible (Linux, macOS, Windows)

### Threading Model
- Shingling threads: Read files and extract N-grams
- Compression threads: Encode N-gram → file ID lists
- Writer thread: Ordered write with hint generation
- Backpressure: Queue size limits prevent memory overflow
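
A minimal sketch of the producer-consumer shape with backpressure, using only the standard library. `run_pipeline` is a hypothetical function: files are in-memory byte vectors, the consumer simply counts N-grams instead of compressing, and the bounded `sync_channel` stands in for whichever queue the tool actually uses.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Shingle `files` on `sthreads` workers and hand results to a single
/// consumer through a queue holding at most `queue_cap` pending items.
/// Returns the total number of 3-grams the consumer received.
fn run_pipeline(files: Vec<Vec<u8>>, sthreads: usize, queue_cap: usize) -> usize {
    let work = Arc::new(Mutex::new(files.into_iter()));
    let (tx, rx) = sync_channel::<Vec<u32>>(queue_cap);

    let workers: Vec<_> = (0..sthreads)
        .map(|_| {
            let work = Arc::clone(&work);
            let tx = tx.clone();
            thread::spawn(move || loop {
                // Take the next file; the lock is released before shingling.
                let next = work.lock().unwrap().next();
                let Some(data) = next else { break };
                let grams: Vec<u32> = data
                    .windows(3)
                    .map(|w| u32::from(w[0]) | u32::from(w[1]) << 8 | u32::from(w[2]) << 16)
                    .collect();
                // `send` blocks while the queue is full: this is the backpressure.
                if tx.send(grams).is_err() {
                    break;
                }
            })
        })
        .collect();
    drop(tx); // the receiver stops once every worker has finished

    // Consumer side: drain the queue on the calling thread.
    let total: usize = rx.iter().map(|g| g.len()).sum();
    for w in workers {
        w.join().unwrap();
    }
    total
}
```

Because producers block on a full queue, memory held by pending N-gram lists is bounded by `queue_cap` plus one in-flight list per worker.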

### Lockfree Queues
- Optional lockfree queues with `-L` flag
- Requires a target platform with native atomic operations
- Improves throughput on high-contention workloads

### Hint System
- Type 0: 4-gram, no hint (fastest build, slower search)
- Type 1: 3-gram, 16 N-gram hint granularity
- Type 2: 3-gram, 256 N-gram hint granularity

## Integration

### With rs-bgsearch
```bash
# Build index
cat files.txt | rs-bgindex -p myindex

# Search index
rs-bgsearch -p myindex -s "needle in haystack"
```

### With rs-bgverify
```bash
# Verify index integrity
rs-bgverify -p myindex -v
```

## Building and Installing

```bash
# Build
cargo build --package rs-bgindex --release

# Install
cargo install --path crates/rs-bgindex

# Run from source
cargo run --package rs-bgindex --release -- [args]
```

## Testing

Run the test suite:
```bash
cargo test --package rs-bgindex
```

## License

MIT License - see LICENSE file for details.

## Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

## References

- [BigGrep: A Scalable Search Index for Binary Files](https://github.com/cmu-sei/BigGrep)
- [A Scalable Search Index for Binary Files (MALWARE 2012 paper)](https://github.com/cmu-sei/BigGrep/blob/master/doc/AScalableSearchIndexForBinaryFiles_MALWARE2012.pdf)
- [BigGrep Usage and Implementation Whitepaper](https://github.com/cmu-sei/BigGrep/blob/master/doc/BigGrep-Usage-and-Impl-whitepaper.pdf)
