# BigGrep Index Parser (rs-bgparse)

## Overview

rs-bgparse is the BigGrep index parser: it reads N-gram indexes and generates candidate file lists for search verification. This implementation replaces an earlier generic file parser so that its behavior matches the original BigGrep bgparse.

## Features

- **N-gram Index Reading**: Reads BigGrep index files (.bgi) and parses their structure
- **Hint-based Searching**: Uses 16-N-gram jump hints for efficient searching
- **PFOR/VarByte Decoding**: Supports both compressed posting list formats
- **Candidate Generation**: Generates candidate file lists for search verification
- **Parallel Processing**: Searches multiple index files concurrently

## Command Line Interface

```
rs-bgparse -d <directory> -p <patterns> [options]

Options:
  -d, --directory <DIR>     Index directory to search in
  -p, --patterns <PATTERNS> Search patterns (hex-encoded, comma-separated)
  --max-candidates <N>      Maximum candidate files to return (default: 15000)
  -v, --verbose             Verbose output
  --stats                   Show compression statistics per N-gram
  --debug                   Debug output
  --prefix <STRING>         Index file prefix (default: "index")
  -h, --help                Show help
  -V, --version             Show version
```

## Usage Examples

### Basic Search
```bash
rs-bgparse -d /path/to/indexes -p "68656c6c6f"  # Search for "hello" in hex
```

### Multiple Patterns
```bash
rs-bgparse -d /path/to/indexes -p "68656c6c6f,776f726c64"  # Two patterns: "hello" and "world" in hex
```

### With Verbose Output
```bash
rs-bgparse -d /path/to/indexes -p "68656c6c6f" -v
```

### Show Statistics
```bash
rs-bgparse -d /path/to/indexes -p "68656c6c6f" --stats
```

## Implementation Details

### Index Structure
The implementation reads BigGrep index files with the following structure:
- **Header**: Magic number, version, N-gram order, file counts
- **Hints**: Prefix-based mappings for fast N-gram lookup
- **N-gram Postings**: Compressed file ID lists
- **File ID Map**: Maps numeric file IDs to file paths

### Search Algorithm
1. **Pattern Conversion**: Hex-encoded patterns → binary data → N-grams
2. **Hint Lookup**: Use N-gram prefixes to find approximate locations
3. **Window Search**: Search within 16-N-gram windows using hints
4. **Posting Decoding**: Decode PFOR/VarByte compressed file ID lists
5. **Set Intersection**: Intersect candidate sets across N-grams
6. **Path Resolution**: Map file IDs to actual file paths
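
Step 1 above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the function names `decode_hex` and `extract_ngrams` are hypothetical, and packing each N-gram big-endian into a `u32` is an assumption that may not match the on-disk byte order.

```rust
use std::collections::BTreeSet;

/// Decode a hex string such as "68656c6c6f" into raw bytes.
/// Returns `None` for odd-length input or non-hex characters.
fn decode_hex(hex: &str) -> Option<Vec<u8>> {
    let bytes = hex.as_bytes();
    if bytes.len() % 2 != 0 {
        return None;
    }
    bytes
        .chunks(2)
        .map(|pair| {
            let hi = (pair[0] as char).to_digit(16)?;
            let lo = (pair[1] as char).to_digit(16)?;
            Some((hi * 16 + lo) as u8)
        })
        .collect()
}

/// Collect the distinct N-grams of `data`, each packed into a `u32`.
/// Assumption: 1 <= n <= 4 and big-endian packing (first byte is most significant).
fn extract_ngrams(data: &[u8], n: usize) -> BTreeSet<u32> {
    assert!((1..=4).contains(&n), "n must be between 1 and 4");
    data.windows(n)
        .map(|w| w.iter().fold(0u32, |acc, &b| (acc << 8) | u32::from(b)))
        .collect()
}
```

With these helpers, the pattern `68656c6c6f` decodes to the bytes of `hello`, which yields three distinct 3-grams (`hel`, `ell`, `llo`). A pattern shorter than N produces no N-grams, so a real implementation has to decide how to handle that case.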

### Compression Support
- **VarByte**: Variable-length integer encoding
- **PFOR**: Patched Frame Of Reference with exception handling
- **Hint Types**: Support for different prefix granularities (0, 1, 2)
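
To make the VarByte idea concrete, here is a minimal decoding sketch. It assumes a common VarByte layout (7 payload bits per byte, low-order group first, high bit set on every byte except the last) and assumes that postings are stored as gaps between ascending file IDs, with the first gap measured from 0. The BigGrep index format may differ on any of these points, and `decode_varbyte_deltas` is a hypothetical name.

```rust
/// Decode a VarByte-encoded list of gaps into absolute, ascending file IDs.
/// Assumed layout: 7 payload bits per byte, low-order group first,
/// high bit set on every byte except the last byte of each integer.
/// Returns `None` on truncated input or when an ID does not fit in a `u32`.
fn decode_varbyte_deltas(buf: &[u8]) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    let mut prev: u32 = 0; // last absolute ID emitted
    let mut value: u32 = 0; // gap currently being assembled
    let mut shift: u32 = 0; // bit position of the next 7-bit group

    for &byte in buf {
        if shift > 28 {
            return None; // more continuation bytes than a u32 can hold
        }
        value |= u32::from(byte & 0x7f) << shift;
        if byte & 0x80 != 0 {
            shift += 7; // continuation bit set: more groups follow
        } else {
            prev = prev.checked_add(value)?; // gap -> absolute ID
            ids.push(prev);
            value = 0;
            shift = 0;
        }
    }

    if shift != 0 {
        return None; // input ended in the middle of an integer
    }
    Some(ids)
}
```

Under these assumptions, the bytes `05 03 ac 02` encode the gaps 5, 3, and 300, which decode to the file IDs 5, 8, and 308. PFOR decoding is more involved and is not shown here.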

## Architecture

The implementation follows the original BigGrep bgparse design:
- Memory-mapped I/O for efficient index access
- Hint-based seeking to minimize I/O operations
- Efficient compressed integer decoding
- Set intersection for multi-N-gram queries
- Early termination for empty candidate sets
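
The last two points can be illustrated with a short sketch. It assumes each decoded posting list is an ascending, duplicate-free `Vec<u32>` of file IDs; the function names are hypothetical and the real implementation may use a different strategy.

```rust
use std::cmp::Ordering;

/// Intersect two ascending, duplicate-free ID lists with a linear merge.
fn intersect_sorted(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

/// Intersect the posting lists of every N-gram in a query.
/// Stops as soon as the running result is empty, since no later
/// list can add candidates back.
fn intersect_all(postings: &[Vec<u32>]) -> Vec<u32> {
    let mut lists = postings.iter();
    let mut acc = match lists.next() {
        Some(first) => first.clone(),
        None => return Vec::new(), // no N-grams, no candidates
    };
    for list in lists {
        if acc.is_empty() {
            break; // early termination
        }
        acc = intersect_sorted(&acc, list);
    }
    acc
}
```

For example, intersecting `[1, 4, 7, 9]`, `[4, 5, 9]`, and `[2, 9]` leaves only file ID 9 as a candidate. Processing the shortest posting list first would shrink the running result sooner, though whether bgparse orders lists this way is not stated here.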

## Dependencies

- `biggrep-core`: Core BigGrep functionality
- `clap`: Command-line argument parsing
- `anyhow`: Error handling
- `log`: Logging infrastructure
- `byteorder`: Binary data reading/writing
- `memmap2`: Memory-mapped file access
- `regex`: Regular expression support

## Building

```bash
cd crates/rs-bgparse
cargo build --release
```

## Testing

```bash
cd crates/rs-bgparse
cargo test
```

## Integration

rs-bgparse is designed to work with the broader BigGrep ecosystem:
- Its candidate file list is the input to `rs-bgverify`, which eliminates false positives
- It integrates with `rs-bgsearch` for end-to-end search workflows
- It is compatible with the BigGrep index format specifications

## Differences from Original

This Rust implementation is intended to provide the same core functionality as the original C++ bgparse, and differs mainly in its error handling and logging:
- Identical command-line interface
- Same index file format support
- Equivalent search algorithms and performance characteristics
- Enhanced error handling and logging