# rs-bgextractfile - BigGrep Index Management Tool

## Overview

`rs-bgextractfile` is a BigGrep index management utility that replaces the original C++ `bgextractfile` tool. It edits the `fileid_map` section of a BigGrep index file, so file entries can be removed, replaced, or added without rebuilding the whole index.

## Key Features

- **Remove files from indexes** - Remove obsolete or outdated file entries
- **Replace file paths** - Update file locations in the index without re-indexing
- **Add new files** - Add file entries with automatic ID assignment
- **Compressed index support** - Handle both legacy and compressed BigGrep index formats
- **Index integrity validation** - Verify index consistency and detect issues
- **Batch operations** - Process multiple operations from a file
- **Atomic updates** - Safe in-place modifications with rollback capabilities

## Installation

```bash
cd /workspace/code/biggrep-rs
cargo build --package rs-bgextractfile
```

## Usage

### Basic Operations

#### Remove files from index
```bash
rs-bgextractfile -i index.bgi -r file1.txt,file2.txt
```

#### Add files to index
```bash
rs-bgextractfile -i index.bgi -a newfile1.txt,newfile2.txt
```

#### Replace file paths
```bash
rs-bgextractfile -i index.bgi --replace /old/path/file.txt:/new/path/file.txt
```

#### List files in index
```bash
rs-bgextractfile list -i index.bgi
rs-bgextractfile list -i index.bgi --pattern "*.log"
```

#### Validate index integrity
```bash
rs-bgextractfile validate -i index.bgi
rs-bgextractfile validate -i index.bgi -v  # verbose mode
```

### Advanced Usage

#### Batch operations via file
Create an operations file (`operations.txt`) with one operation per line:
```
# Remove files
obsolete_file1.txt
obsolete_file2.txt

# Replace paths
/old/path/config.txt:/new/path/config.txt

# Add new files
important_file.log
new_data.csv
```

Execute batch operations:
```bash
rs-bgextractfile -i index.bgi -f operations.txt
```

#### Subcommand interface
```bash
# Remove using subcommand
rs-bgextractfile remove -i index.bgi file1.txt file2.txt

# Add using subcommand
rs-bgextractfile add -i index.bgi file3.txt file4.txt

# Replace using subcommand
rs-bgextractfile replace -i index.bgi old:new
```

## Command-Line Options

### Global Options
- `-i, --index PATH` - Index file path
- `-v, --verbose` - Verbose output
- `-d, --debug` - Debug-level logging

### Operations
- `-r, --remove FILES` - Remove files (comma-separated list)
- `-a, --add FILES` - Add files (comma-separated list)
- `--replace OPS` - Replace operations (old:new format, comma-separated)
- `-f, --file FILE` - File containing operations (one per line)
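
The `old:new`, comma-separated format accepted by `--replace` can be illustrated with a small parser. This is a minimal sketch, not the tool's actual code: the function name `parse_replace_ops` is hypothetical, and it assumes that paths contain neither `,` nor `:`, since the handling of such paths is not documented here.

```rust
/// Parse a `--replace` style argument such as "a:b,c:d" into (old, new) pairs.
/// ASSUMPTION: each pair is split at its first ':' and pairs are split at ','.
/// Paths that themselves contain ':' or ',' are not handled by this sketch.
pub fn parse_replace_ops(spec: &str) -> Result<Vec<(String, String)>, String> {
    spec.split(',')
        .map(|op| {
            let op = op.trim();
            match op.split_once(':') {
                // Both sides must be non-empty to form a valid replacement.
                Some((old, new)) if !old.is_empty() && !new.is_empty() => {
                    Ok((old.to_string(), new.to_string()))
                }
                _ => Err(format!("expected old:new, got {:?}", op)),
            }
        })
        .collect()
}
```

Collecting into `Result<Vec<_>, _>` stops at the first malformed pair, so a single bad entry rejects the whole argument rather than being applied partially.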

### Subcommands
- `remove` - Remove files from index
- `add` - Add files to index
- `replace` - Replace files in index
- `list` - List files in index
- `validate` - Validate index integrity

## Index Format Support

### Legacy Format (`fmt_minor` < 2)
- Uncompressed `fileid_map` section
- Direct binary operations
- Fast processing

### Modern Format (`fmt_minor` >= 2)
- Zlib-compressed `fileid_map` section
- Automatic compression detection
- Memory-efficient operations

### Index Header Fields
```rust
struct BgiHeader {
    magic: [u8; 8],           // "BIGGREP1"
    version: u32,             // Should be 1
    ngram_order: u32,         // 3 or 4
    num_files: u32,           // Updated after operations
    fileid_map_offset: u64,   // Start of fileid_map
    fileid_map_size: u32,     // Size of fileid_map
    fmt_minor: u32,           // Minor format version; >= 2 means compressed fileid_map
}
```
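
To show how these fields might be read and checked, here is a minimal sketch of a header parser. The on-disk layout is an assumption: it treats the fields as packed, little-endian, and stored in the order declared above (36 bytes in total), which this document does not confirm. The names `parse_header`, `is_compressed`, and `HEADER_LEN` are hypothetical.

```rust
/// Mirror of the header fields listed above.
#[derive(Debug, PartialEq)]
pub struct BgiHeader {
    pub magic: [u8; 8],
    pub version: u32,
    pub ngram_order: u32,
    pub num_files: u32,
    pub fileid_map_offset: u64,
    pub fileid_map_size: u32,
    pub fmt_minor: u32,
}

/// ASSUMPTION: packed little-endian fields in declared order (8+4+4+4+8+4+4).
pub const HEADER_LEN: usize = 36;

fn u32_at(buf: &[u8], off: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[off..off + 4]);
    u32::from_le_bytes(b)
}

fn u64_at(buf: &[u8], off: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[off..off + 8]);
    u64::from_le_bytes(b)
}

/// Decode and sanity-check a header from the first bytes of an index file.
pub fn parse_header(buf: &[u8]) -> Result<BgiHeader, String> {
    if buf.len() < HEADER_LEN {
        return Err(format!("header too short: {} bytes", buf.len()));
    }
    let mut magic = [0u8; 8];
    magic.copy_from_slice(&buf[..8]);
    if &magic != b"BIGGREP1" {
        return Err("invalid magic number".to_string());
    }
    let header = BgiHeader {
        magic,
        version: u32_at(buf, 8),
        ngram_order: u32_at(buf, 12),
        num_files: u32_at(buf, 16),
        fileid_map_offset: u64_at(buf, 20),
        fileid_map_size: u32_at(buf, 28),
        fmt_minor: u32_at(buf, 32),
    };
    if header.version != 1 {
        return Err(format!("unsupported version {}", header.version));
    }
    Ok(header)
}

/// Per the format sections above, fmt_minor >= 2 marks a compressed fileid_map.
pub fn is_compressed(header: &BgiHeader) -> bool {
    header.fmt_minor >= 2
}
```

Checking the magic and version before anything else matches the "Invalid format" error case described under Error Handling.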

## File ID Map Format

Each line in the `fileid_map` follows this format:
```
<id>\t<path>\t<optional_metadata>
```

Example:
```
1\t/home/user/documents/report.txt
2\t/home/user/documents/data.csv\tSIZE=1024;TYPE=DATA
3\t/home/user/documents/notes.md
```
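
The line format above can be parsed and written back with a few lines of Rust. This is a sketch under stated assumptions: the `\t` in the examples is taken to mean a real tab character, IDs are assumed to fit in a `u32`, and the names `FileEntry`, `parse_entry`, and `format_entry` are hypothetical rather than taken from the tool's source.

```rust
/// One fileid_map line: an ID, a path, and optional free-form metadata.
#[derive(Debug, PartialEq)]
pub struct FileEntry {
    pub id: u32,
    pub path: String,
    pub metadata: Option<String>,
}

/// Parse "<id>\t<path>[\t<metadata>]" (real tab characters assumed).
pub fn parse_entry(line: &str) -> Result<FileEntry, String> {
    // Drop a trailing newline so the last field is clean.
    let trimmed = line.trim_end_matches(|c: char| c == '\n' || c == '\r');
    // At most three fields; any further tabs stay inside the metadata field.
    let mut fields = trimmed.splitn(3, '\t');

    let id_text = fields.next().unwrap_or("");
    let id = id_text
        .parse::<u32>()
        .map_err(|e| format!("invalid id {:?}: {}", id_text, e))?;
    let path = fields
        .next()
        .filter(|p| !p.is_empty())
        .ok_or_else(|| format!("entry {} has no path", id))?;
    let metadata = fields.next().filter(|m| !m.is_empty()).map(str::to_string);

    Ok(FileEntry { id, path: path.to_string(), metadata })
}

/// Serialize an entry back into the same tab-separated form (no newline).
pub fn format_entry(entry: &FileEntry) -> String {
    match &entry.metadata {
        Some(meta) => format!("{}\t{}\t{}", entry.id, entry.path, meta),
        None => format!("{}\t{}", entry.id, entry.path),
    }
}
```

Using `splitn(3, '\t')` keeps the metadata field intact even if it contains tabs of its own, which seems the safer choice when the metadata syntax is not specified.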

## Error Handling

The tool reports errors in the following cases:

- **File not found** - The index file does not exist
- **Invalid format** - The index has an incorrect magic number or version
- **Operation errors** - A remove, replace, or add operation is invalid
- **Compression errors** - A compressed index cannot be read or written
- **Integrity errors** - Validation fails

## Safety Features

- **Atomic operations** - File modifications are applied atomically
- **Validation checks** - Index integrity verified before and after operations
- **Backup recommendations** - Always back up index files before running operations
- **Rollback capability** - Failed operations leave index unchanged
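
One common way to get this kind of atomic, all-or-nothing update is to write the new contents to a temporary file next to the target and then rename it over the original. The sketch below shows that general technique using only the standard library; it is not necessarily how `rs-bgextractfile` implements it, and the function name `replace_file_atomically` and the `.tmp` suffix are hypothetical.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Replace `target` with `new_contents` via a temp file and a rename.
/// ASSUMPTION: the temp file lives in the same directory (and therefore the
/// same filesystem) as the target, so the rename replaces it in one step on
/// POSIX systems.
pub fn replace_file_atomically(target: &Path, new_contents: &[u8]) -> io::Result<()> {
    // Build "<target>.tmp" without losing non-UTF-8 path bytes.
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    let result = (|| -> io::Result<()> {
        let mut tmp = File::create(&tmp_path)?;
        tmp.write_all(new_contents)?;
        // Flush data to disk before the rename makes it visible.
        tmp.sync_all()?;
        fs::rename(&tmp_path, target)
    })();

    // On any failure the original file is untouched; just clean up the temp.
    if result.is_err() {
        let _ = fs::remove_file(&tmp_path);
    }
    result
}
```

Because the original file is only swapped out by the final rename, a failure at any earlier step leaves the index as it was.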

## Performance Considerations

### Large Indexes (100K+ files)
- Memory usage scales with number of files
- Compression reduces I/O for compressed indexes
- Batch operations recommended for multiple changes

### Concurrent Access
- Use file locking if more than one process may access the index at the same time
- Recommended to run operations during maintenance windows
- Validate index after concurrent modifications

## Migration from Original bgextractfile

This Rust implementation covers the functionality of the original C++ `bgextractfile`, but some option names differ. In particular, `-r` meant replace in the original and means remove here:

| Original Option | Rust Equivalent | Notes |
|-----------------|-----------------|-------|
| `-x, --extract` | `-r, --remove` | Remove files from index |
| `-r, --replace` | `--replace` | Replace file paths (no short flag; `-r` is now remove) |
| `-v, --verbose` | `-v, --verbose` | Verbose output |
| `-d, --debug` | `-d, --debug` | Debug logging |

## Testing

Use the provided test script to verify functionality:
```bash
bash /workspace/test_rs_bgextractfile.sh
```

## Examples

### Complete Workflow Example

```bash
# 1. List current files in index
rs-bgextractfile list -i index.bgi

# 2. Remove obsolete files
rs-bgextractfile -i index.bgi -r obsolete.log,old_backup.txt

# 3. Replace moved files
rs-bgextractfile -i index.bgi --replace /old/path/config.txt:/new/path/config.txt,/old/path/data.csv:/new/path/data.csv

# 4. Add new files
rs-bgextractfile -i index.bgi -a new_important.log,latest_data.json

# 5. Validate index integrity
rs-bgextractfile validate -i index.bgi -v

# 6. Verify final state
rs-bgextractfile list -i index.bgi --pattern "*.log"
```

## Troubleshooting

### Common Issues

1. **"Invalid magic number"**
   - Index file may be corrupted or wrong format
   - Check file with `file` command
   - Validate with `rs-bgextractfile validate`

2. **"File not found" during operations**
   - Ensure exact path match (case-sensitive)
   - Use `rs-bgextractfile list` to see current paths

3. **Permission errors**
   - Ensure write permissions on index file
   - Run as user with appropriate permissions

4. **Index corruption after operations**
   - Restore from backup
   - Run `rs-bgextractfile validate` to identify issues

### Validation Errors

- **Duplicate file IDs** - Run integrity validation
- **Non-sequential IDs** - Re-index may be required
- **Path validation failures** - Check for empty or invalid paths
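
The three checks above can be sketched as a single pass over the parsed entries. This is an illustration only: it assumes "sequential" means each ID is exactly one greater than the previous entry's ID, which this document does not define, and the names `Issue` and `validate_entries` are hypothetical.

```rust
use std::collections::HashSet;

/// Problems a fileid_map check might report.
#[derive(Debug, PartialEq)]
pub enum Issue {
    DuplicateId(u32),
    NonSequentialId { expected: u32, found: u32 },
    EmptyPath(u32),
}

/// Check (id, path) pairs in file order and collect every issue found.
/// ASSUMPTION: IDs are expected to increase by exactly 1 from entry to entry.
pub fn validate_entries(entries: &[(u32, &str)]) -> Vec<Issue> {
    let mut issues = Vec::new();
    let mut seen = HashSet::new();
    let mut prev: Option<u32> = None;

    for &(id, path) in entries {
        if !seen.insert(id) {
            // The ID already appeared earlier in the map.
            issues.push(Issue::DuplicateId(id));
        } else if let Some(p) = prev {
            let expected = p.wrapping_add(1);
            if id != expected {
                issues.push(Issue::NonSequentialId { expected, found: id });
            }
        }
        if path.trim().is_empty() {
            issues.push(Issue::EmptyPath(id));
        }
        prev = Some(id);
    }
    issues
}
```

Collecting all issues instead of stopping at the first one gives a fuller picture of how damaged a map is before deciding between restoring a backup and re-indexing.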

## Contributing

This tool is part of the BigGrep Rust implementation. For issues or contributions, please refer to the main project repository.

## License

MIT License - See project root for details.

## See Also

- [BigGrep Documentation](../docs/)
- [bgindex](../rs-bgindex/) - Index building tool
- [bgsearch](../rs-bgsearch/) - Search tool
- [bgparse](../rs-bgparse/) - Index parsing tool
- [bgverify](../rs-bgverify/) - Verification tool