# Data Pipeline

> ⚠️ This documentation is AI-generated and may contain errors.

A data processing pipeline that processes multiple batches in parallel.

## Script

See `/examples/data_pipeline.rsh` in the repository.

```rush
#!/usr/bin/env rush

DATASET = "user_analytics"
INPUT_DIR = "$HOME/data/raw"
OUTPUT_DIR = "$HOME/data/processed"
BATCH_SIZE = "1000"

echo "Data Processing Pipeline: $DATASET"

# Pre-processing stages
for stage in validate clean normalize {
    STAGE_UPPER = "$stage"
    echo " Stage: $STAGE_UPPER"
}

# Process batches in parallel
BATCH_1_IN = "$INPUT_DIR/batch_001.csv"
BATCH_1_OUT = "$OUTPUT_DIR/batch_001.json"
# ... (define other batches)

parallel {
    run {
        echo "[batch_001] Processing $BATCH_1_IN -> $BATCH_1_OUT"
        echo "[batch_001] Transformed 1000 records"
    }

    run {
        echo "[batch_002] Processing $BATCH_2_IN -> $BATCH_2_OUT"
        echo "[batch_002] Transformed 1000 records"
    }
    # ... (more batches)
}

echo "All batches processed successfully"
```

## Key Concepts

- **Parallel data processing**: Process multiple batches simultaneously
- **Path construction**: Building input and output file paths from shared directory variables
- **Pipeline stages**: Sequential setup, parallel processing, sequential summary (sketched below)
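The three-stage shape can be reduced to the skeleton below. This is a minimal sketch, not part of the repository's examples: it reuses only constructs that appear in the script above (variable assignment, `echo`, and `parallel`/`run` blocks), the job name and messages are illustrative placeholders, and it assumes, as the example's final `echo` implies, that execution continues past a `parallel` block only after all of its `run` blocks have finished.

```rush
#!/usr/bin/env rush

# Stage 1: sequential setup - plain assignments and echo, executed top to bottom
JOB = "nightly_refresh"
echo "Setting up $JOB"

# Stage 2: parallel processing - each run block handles one batch's worth of work
parallel {
    run {
        echo "[worker_1] processing its batch"
    }

    run {
        echo "[worker_2] processing its batch"
    }
}

# Stage 3: sequential summary - reached once the parallel section is done
# (assumption: parallel waits for its run blocks, as the example's final echo suggests)
echo "$JOB finished"
```

Keeping the setup and the summary outside the `parallel` block mirrors `data_pipeline.rsh`: anything that must happen exactly once stays sequential, and only the per-batch work is fanned out.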