Files
2026-02-03 15:10:50 -05:00

150 lines
4.8 KiB
Markdown

# Workday Monthly Admin Activities Audit
A Python tool for processing Workday security admin audit reports. It extracts person names from audit data using NLP, filters relevant records, and generates human-readable summaries using a local LLM.
## Overview
This script automates the monthly review of Workday administrative activities by:
1. Extracting person names from unstructured audit text using spaCy NER
2. Filtering and deduplicating audit records
3. Generating natural language summaries of each admin action using LM Studio
## Requirements
### Python Dependencies
```bash
pip install pandas spacy openai requests tqdm
python -m spacy download en_core_web_trf
```
### External Services
- **LM Studio**: A local LLM server running on port 1234. The script will prompt for the IP address if the default is unreachable.
## Configuration
Edit the constants at the top of `process.py`:
| Variable | Default | Description |
|----------|---------|-------------|
| `DEFAULT_LM_IP` | `100.113.108.121` | IP address of LM Studio server |
| `LLM_MODEL` | `openai/gpt-oss-20b` | Model name to use for summarization |
| `INPUT_CSV` | `test.csv` | Input Workday audit report |
| `OUTPUT_CSV` | `test_with_names.csv` | Processed CSV with extracted names |
| `EVENT_LOG` | `event_log.txt` | Plain text event summaries |
| `EVENT_LOG_MD` | `event_log.md` | Markdown formatted event summaries |
| `FINAL_SNAPSHOT` | `final.csv` | Pre-LLM snapshot of filtered data |
## Input CSV Format
The script expects a Workday audit export with these columns:
| Column | Description |
|--------|-------------|
| `Instance that Changed` | Text describing what changed (contains person names) |
| `Added` | Text describing what was added (contains person names) |
| `In Transaction` | Transaction identifier |
| `By User` | The admin who performed the action |
| `Entered On` | Timestamp of the action |
## Processing Pipeline
### Step 1: LM Studio Connection Check
The script first verifies connectivity to LM Studio. If unreachable, it prompts for a new IP address.
### Step 2: spaCy Model Loading
Loads the `en_core_web_trf` transformer model for Named Entity Recognition (NER). Only the NER component is loaded for efficiency.
### Step 3: Name Extraction (NER)
Uses spaCy to extract PERSON entities from audit text:
1. **Process "Added" column first** - Extracts names and populates both `Added Applied to` and `Applied to` columns
2. **Process "Instance that Changed"** - Only for rows where `Applied to` is still empty
Names are cleaned by:
- Removing job codes and fragments after dashes/slashes
- Keeping only alphabetic characters, apostrophes, periods, and hyphens
- Deduplicating within each cell
### Step 4: Data Filtering
The script applies several filters:
1. **Remove "Automatic Complete" rows** - Filters out system-generated entries
2. **Keep matching rows only** - Retains rows where `Applied to` equals `Added Applied to`
3. **Self-action filter** - Removes rows where the person acted on themselves (name appears in `By User`)
4. **Deduplication** - Removes duplicate rows
### Step 5: LLM Summarization
Groups remaining records by:
- `Applied to` (the person affected)
- `By User` (the admin)
- `Entered On` (timestamp)
For each group, sends a prompt to LM Studio requesting a one-sentence compliance summary including:
- Who performed the action
- Who it applied to
- What roles were assigned/added
- The date
### Step 6: Output Generation
Generates three output files:
1. **`test_with_names.csv`** - Processed CSV with extracted names
2. **`event_log.txt`** - Plain text summaries grouped by admin user
3. **`event_log.md`** - Markdown table format for easier reading
## Output Format
### event_log.txt
```
=== 1000054 / Dennis Cregan ===
01/06 - Dennis Cregan assigned the Academic Advising role to Casey Spelman on 1/6/26.
01/07 - Dennis Cregan assigned the PSID role to PSID on 1/7/26.
=== 1003966 / Kristinn Bjarnason ===
01/07 - Kristinn Bjarnason assigned the Professor role to Mary Perrodin on 1/7/26.
```
### event_log.md
```markdown
# Security Admin Audit Event Log
## 1000054 / Dennis Cregan
| Date | Event Summary |
|------|---------------|
| 01/06 | Dennis Cregan assigned the Academic Advising role to Casey Spelman on 1/6/26. |
```
## Usage
```bash
python process.py
```
### Caching Behavior
If `test_with_names.csv` already exists, the script skips CSV processing and jumps directly to LLM summarization. Delete this file to force full reprocessing.
## Error Handling
- **LM Studio unreachable**: Prompts for new IP address
- **LLM context too long**: Records `[LLM ERROR]` in the output and continues
- **Missing columns**: Gracefully handles missing optional columns
## Performance Notes
- The spaCy transformer model requires significant memory (~1GB)
- Progress bars show NER extraction status with live name counts
- LLM calls are sequential (one per event group)