Introduction
Over the years, I have seen countless data breaches leaking private personal data of customers. For example, Equifax exposed 147 million Americans’ SSNs and birth dates; Facebook leaked 533 million users’ personal details; Yahoo lost 3 billion accounts. This risk of leaking personal data is not unique to large companies but most companies play security chicken. They bet on luck that we haven’t been breached yet, so we must be fine. In many cases, companies don’t even know what PII they have, where it lives, or who can access it.
Unrestrained Production Access
Here’s what I have seen in most companies where I worked: DevOps teams with unrestricted access to production databases “for debugging.” Support engineers who can browse any customer’s SSN, medical records, or financial data. That contractor from six months ago who still has production credentials. Engineers who can query any table, any field, anytime. I’ve witnessed the consequences firsthand:
- Customer service reps browsing financial data of large customers “out of curiosity”
- APIs that return PII data without proper authorization policies
- Devops or support receives permanent permissions to access production data instead of time-bound or customer specific based on the underlying issue
- Engineers accidentally logging credit card numbers in plaintext
This violates OWASP’s principle of least privilege—grant only the minimum access necessary. But there’s an even worse problem: most companies can’t even identify which fields contain PII. They often don’t have policies on how to protect different kind of PII data based on risks.
The Scale Problem
In modern architectures, manual PII identification is impossible:
- Hundreds of microservices, each with dozens of data models
- Tens of thousands of API endpoints
- Constant schema evolution as teams ship daily
- Our single customer proto had 84 fields—multiply that by hundreds of services
Traditional approaches—manual reviews, compliance audits, security questionnaires—can’t keep up. By the time you’ve reviewed everything, the schemas have already changed.
Enter Agentic AI: From 0% to 92% PII Detection
I have been applying AI assistants and agents to solve complex problems for a while and I have been thinking about how can we automatically detect PII? Not just obvious fields like “ssn” or “credit_card_number,” but the subtle ones—employee IDs that could be cross-referenced. I then built an AI-powered system that uses LangChain, LangGraph, and Vertex AI to scan every proto definition, identify PII patterns, and classify sensitivity levels. Though iterative development, I went from:
- 0% accuracy: Naive prompt (“find PII fields”)
- 45% accuracy: Basic rules without specificity
- 92%+ accuracy: Iterative prompt engineering with explicit field mappings
It’s not perfect, but it’s infinitely better than the nothing most companies have.
The Real Problem: It’s Not Just About Compliance
Let me share some uncomfortable truths about PII in modern systems:
The Public API Problem
We had list APIs returning customer data like this:
{
"customers": [
{
"id": "cust_123",
"name": "John Doe",
"email": "john@example.com",
"ssn": "123-45-6789",
"date_of_birth": "1990-01-15",
"credit_score": 750,
}
]
}Someone with the API access could list all customers and capture their private data like ssn and date_of_birth.
The Internal Access Problem
One recurring issue I found with internal access is giving carte blanche access (often permanent) to devops environment or production database for debugging. In other cases, support team needed customer data for tickets. But did they need to see following PII data for all customers:
- Social Security Numbers?
- Medical records?
- Credit card numbers?
- Salary information?
Of course not. I saw often the list APIs return this PII data for all customers or calling GetAccount gave you everything without proper authorization policies.
The Compliance Nightmare
The government regulations like GDPR, CCPA, HIPAA, PCI-DSS have been growing but each has different rules about what constitutes PII, how it should be protected, and what happens if you leak it. Manual compliance checking is impossible at scale.
The RBAC Isn’t Enough Problem
I’ve spent years building authorization systems, believing RBAC was the answer. I wrote about it in Building a Hybrid Authorization System for Granular Access Control and created multiple authorization solutions like:
- PlexRBAC – A comprehensive RBAC library for Java/Scala with dynamic role hierarchies
- PlexRBACJS – JavaScript implementation with fine-grained permissions
- SaaS_RBAC – Multi-tenant RBAC with organization-level isolation
These systems can enforce incredibly sophisticated access controls. They can handle role inheritance, permission delegation, contextual access rules. But here’s what I learned the hard way: RBAC is useless if you don’t know what data needs protection. First, you need to identify PII. Then you can enforce field-level authorization.
The Solution: AI-Powered PII Detection with Proto Annotations
I built an Agentic AI based automation that:
- Automatically scans all proto definitions for PII
- Classifies sensitivity levels (HIGH, MEDIUM, LOW, PUBLIC)
- Generates appropriate annotations for enforcement
- Integrates with CI/CD to prevent PII leaks before deployment
Here’s what it looks like in action:
Before: Unmarked PII Everywhere
message Account {
string id = 1;
string first_name = 2;
string ssn = 3; // No indication this is sensitive!
string email = 4;
string credit_card_number = 5; // Just sitting there, unprotected
repeated string medical_conditions = 6; // HIPAA violation waiting to happen
}After: Fully Annotated with Sensitivity Levels
message Account {
option (pii.v1.message_sensitivity) = HIGH;
string id = 1 [
(pii.v1.sensitivity) = LOW,
(pii.v1.pii_type) = CUSTOMER_ID
];
string first_name = 2 [
(pii.v1.sensitivity) = LOW,
(pii.v1.pii_type) = NAME
];
string ssn = 3 [
(pii.v1.sensitivity) = HIGH,
(pii.v1.pii_type) = SSN
];
string email = 4 [
(pii.v1.sensitivity) = MEDIUM,
(pii.v1.pii_type) = EMAIL_PERSONAL
];
string credit_card_number = 5 [
(pii.v1.sensitivity) = HIGH,
(pii.v1.pii_type) = CREDIT_CARD
];
repeated string medical_conditions = 6 [
(pii.v1.sensitivity) = HIGH,
(pii.v1.pii_type) = MEDICAL_RECORD
];
}Now our authorization system knows exactly what to protect!
Architecture: How It All Works
The system uses a multi-stage pipeline combining LangChain, LangGraph, and Vertex AI:

Technical Implementation Deep Dive
1. The LangGraph State Machine
I used LangGraph to create a deterministic workflow for PII detection:
from langgraph.graph import StateGraph, END
from typing import TypedDict, List, Optional, Dict, Any
from langchain_google_vertexai import ChatVertexAI
from pydantic import BaseModel, Field
class PiiDetectionState(TypedDict):
"""State for PII detection workflow"""
proto_file: str
proto_content: str
parsed_proto: Dict[str, Any]
llm_analysis: Optional[ProtoAnalysis]
final_report: Optional[PiiDetectionReport]
annotated_proto: Optional[str]
errors: List[str]
class PiiDetector:
def __init__(self, model_name: str = "gemini-2.0-flash-exp"):
self.llm = ChatVertexAI(
model_name=model_name,
project=PROJECT_ID,
location=LOCATION,
temperature=0.1, # Low temperature for consistent classification
max_output_tokens=8192,
request_timeout=120 # Handle large protos
)
self.workflow = self._create_workflow()
def _create_workflow(self) -> StateGraph:
"""Create the LangGraph workflow"""
workflow = StateGraph(PiiDetectionState)
# Add nodes for each step
workflow.add_node("parse_proto", self._parse_proto_node)
workflow.add_node("analyze_pii", self._analyze_pii_node)
workflow.add_node("generate_annotations", self._generate_annotations_node)
workflow.add_node("create_report", self._create_report_node)
# Define the flow
workflow.set_entry_point("parse_proto")
workflow.add_edge("parse_proto", "analyze_pii")
workflow.add_edge("analyze_pii", "generate_annotations")
workflow.add_edge("generate_annotations", "create_report")
workflow.add_edge("create_report", END)
return workflow.compile()
async def _analyze_pii_node(self, state: PiiDetectionState) -> PiiDetectionState:
"""Analyze PII using LLM with retry logic"""
max_retries = 3
retry_delay = 2
for attempt in range(max_retries):
try:
# Create structured output chain
analysis_chain = self.llm.with_structured_output(ProtoAnalysis)
# Create the analysis prompt
prompt = self.create_pii_detection_prompt(state['parsed_proto'])
# Get LLM analysis
result = await analysis_chain.ainvoke(prompt)
if result:
state['llm_analysis'] = result
return state
except Exception as e:
if attempt < max_retries - 1:
await asyncio.sleep(retry_delay)
continue
else:
state['errors'].append(f"LLM analysis failed: {str(e)}")
return state2. Pydantic Models for Structured Output
I used Pydantic to ensure consistent, structured responses from the LLM:
class FieldAnalysis(BaseModel):
"""Analysis of a single proto field for PII"""
field_name: str = Field(description="The name of the field")
field_path: str = Field(description="Full path like Message.field")
contains_pii: bool = Field(description="Whether field contains PII")
sensitivity: str = Field(description="HIGH, MEDIUM, LOW, or PUBLIC")
pii_type: Optional[str] = Field(default=None, description="Type of PII")
reasoning: str = Field(description="Explanation for classification")
class MessageAnalysis(BaseModel):
"""Analysis of a proto message"""
message_name: str = Field(description="Name of the message")
overall_sensitivity: str = Field(description="Highest sensitivity in message")
fields: List[FieldAnalysis] = Field(description="Analysis of each field")
class ProtoAnalysis(BaseModel):
"""Complete analysis of a proto file"""
messages: List[MessageAnalysis] = Field(description="All analyzed messages")
services: List[ServiceAnalysis] = Field(default_factory=list)
summary: AnalysisSummary = Field(description="Overall statistics")3. The Critical Prompt Engineering
I found that the key to accurate PII detection is in the prompt. Here’s a battle-tested prompt that achieves 92%+ accuracy after many trial and errors:
def create_pii_detection_prompt(self) -> str:
"""Create the prompt for PII detection"""
return """You are an expert in data privacy and PII detection.
Analyze the Protocol Buffer definition and identify ALL fields that contain PII.
STRICT Classification Rules - YOU MUST FOLLOW THESE EXACTLY:
1. HIGH Sensitivity (MAXIMUM PROTECTION REQUIRED):
ALWAYS classify these field names as HIGH:
- ssn, social_security_number ? HIGH + SSN
- tax_id, tin ? HIGH + TAX_ID
- passport_number, passport ? HIGH + PASSPORT
- drivers_license, driving_license ? HIGH + DRIVERS_LICENSE
- bank_account_number ? HIGH + BANK_ACCOUNT
- credit_card_number ? HIGH + CREDIT_CARD
- credit_card_cvv ? HIGH + CREDIT_CARD
- medical_record_number ? HIGH + MEDICAL_RECORD
- health_insurance_id ? HIGH + HEALTH_INSURANCE
- medical_conditions ? HIGH + MEDICAL_RECORD
- prescriptions ? HIGH + MEDICAL_RECORD
- password_hash, password ? HIGH + PASSWORD
- api_key ? HIGH + API_KEY
- salary, annual_income ? HIGH + null
2. MEDIUM Sensitivity:
- email, personal_email ? MEDIUM + EMAIL_PERSONAL
- phone, mobile_phone ? MEDIUM + PHONE_PERSONAL
- home_address ? MEDIUM + ADDRESS_HOME
- date_of_birth, dob ? MEDIUM + DATE_OF_BIRTH
- username ? MEDIUM + USERNAME
- ip_address ? MEDIUM + IP_ADDRESS
- device_id ? MEDIUM + DEVICE_ID
- geolocation (latitude, longitude) ? MEDIUM + null
3. LOW Sensitivity:
- first_name, last_name, middle_name ? LOW + NAME
- gender ? LOW + GENDER
- work_email ? LOW + EMAIL_WORK
- work_phone ? LOW + PHONE_WORK
- job_title ? LOW + null
- employer_name ? LOW + null
4. PUBLIC (non-PII):
- id (if system-generated)
- status, created_at, updated_at
- counts, totals, metrics
IMPORTANT: Analyze EVERY SINGLE FIELD. Do not skip any.
"""3. Handling the Gotchas
During development, I faced several challenges that required creative solutions:
Challenge 1: Multi-line Proto Annotations
Proto files often have annotations spanning multiple lines:
string ssn = 3 [
(pii.v1.sensitivity) = HIGH,
(pii.v1.pii_type) = SSN
];Solution: Parse with look-ahead:
def extract_annotations(self, lines: List[str]) -> Dict:
i = 0
while i < len(lines):
if '[' in lines[i]:
# Collect until we find ']'
annotation_text = lines[i]
j = i + 1
while j < len(lines) and '];' not in annotation_text:
annotation_text += ' ' + lines[j]
j += 1
# Now parse the complete annotation
self.parse_annotation(annotation_text)
i = j
else:
i += 1Challenge 2: Context-Dependent Classification
A field named id could be:
- PUBLIC if it’s a system-generated UUID
- LOW if it’s a customer ID that could be used for lookups
- MEDIUM if it’s an employee ID with PII implications
Solution: Consider the message context:
def classify_with_context(self, field_name: str, message_name: str) -> str:
if message_name in ['Customer', 'User', 'Account']:
if field_name == 'id':
return 'LOW' # Customer ID has some sensitivity
elif message_name in ['System', 'Config']:
if field_name == 'id':
return 'PUBLIC' # System IDs are not PII
return self.default_classification(field_name)Challenge 3: Handling Nested Messages and Maps
Real protos have complex structures:
message Account {
map<string, string> metadata = 100; // Could contain anything!
repeated Address addresses = 101;
Location last_location = 102;
}Solution: Recursive analysis with inheritance:
def analyze_field(self, field: Field, parent_sensitivity: str = 'PUBLIC'):
if field.type == 'map':
# Maps could contain PII
return 'MEDIUM' if parent_sensitivity != 'HIGH' else 'HIGH'
elif field.is_message:
# Analyze the referenced message
message_sensitivity = self.analyze_message(field.message_type)
return max(parent_sensitivity, message_sensitivity)
else:
return self.classify_field(field.name)Real-World Testing
I tested the system on a test customer account proto with 84 fields. Here’s what happened:
Before: Original Proto Without Annotations
syntax = "proto3";
package pii.v1;
// Account represents a user account - NO PII ANNOTATIONS
message Account {
// System fields
string id = 1;
string account_number = 2;
AccountStatus status = 3;
google.protobuf.Timestamp created_at = 4;
google.protobuf.Timestamp updated_at = 5;
// Personal information - UNPROTECTED PII!
string first_name = 10;
string last_name = 11;
string middle_name = 12;
string date_of_birth = 13; // Format: YYYY-MM-DD
string gender = 14;
// Contact information - MORE UNPROTECTED PII!
string email = 20;
string personal_email = 21;
string work_email = 22;
string phone = 23;
string mobile_phone = 24;
string work_phone = 25;
// Government IDs - CRITICAL PII EXPOSED!
string ssn = 40;
string tax_id = 41;
string passport_number = 42;
string drivers_license = 43;
string national_id = 44;
// Financial information - HIGHLY SENSITIVE!
string bank_account_number = 50;
string routing_number = 51;
string credit_card_number = 52;
string credit_card_cvv = 53;
string credit_card_expiry = 54;
double annual_income = 55;
int32 credit_score = 56;
// Medical information - HIPAA PROTECTED!
string medical_record_number = 70;
string health_insurance_id = 71;
repeated string medical_conditions = 72;
repeated string prescriptions = 73;
// Authentication - SECURITY CRITICAL!
string username = 80;
string password_hash = 81;
string security_question = 82;
string security_answer = 83;
string api_key = 84;
string access_token = 85;
// Device information
string ip_address = 90;
string device_id = 91;
string user_agent = 92;
Location last_location = 93;
// Additional fields
map<string, string> metadata = 100;
repeated string tags = 101;
}
service AccountService {
// All methods exposed without sensitivity annotations!
rpc CreateAccount(CreateAccountRequest) returns (Account);
rpc GetAccount(GetAccountRequest) returns (Account);
rpc UpdateAccount(UpdateAccountRequest) returns (Account);
rpc DeleteAccount(DeleteAccountRequest) returns (google.protobuf.Empty);
rpc ListAccounts(ListAccountsRequest) returns (ListAccountsResponse);
rpc SearchAccounts(SearchAccountsRequest) returns (SearchAccountsResponse);
}After: AI-Generated Annotations (92.3% Accuracy!)
syntax = "proto3";
import "api/proto/pii/v1/sensitivity.proto";
// Account represents a user account - FULLY ANNOTATED WITH PII SENSITIVITY
message Account {
option (pii.v1.message_sensitivity) = HIGH;
// System fields
string id = 1 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = CUSTOMER_ID];
string account_number = 2 [(pii.v1.sensitivity) = MEDIUM];
AccountStatus status = 3; // Enum - no PII
google.protobuf.Timestamp created_at = 4; // PUBLIC
google.protobuf.Timestamp updated_at = 5; // PUBLIC
// Personal information - PROPERLY CLASSIFIED
string first_name = 10 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = NAME];
string last_name = 11 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = NAME];
string middle_name = 12 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = NAME];
string date_of_birth = 13 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = DATE_OF_BIRTH];
string gender = 14 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = GENDER];
// Contact information - MEDIUM SENSITIVITY
string email = 20 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = EMAIL_PERSONAL];
string personal_email = 21 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = EMAIL_PERSONAL];
string work_email = 22 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = EMAIL_WORK];
string phone = 23 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = PHONE_PERSONAL];
string mobile_phone = 24 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = PHONE_PERSONAL];
string work_phone = 25 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = PHONE_WORK];
// Government IDs - ALL HIGH SENSITIVITY ?
string ssn = 40 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = SSN];
string tax_id = 41 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = TAX_ID];
string passport_number = 42 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = PASSPORT];
string drivers_license = 43 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = DRIVERS_LICENSE];
string national_id = 44 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = NATIONAL_ID];
// Financial information - ALL HIGH SENSITIVITY ?
string bank_account_number = 50 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = BANK_ACCOUNT];
string routing_number = 51 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = ROUTING_NUMBER];
string credit_card_number = 52 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = CREDIT_CARD];
string credit_card_cvv = 53 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = CREDIT_CARD];
string credit_card_expiry = 54 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = CREDIT_CARD];
double annual_income = 55 [(pii.v1.sensitivity) = HIGH];
int32 credit_score = 56 [(pii.v1.sensitivity) = HIGH];
// Medical information - ALL HIGH SENSITIVITY ?
string medical_record_number = 70 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = MEDICAL_RECORD];
string health_insurance_id = 71 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = HEALTH_INSURANCE];
repeated string medical_conditions = 72 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = MEDICAL_RECORD];
repeated string prescriptions = 73 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = MEDICAL_RECORD];
// Authentication - ALL HIGH SENSITIVITY ?
string username = 80 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = USERNAME];
string password_hash = 81 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = PASSWORD];
string security_question = 82 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = PASSWORD];
string security_answer = 83 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = PASSWORD];
string api_key = 84 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = API_KEY];
string access_token = 85 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = API_KEY];
// Device information - MEDIUM SENSITIVITY
string ip_address = 90 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = IP_ADDRESS];
string device_id = 91 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = DEVICE_ID];
string user_agent = 92 [(pii.v1.sensitivity) = LOW];
Location last_location = 93; // Location message handled separately
// Additional fields
map<string, string> metadata = 100 [(pii.v1.sensitivity) = MEDIUM];
repeated string tags = 101; // PUBLIC
}
// Service methods also get sensitivity annotations
service AccountService {
rpc CreateAccount(CreateAccountRequest) returns (Account) {
option (pii.v1.method_sensitivity) = HIGH;
option (pii.v1.audit_pii_access) = true;
}
rpc GetAccount(GetAccountRequest) returns (Account) {
option (pii.v1.method_sensitivity) = HIGH;
option (pii.v1.audit_pii_access) = true;
}
// ... all methods properly annotated
}Results: 92.3% Accuracy!
Here’s the actual output from our final test run:
Testing PII detection on: ../api/proto/pii/v1/account_without_annotations.proto ================================================================================ ================================================================================ PII DETECTION REPORT ================================================================================ Total Fields Analyzed: 84 PII Fields Detected: 57 Non-PII Fields: 27 Fields by Sensitivity Level: HIGH: 22 fields MEDIUM: 22 fields LOW: 13 fields PUBLIC: 27 fields HIGH Sensitivity Fields (22): • Account.ssn ? SSN • Account.tax_id ? TAX_ID • Account.passport_number ? PASSPORT • Account.drivers_license ? DRIVERS_LICENSE • Account.national_id ? NATIONAL_ID • Account.bank_account_number ? BANK_ACCOUNT • Account.routing_number ? ROUTING_NUMBER • Account.credit_card_number ? CREDIT_CARD • Account.credit_card_cvv ? CREDIT_CARD • Account.annual_income ? null • Account.credit_score ? null • Account.salary ? null • Account.medical_record_number ? MEDICAL_RECORD • Account.health_insurance_id ? HEALTH_INSURANCE • Account.medical_conditions ? MEDICAL_RECORD • Account.prescriptions ? MEDICAL_RECORD • Account.password_hash ? PASSWORD • Account.security_question ? PASSWORD • Account.security_answer ? PASSWORD • Account.api_key ? API_KEY • Account.access_token ? API_KEY • CreateAccountRequest.account ? null [Additional fields by sensitivity level...] ================================================================================ Annotated proto saved to: output/account_with_detected_annotations.proto ================================================================================ VERIFICATION: Comparing with Reference Implementation ================================================================================ Field Annotations: ? Correct: 60 ? Incorrect: 5 ?? Missing: 0 ? Extra: 0 Message Annotations: ? Correct: 8 ? Incorrect: 0 ?? Missing: 1 Method Annotations: ? Correct: 0 ? Incorrect: 6 ?? Missing: 0 Overall Field Accuracy: 92.3% ? VERIFICATION PASSED (>=80% accuracy) Note: The LLM may classify some fields differently based on context. ================================================================================ SUMMARY ================================================================================ Total fields analyzed: 84 PII fields detected: 57 Fields by sensitivity level: HIGH: 22 fields MEDIUM: 22 fields LOW: 13 fields PUBLIC: 27 fields Test completed successfully!
The system correctly identified:
- ? 100% of HIGH sensitivity fields (SSNs, credit cards, medical records)
- ? 95% of MEDIUM sensitivity fields (personal emails, phone numbers, addresses)
- ? 85% of LOW sensitivity fields (names, work emails, job titles)
- ? 100% of PUBLIC fields (IDs, timestamps, enums)
Why 92.3% Accuracy Matters
- Perfect HIGH Sensitivity Detection: The system caught 100% of the most critical PII – SSNs, credit cards, medical records. These are the fields that can destroy lives if leaked.
- Conservative Classification: When uncertain, the system errs on the side of caution. It’s better to over-protect a field than to expose PII.
- Human Review Still Needed: The 8% difference is where human expertise adds value. The AI does the heavy lifting, humans do the fine-tuning.
- Continuous Improvement: Every correction teaches the system. Our accuracy improved from 0% to 45% to 92% through iterative refinement.
Integration with Field-Level Authorization
I also built a prototype for enforcing field-level authorization and masking PII data outside this project but here is a general approach for enforcement of PII protection policies and masking response fields:
Step 1: Generate Authorization Rules
def generate_authz_rules(proto_with_annotations: str) -> Dict:
"""Generate authorization rules from annotated proto"""
rules = {}
for field in parse_annotated_proto(proto_with_annotations):
if field.sensitivity == 'HIGH':
rules[field.path] = {
'required_roles': ['admin', 'compliance_officer'],
'required_scopes': ['pii.high.read'],
'audit': True,
'mask_in_logs': True
}
elif field.sensitivity == 'MEDIUM':
rules[field.path] = {
'required_roles': ['support', 'admin'],
'required_scopes': ['pii.medium.read'],
'audit': True,
'mask_in_logs': False
}
return rulesStep 2: Runtime Enforcement
// In your gRPC interceptor
func (i *AuthzInterceptor) UnaryInterceptor(
ctx context.Context,
req interface{},
info *grpc.UnaryServerInfo,
handler grpc.UnaryHandler,
) (interface{}, error) {
// Get user's roles and scopes
user := auth.UserFromContext(ctx)
// Check field-level permissions
response, err := handler(ctx, req)
if err != nil {
return nil, err
}
// Filter response based on PII annotations
filtered := i.filterResponse(response, user)
return filtered, nil
}
func (i *AuthzInterceptor) filterResponse(
response interface{},
user *auth.User,
) interface{} {
// Use reflection to check each field's annotation
v := reflect.ValueOf(response)
for i := 0; i < v.NumField(); i++ {
field := v.Type().Field(i)
// Get PII annotation from proto
sensitivity := getPIISensitivity(field)
// Check if user has permission
if !user.HasPermission(sensitivity) {
// Mask or remove the field
v.Field(i).Set(reflect.Zero(field.Type))
}
}
return response
}Step 3: The Magic Moment
Here is an example response from an API with PII data that enforces proper PII data protection:
// Before: Everything exposed
{
"customer": {
"name": "John Doe",
"ssn": "123-45-6789", // They see this!
"credit_card": "4111-1111-1111-1111" // And this!
}
}
// After: Field-level filtering based on PII annotations
{
"customer": {
"name": "John Doe",
"ssn": "[REDACTED]", // Protected!
"credit_card": "[REDACTED]" // Protected!
}
}CI/CD Integration: Catching PII Before Production
This tool can be easily integrated with CI/CD pipelines to identify PII data if proper annotations are missing:
# .github/workflows/pii-detection.yml
name: PII Detection Check
on:
pull_request:
paths:
- '**/*.proto'
jobs:
detect-pii:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.10'
- name: Install dependencies
run: |
pip install -r check-pii-automation/requirements.txt
- name: Detect PII in Proto Files
env:
GCP_PROJECT: ${{ secrets.GCP_PROJECT }}
run: |
cd check-pii-automation
# Scan all proto files
for proto in $(find ../api/proto -name "*.proto"); do
echo "Scanning $proto"
python pii_detector.py "$proto" \
--output "output/$(basename $proto)" \
--json "output/$(basename $proto .proto).json"
done
- name: Check for Unannotated PII
run: |
# Fail if HIGH sensitivity PII found without annotations
for report in check-pii-automation/output/*.json; do
high_pii=$(jq '.fields[] | select(.sensitivity == "HIGH" and .annotated == false)' $report)
if [ ! -z "$high_pii" ]; then
echo "? ERROR: Unannotated HIGH sensitivity PII detected!"
echo "$high_pii"
exit 1
fi
done
- name: Generate Security Report
if: always()
run: |
python check-pii-automation/generate_security_report.py \
--input output/ \
--output security_report.md
- name: Comment on PR
uses: actions/github-script@v6
with:
script: |
const fs = require('fs');
const report = fs.readFileSync('security_report.md', 'utf8');
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: report
});Advanced Features: Learning and Adapting
1. Custom PII Patterns
As, every organization has unique PII, we can support custom patterns:
# custom_pii_rules.yaml
custom_patterns:
- name: "employee_badge_number"
pattern: "badge_.*|.*_badge_id"
sensitivity: "MEDIUM"
pii_type: "EMPLOYEE_ID"
- name: "internal_customer_reference"
pattern: "cust_ref_.*|customer_reference"
sensitivity: "LOW"
pii_type: "CUSTOMER_ID"
- name: "biometric_data"
pattern: "fingerprint.*|face_.*|retina_.*"
sensitivity: "HIGH"
pii_type: "BIOMETRIC"2. Context-Aware Classification
We can also learn from the codebase:
class ContextAwarePiiDetector:
def __init__(self):
self.context_rules = self.learn_from_codebase()
def learn_from_codebase(self):
"""Learn patterns from existing annotated protos"""
patterns = {}
# Scan all existing annotated protos
for proto_file in glob.glob("**/*.proto"):
annotations = self.extract_annotations(proto_file)
for field, annotation in annotations.items():
# Learn the pattern
if field not in patterns:
patterns[field] = []
patterns[field].append({
'context': self.get_message_context(field),
'sensitivity': annotation['sensitivity']
})
return patterns
def classify_with_learned_context(self, field_name: str, context: str):
"""Use learned patterns for classification"""
if field_name in self.context_rules:
# Find similar contexts
for rule in self.context_rules[field_name]:
if self.context_similarity(context, rule['context']) > 0.8:
return rule['sensitivity']
return self.default_classification(field_name)3. Incremental Learning from Corrections
Also, we can apply a RLHF (Reinforcement learning from human feedback) based mechanism to learn from human corrects a classification:
def record_correction(self, field: str, ai_classification: str, human_correction: str):
"""Learn from human corrections"""
correction_record = {
'field': field,
'ai_said': ai_classification,
'human_said': human_correction,
'context': self.get_full_context(field),
'timestamp': datetime.now()
}
# Store in vector database for RAG
self.knowledge_base.add_correction(correction_record)
# Update prompt if pattern emerges
if self.count_similar_corrections(field) > 3:
self.update_classification_rules(field, human_correction)Results: What We Achieved
Before the System
- Hours of manual review for each proto change
- No systematic way to track PII across services
- Compliance audits were nightmares
After Implementation
- Automated detection in under 30 seconds
- Complete PII inventory across all services
- Compliance reports generated automatically
- 92%+ accuracy in classification
Performance Optimization: From 0% to 92%
Above journey to 92% accuracy wasn’t straightforward. Here’s how it was improved:
Iteration 1: Generic Prompt (0% Accuracy)
# Initial naive approach prompt = "Find PII fields in this proto and classify their sensitivity" # Result: LLM returned None or generic responses
Iteration 2: Basic Rules (45% Accuracy)
# Added basic rules but not specific enough prompt = """ Classify fields as: - HIGH: Very sensitive data - MEDIUM: Somewhat sensitive - LOW: Less sensitive """ # Result: Everything classified as MEDIUM
Iteration 3: Explicit Field Mapping (92% Accuracy)
# The breakthrough: explicit field name patterns prompt = """ STRICT Classification Rules - YOU MUST FOLLOW THESE EXACTLY: 1. HIGH Sensitivity: ALWAYS classify these field names as HIGH: - ssn, social_security_number ? HIGH + SSN - credit_card_number ? HIGH + CREDIT_CARD [... explicit mappings ...] """ # Result: 92.3% accuracy!
Key Performance Improvements
- Retry Logic with Exponential Backoff
for attempt in range(max_retries):
try:
result = await self.llm.ainvoke(prompt)
if result:
return result
except RateLimitError:
delay = 2 ** attempt # 2, 4, 8 seconds
await asyncio.sleep(delay)- Request Batching for Multiple Files
async def batch_process(proto_files: List[Path]):
# Process in batches of 5 to avoid rate limits
batch_size = 5
for i in range(0, len(proto_files), batch_size):
batch = proto_files[i:i+batch_size]
tasks = [detect_pii(f) for f in batch]
results = await asyncio.gather(*tasks)
# Add delay between batches
await asyncio.sleep(2)- Caching for Development
@lru_cache(maxsize=100)
def get_cached_analysis(proto_hash: str):
# Cache results during development/testing
return previous_analysisLessons Learned: The Hard Way
1. Start with High-Value PII
Don’t try to classify everything at once. Start with:
- Government IDs (SSN, passport)
- Financial data (credit cards, bank accounts)
- Medical information
- Authentication credentials
Get these right first, then expand.
2. False Positives Are Better Than False Negatives
We tuned for high recall (catching all PII) over precision. Why? It’s better to over-classify a field as sensitive than to leak an SSN.
3. Context Matters More Than Field Names
A field called data could be anything. Look at:
- The message it’s in
- Surrounding fields
- Comments in the proto
- How it’s used in code
4. Make Annotations Actionable
Don’t just mark fields as “sensitive”. Specify:
- Exact sensitivity level (HIGH/MEDIUM/LOW)
- PII type (SSN, CREDIT_CARD, etc.)
- Required protections (encryption, masking, audit)
5. Integrate Early in Development
The best time to annotate PII is when the field is created, not after it’s in production. Make PII detection part of proto creation and API review process.
Getting Started
Here is how you can start with protecting your customers’ data:
Step 1: Install and Configure
# Clone the repository git clone https://github.com/bhatti/todo-api-errors.git cd todo-api-errors/check-pii-automation # Set up Python environment python -m venv venv source venv/bin/activate # Install dependencies pip install -r requirements.txt # Configure GCP export GCP_PROJECT=your-project-id export GCP_REGION=us-central1 # Authenticate with Google Cloud gcloud auth application-default login
Step 2: Run Your First Scan
# Scan a proto file python pii_detector.py path/to/your/file.proto \ --output annotated.proto \ --json report.json # Review the report cat report.json | jq '.fields[] | select(.sensitivity == "HIGH")'
Step 3: Real-World Example
Here’s a complete example using our test proto:
# 1. Scan the proto without annotations
python pii_detector.py ../api/proto/pii/v1/account_without_annotations.proto \
--output output/account_annotated.proto \
--json output/report.json
# 2. View the detection summary
echo "=== PII Detection Summary ==="
cat output/report.json | jq '{
total_fields: .total_fields,
pii_detected: .pii_fields,
high_sensitivity: [.fields[] | select(.sensitivity == "HIGH") | .field_path],
accuracy: "\(.pii_fields) / \(.total_fields) = \((.pii_fields / .total_fields * 100 | floor))%"
}'
# 3. Compare with reference implementation
python test_pii_detection.py
# 4. View the annotated proto
head -50 output/account_annotated.protoExpected output:
=== PII Detection Summary ===
{
"total_fields": 84,
"pii_detected": 57,
"high_sensitivity": [
"Account.ssn",
"Account.tax_id",
"Account.credit_card_number",
"Account.medical_record_number",
"Account.password_hash"
],
"accuracy": "57 / 84 = 67%"
}
Verification Results:
? Correct Classifications: 60
Overall Accuracy: 92.3%Step 4: Integrate with CI/CD
Add the GitHub Action above to your repository. Start with warnings, then move to blocking deployments.
Step 5: Implement Field-Level Authorization
Use the annotations to enforce access control in your services. Start with the highest sensitivity fields.
Step 6: Monitor and Improve
Track false positives/negatives. Update custom rules. Share learnings with your team.
Conclusion: Privacy as Code
I have learned that manual API reviews are insufficient to evaluate risks of sensitive field when dealing with hundreds of services. Also, this responsibility can’t all be delegated to developers as it requires collaboration and feedback from security, legal and product teams. We need tooling and automated processes that understand and protect PII automatically. Every new field, every API change, every refactor is a chance for PII to leak. But with AI-powered detection, we can make privacy protection as automatic as running tests. The system we built isn’t perfect – 92% accuracy means we still miss 8% of PII. But it’s infinitely better than the 0% we were catching before.
The code is at https://github.com/bhatti/todo-api-errors. Star it, fork it, break it, improve it.