Ontology-Driven Architecture for Compliance Software
At Cohera, we connect pharmaceutical quality systems that were never designed to work together. Veeva Vault stores documents one way. SAP QM tracks materials another way. TrackWise manages CAPAs its own way. Each system has its own schema, its own identifiers, its own assumptions.
The challenge isn't just moving data between systems; it's creating a coherent understanding of what that data means.
This is where ontology-driven architecture comes in.
What Is an Ontology in Software?
In philosophy, ontology is the study of what exists. In software, an ontology is a formal description of the concepts in a domain and the relationships between them.
For our purposes, an ontology defines:
- Objects: The things that exist (Suppliers, Materials, Certificates, Products)
- Properties: What we know about those things (name, status, expiry date)
- Relationships: How things connect to each other (Supplier supplies Material, Certificate covers Material)
- Constraints: What must be true (every Certificate must have an expiry date, every Material must have exactly one primary Supplier)
This isn't just database schema design. It's a semantic model that captures the meaning of data, independent of how any particular system stores it.
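To make the four ingredients concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not Cohera's actual model: the class names, field names, and the `constraint_violations` helper are all hypothetical, and only the two example constraints from the list above are checked.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Supplier:
    # Properties: what we know about the object.
    name: str
    status: str = "active"


@dataclass
class Material:
    name: str
    # Relationship: Supplier supplies Material.
    suppliers: list[Supplier] = field(default_factory=list)
    # A single field allows at most one primary supplier by construction.
    primary_supplier: Optional[Supplier] = None


@dataclass
class Certificate:
    type: str
    expiry_date: Optional[date]
    # Relationship: Certificate covers Material.
    covers: list[Material] = field(default_factory=list)


def constraint_violations(materials, certificates):
    """Return human-readable violations of the two example constraints."""
    problems = []
    for cert in certificates:
        if cert.expiry_date is None:
            problems.append(f"Certificate {cert.type!r} has no expiry date")
    for mat in materials:
        if mat.primary_supplier is None:
            problems.append(f"Material {mat.name!r} has no primary supplier")
    return problems
```

The point of the sketch is that constraints live next to the model rather than inside any one source system, so they can be checked no matter where the data came from.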
Why Ontologies Matter for Integration
When you integrate multiple systems, you face a fundamental problem: each system has its own view of reality.
Veeva Vault sees documents:
Document {
  id: "DOC-123",
  type: "CoA",
  supplier: "SUP-456",
  status: "Approved"
}
SAP QM sees materials:
Material {
  number: "MAT-789",
  description: "Sodium Chloride USP",
  vendor: "V000123"
}
TrackWise sees quality events:
CAPA {
  id: "CAPA-001",
  affected_material: "MAT-789",
  source: "Supplier Audit"
}
These systems don't share identifiers. They don't agree on terminology. They weren't built to understand each other.
An ontology creates a layer above these systems that captures what we actually care about:
Supplier (SUP-456 = V000123)
├── supplies: Material (Sodium Chloride USP)
├── has_document: Certificate (CoA, DOC-123)
└── related_to: CAPA (CAPA-001)
Now we have a unified model where a Supplier is a Supplier, regardless of whether we're looking at Veeva, SAP, or TrackWise.
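The "SUP-456 = V000123" equivalence implies some form of identity resolution. One simple way to picture it is a lookup table from (system, local id) pairs to a canonical ontology id. The sketch below assumes exactly that; the `IdentityMap` class and the canonical id `supplier:1` are invented for illustration, and real matching (deciding that two records are the same supplier in the first place) is out of scope here.

```python
class IdentityMap:
    """Links system-specific identifiers to one canonical ontology id."""

    def __init__(self):
        self._canonical = {}  # (system, local_id) -> canonical id
        self._aliases = {}    # canonical id -> set of (system, local_id)

    def link(self, canonical_id, system, local_id):
        self._canonical[(system, local_id)] = canonical_id
        self._aliases.setdefault(canonical_id, set()).add((system, local_id))

    def resolve(self, system, local_id):
        """Return the canonical id, or None if the identifier is unknown."""
        return self._canonical.get((system, local_id))

    def aliases(self, canonical_id):
        """All known (system, local_id) pairs for a canonical id, sorted."""
        return sorted(self._aliases.get(canonical_id, ()))


ids = IdentityMap()
# "supplier:1" is a made-up canonical id; the source ids come from the examples above.
ids.link("supplier:1", "veeva_vault", "SUP-456")
ids.link("supplier:1", "sap_qm", "V000123")
```

With this in place, `ids.resolve("sap_qm", "V000123")` and `ids.resolve("veeva_vault", "SUP-456")` return the same canonical id, which is what lets the rest of the model treat them as one Supplier.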
The Cohera Ontology
Our ontology captures the pharmaceutical quality domain. Here are some key objects:
Supplier
- Properties: name, status, qualification_date, risk_tier
- Relationships: supplies Materials, has Contacts, has Certificates, has Documents
Material
- Properties: name, description, category, specification
- Relationships: supplied_by Supplier, used_in Products, covered_by Certificates
Certificate
- Properties: type, issue_date, expiry_date, status
- Relationships: covers Materials, issued_by Supplier, stored_in Document_System
Product
- Properties: name, SKU, registration_status
- Relationships: contains Materials, has Documents, subject_to Specifications
Quality_Event
- Properties: type (CAPA, Deviation, OOS), status, due_date
- Relationships: affects Materials, involves Suppliers, documented_in Documents
Schema Evolution and Versioning
Ontologies evolve. When we add new object types or relationships, we need to handle this carefully:
Backward compatibility: Existing data must remain valid. New required properties need defaults.
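Forward compatibility: Integrations built against an older version should tolerate data written under a newer one, for example by ignoring objects and properties they don't recognize.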
Migration paths: Clear procedures for updating the ontology without breaking existing integrations.
We version our ontology like software:
ontology_version: "2.3.0"
- Major (2): Breaking changes to core objects
- Minor (3): New objects or relationships (backward compatible)
- Patch (0): Clarifications or documentation updates
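As a rough sketch of how an integration might act on these rules, the helper below classifies the change between two ontology versions. The function names and the returned labels are assumptions made for this example; it compares fields only and does not care whether the move is an upgrade or a downgrade.

```python
def parse_version(text):
    """Split 'major.minor.patch' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def classify_change(old, new):
    """Label the difference between two ontology versions.

    'breaking'  -> major differs (core objects changed)
    'additive'  -> minor differs (new objects or relationships)
    'docs-only' -> patch differs (clarifications or documentation)
    'none'      -> identical versions
    """
    o, n = parse_version(old), parse_version(new)
    if n[0] != o[0]:
        return "breaking"
    if n[1] != o[1]:
        return "additive"
    if n[2] != o[2]:
        return "docs-only"
    return "none"
```

An integration could, for instance, refuse to sync when the result is "breaking" and carry on otherwise.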
Mapping Systems to the Ontology
For each connected system, we maintain mappings that translate between system-specific schemas and our ontology.
Veeva Vault mapping:
system: veeva_vault
object_mappings:
  Certificate:
    source_type: 'document'
    source_subtype: 'coa__c'
    property_mappings:
      expiry_date: document_expiry_date__c
      status: lifecycle_state__v
      covers: related_material__c
SAP QM mapping:
system: sap_qm
object_mappings:
  Material:
    source_table: MARA
    property_mappings:
      name: MAKTX
      category: MTART
      supplied_by: source_from_EORD_table
These mappings are configuration, not code. When a customer has customized Veeva fields, we update the mapping without changing the integration logic.
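To show why keeping mappings as configuration pays off, here is a small Python sketch in which the translation logic never mentions a Veeva field name. The mapping is the Veeva example above written as a plain dict; the `to_ontology` function and the sample record are hypothetical and stand in for whatever the real integration layer does.

```python
VEEVA_CERTIFICATE_MAPPING = {
    "source_type": "document",
    "source_subtype": "coa__c",
    "property_mappings": {
        "expiry_date": "document_expiry_date__c",
        "status": "lifecycle_state__v",
        "covers": "related_material__c",
    },
}


def to_ontology(record, mapping):
    """Translate one source record into ontology property names.

    Source fields absent from the record are left out rather than guessed,
    and source fields with no mapping are ignored.
    """
    return {
        ontology_name: record[source_field]
        for ontology_name, source_field in mapping["property_mappings"].items()
        if source_field in record
    }


# Hypothetical Veeva record; the values are invented for illustration.
veeva_record = {
    "document_expiry_date__c": "2025-12-31",
    "lifecycle_state__v": "Approved",
    "related_material__c": "MAT-789",
    "unmapped_field__c": "ignored",
}
certificate = to_ontology(veeva_record, VEEVA_CERTIFICATE_MAPPING)
```

If a customer renames `document_expiry_date__c`, only the dict changes; `to_ontology` stays as it is.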
Query Language
With a unified ontology, we can offer powerful cross-system queries:
"Show me all materials from suppliers with expiring certificates"
In our query language:
Supplier
  .where(certificates.any(expiry_date < today() + 90.days))
  .materials
  .include(supplier.name, supplier.certificates.expiry_date)
This query traverses relationships across Veeva (certificates), SAP (materials), and potentially other systems, without the caller needing to know which system holds which data.
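For readers who want to see what such a query amounts to once the data is unified, here is a plain-Python equivalent over in-memory objects. It is only a sketch: the sample suppliers are invented, the function name is hypothetical, and a real engine would push this work down to storage rather than loop in application code. Unlike the query above, it returns only the expiry dates that triggered the match.

```python
from datetime import date, timedelta

# Invented sample data, already resolved into ontology objects.
suppliers = [
    {"name": "Acme Salts", "materials": ["Sodium Chloride USP"],
     "certificates": [{"type": "CoA", "expiry_date": date(2025, 2, 1)}]},
    {"name": "Borealis Chem", "materials": ["Magnesium Stearate"],
     "certificates": [{"type": "CoA", "expiry_date": date(2026, 6, 30)}]},
]


def materials_with_expiring_certificates(suppliers, today, window_days=90):
    """Materials whose supplier has any certificate expiring before
    today + window_days (already-expired certificates match too)."""
    cutoff = today + timedelta(days=window_days)
    rows = []
    for supplier in suppliers:
        expiring = [c["expiry_date"] for c in supplier["certificates"]
                    if c["expiry_date"] < cutoff]
        if expiring:
            for material in supplier["materials"]:
                rows.append({"material": material,
                             "supplier": supplier["name"],
                             "expiry_dates": expiring})
    return rows


rows = materials_with_expiring_certificates(suppliers, today=date(2025, 1, 1))
```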
"What products are affected by this supplier issue?"
Supplier("SUP-123")
  .materials
  .products
  .include(name, registration_status, materials.name)
This is the kind of query that would take hours to answer manually, requiring exports from multiple systems and manual reconciliation.
Audit Trails and the Ontology
In regulated environments, we need to track not just current state but history. Our ontology includes temporal aspects:
Every object has audit metadata:
Material {
  id: "MAT-789",
  name: "Sodium Chloride USP",
  // Current values above, audit trail below
  _created_at: "2024-01-15T10:30:00Z",
  _created_by: "user:alice@pharma.com",
  _modified_at: "2024-06-20T14:22:00Z",
  _modified_by: "user:bob@pharma.com",
  _version: 3,
  _change_reason: "Specification update"
}
Relationships are timestamped:
supplies {
  supplier: "SUP-123",
  material: "MAT-789",
  _valid_from: "2024-01-15",
  _valid_to: null, // Current relationship
  _created_by: "system:sap_sync"
}
This lets us answer questions like "who was the supplier for this material on March 15th?" even if the supplier has changed since.
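A point-in-time lookup over timestamped relationships can be sketched in a few lines of Python. The edge for SUP-123 mirrors the example above; the earlier supplier SUP-100 is invented so there is some history to query. Treating validity intervals as half-open is an assumption of this sketch, not something the ontology is known to prescribe.

```python
from datetime import date

# The SUP-100 edge is invented history; SUP-123 mirrors the example above.
supplies_edges = [
    {"supplier": "SUP-100", "material": "MAT-789",
     "valid_from": date(2023, 1, 1), "valid_to": date(2024, 1, 15)},
    {"supplier": "SUP-123", "material": "MAT-789",
     "valid_from": date(2024, 1, 15), "valid_to": None},
]


def suppliers_as_of(edges, material, on):
    """Suppliers whose 'supplies' edge to material was valid on the given date.

    Intervals are treated as half-open [valid_from, valid_to); a valid_to of
    None means the relationship is still current.
    """
    return [
        e["supplier"] for e in edges
        if e["material"] == material
        and e["valid_from"] <= on
        and (e["valid_to"] is None or on < e["valid_to"])
    ]
```

Calling `suppliers_as_of(supplies_edges, "MAT-789", date(2024, 3, 15))` answers the March 15th question; an earlier date returns the earlier supplier.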
Handling Conflicts
When multiple systems are sources of truth for different aspects of an object, conflicts arise. Our ontology defines ownership:
Material:
  authoritative_sources:
    name: sap_qm                    # SAP is authoritative for material name
    specification: veeva_vault      # Veeva is authoritative for specs
    supplier_info: supplier_portal  # Supplier portal for supplier data
  conflict_resolution:
    default: authoritative_source_wins
    alert_on_mismatch: true
When we detect a conflict (say, a material name differs between SAP and a supplier portal), we log the conflict, use the value from the authoritative source, and alert users for review.
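The "authoritative source wins" rule can be sketched as a small function that picks the owner's value and returns any disagreeing values so they can be logged and surfaced. The function name, its return shape, and the assumption that the authoritative system always reports a value are all choices made for this example.

```python
AUTHORITATIVE_SOURCES = {
    "name": "sap_qm",
    "specification": "veeva_vault",
}


def resolve_property(prop, values_by_system, authoritative=AUTHORITATIVE_SOURCES):
    """Pick the authoritative value for prop and report any mismatch.

    values_by_system maps system name -> value reported by that system.
    Returns (chosen_value, conflicts), where conflicts lists the
    (system, value) pairs that disagree with the authoritative source.
    Assumes the authoritative system is present in values_by_system.
    """
    owner = authoritative[prop]
    chosen = values_by_system[owner]
    conflicts = [
        (system, value)
        for system, value in sorted(values_by_system.items())
        if system != owner and value != chosen
    ]
    return chosen, conflicts


# Invented example values for the material name conflict described above.
value, conflicts = resolve_property(
    "name",
    {"sap_qm": "Sodium Chloride USP", "supplier_portal": "NaCl USP"},
)
```

A non-empty `conflicts` list is the signal to write a log entry and raise an alert, as `alert_on_mismatch: true` suggests.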
AI and the Ontology
Our AI agents work directly with the ontology. When the certificate intake agent processes a document, it:
- Extracts data from the document (using ML models)
- Maps extracted data to ontology objects and properties
- Identifies relationships (this certificate covers these materials)
- Creates or updates ontology objects
- Propagates changes to connected systems via mappings
The ontology provides the AI with:
- Structure: What kinds of things exist and how they relate
- Validation: What values are valid for each property
- Context: What relationships should exist
Without the ontology, the AI would just be extracting fields. With the ontology, the AI understands the domain.
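As one illustration of the validation role, the sketch below checks a set of extracted fields against a tiny slice of a Certificate definition before anything is written. The schema contents (including the allowed status values) and the `validate_extraction` function are hypothetical; the actual agent pipeline is not shown in this post.

```python
from datetime import date

# Hypothetical slice of the Certificate definition; the allowed status
# values are invented for illustration.
CERTIFICATE_SCHEMA = {
    "type": lambda v: isinstance(v, str) and v != "",
    "expiry_date": lambda v: isinstance(v, date),
    "status": lambda v: v in {"Draft", "Approved", "Expired"},
}


def validate_extraction(extracted, schema=CERTIFICATE_SCHEMA):
    """Return a list of problems with fields extracted from a document."""
    errors = []
    for prop, value in extracted.items():
        if prop not in schema:
            errors.append(f"unknown property: {prop}")
        elif not schema[prop](value):
            errors.append(f"invalid value for {prop}: {value!r}")
    for prop in schema:
        if prop not in extracted:
            errors.append(f"missing property: {prop}")
    return errors
```

An extraction that comes back with an empty list can proceed to object creation; anything else can be routed to a human for review.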
Implementation Considerations
Storage: We store the ontology in a hybrid store: PostgreSQL holds object attributes, and a graph layer holds relationships. The graph structure makes relationship traversal fast.
Caching: Common queries are cached with invalidation tied to object updates. Cache invalidation follows relationship paths: updating a Supplier invalidates cached queries involving that Supplier's Materials and Certificates.
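Invalidation along relationship paths can be pictured as a bounded graph walk followed by a dependency check. The sketch below assumes each cached query records the set of object ids it touched; the `RELATED` edges, the id format, and both function names are invented for this example.

```python
# Hypothetical relationship edges: object id -> directly related object ids.
RELATED = {
    "supplier:SUP-456": ["material:MAT-789", "certificate:DOC-123"],
    "material:MAT-789": ["product:P-1"],
}


def affected_objects(changed_id, related=RELATED, max_hops=1):
    """Ids reachable from changed_id within max_hops relationship hops,
    including changed_id itself."""
    seen = {changed_id}
    frontier = [changed_id]
    for _ in range(max_hops):
        next_frontier = []
        for obj in frontier:
            for neighbour in related.get(obj, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen


def invalidate(cache, changed_id, related=RELATED, max_hops=1):
    """Drop cached queries that depend on any affected object.

    cache maps a query key -> (result, set of object ids the result used).
    Returns the sorted list of keys that were removed.
    """
    affected = affected_objects(changed_id, related, max_hops)
    stale = [key for key, (_, deps) in cache.items() if deps & affected]
    for key in stale:
        del cache[key]
    return sorted(stale)
```

The `max_hops` parameter is the trade-off knob: a larger value is safer but evicts more of the cache on every update.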
Performance: For large pharmaceutical companies, we're dealing with tens of thousands of suppliers, hundreds of thousands of certificates, and millions of relationship edges. Index design and query optimization matter.
Multi-tenancy: Each customer has their own ontology namespace. While the base ontology is shared, customers can extend it with custom objects and properties.
Lessons Learned
Start with the domain, not the systems. We initially tried to build the ontology by generalizing from system schemas. This led to a model that was too tied to existing system structures. When we stepped back and modeled the domain directly, we got a cleaner, more useful ontology.
Relationships are as important as objects. Early versions had rich object definitions but thin relationship modeling. This made cross-system queries awkward. We invested heavily in relationship semantics and it paid off.
Plan for extension. Pharmaceutical compliance is complex. We kept discovering new object types and relationships. Building an extensible ontology from the start saved us from painful migrations.
Document the semantics. Every object and property has documentation explaining what it means, not just what type it is. This documentation is essential when onboarding new team members or discussing requirements with customers.
The Payoff
Ontology-driven architecture adds upfront complexity. It takes time to design a good ontology, implement the mapping system, and build the query engine.
But the payoff is substantial:
- Faster integration: Adding a new source system is configuration, not code rewrite
- Powerful queries: Questions that span systems become trivial to answer
- AI readiness: AI agents have a semantic foundation to work from
- Future-proofing: When systems change, we update mappings, not architecture
For anyone building software that connects multiple enterprise systems, I'd strongly recommend investing in semantic modeling early. It's one of the best architectural decisions we made at Cohera.