Best Practices for CorpusIQ

Proven patterns and recommendations for getting the most out of CorpusIQ.

Search Best Practices

Writing Effective Queries

✅ DO: Be Specific

"Find the Q4 2024 budget review document from Finance"

❌ DON’T: Be Too Vague

"Find document"

✅ DO: Use Natural Language

"Show me emails from John about the merger discussion"

❌ DON’T: Use Boolean Operators (unless needed)

"from:john AND subject:merger OR body:acquisition"

Note: Natural language works better with AI. Use operators only for very specific searches.


✅ DO: Specify Time Ranges When Relevant

"Find project updates from last week"
"Show me invoices from Q3 2024"

❌ DON’T: Search Without Context If Time Matters

"Find project updates" (might return years of data)

✅ DO: Name Specific Sources When Appropriate

"Search my Gmail for messages from sarah@example.com"
"Find documents in my OneDrive folder about marketing"

❌ DON’T: Assume AI Knows Your Preferred Source

"Find that email" (which email client?)

Query Patterns That Work Well

Finding Recent Communications

"Show me recent Slack messages about the product launch"
"Find emails from this week about the client presentation"

Looking for Specific Documents

"Find the latest version of the sales deck"
"Show me the contract with Acme Corp"

Researching Topics

"Search all sources for information about our pricing strategy"
"Find everything related to the Phoenix project"

Tracking Decisions

"What did we decide about the new feature in the last meeting?"
"Show me discussions about choosing our tech stack"

Finding Precedents

"Find similar proposals we've sent to healthcare clients"
"Show me past quarterly reports for reference"

Refining Search Results

Start Broad, Then Narrow

1. "Find information about Project X"
2. Review results
3. "Find Project X budget documents from Finance"

Adjust Result Count Based on Need

Quick answer: "Find the latest status report (show 3 results)"
Comprehensive: "Find all docs about API design (show 20 results)"

Use Exact Phrases for Precision

"Find documents mentioning 'quarterly business review'"
"Search for 'phase 2 implementation plan'"

Exclude Irrelevant Content

"Find budget docs but not drafts"
"Show me completed projects not in-progress"

Connector Management

Connection Strategy

✅ DO: Connect Sources You Actually Use

  • Reduces search noise
  • Faster search results
  • Easier to manage

❌ DON’T: Connect Everything “Just in Case”

  • Slower searches
  • More irrelevant results
  • Harder to troubleshoot

✅ DO: Regularly Review Connected Sources

  • Monthly review of active connections
  • Disconnect unused sources
  • Update permissions as needed

❌ DON’T: Set and Forget

  • Stale connections fail over time
  • Unused sources waste resources

✅ DO: Test Connections After Setup

"Search my Gmail for messages from today"
"Find a recent document in my Drive"

❌ DON’T: Assume It’s Working

  • Verify with a test search
  • Check connector status

Connector Prioritization

High Priority (Connect First)

  1. Email: Primary communication source
  2. File Storage: Documents you reference often
  3. Chat/IM: Team discussions and decisions

Medium Priority

  4. Business Tools: CRM, project management, etc.
  5. Code/Issues: For development teams
  6. Financial: For accounting/finance needs

Lower Priority

  7. Specialized Tools: Industry-specific platforms
  8. Archive Systems: Historical data


Managing Permissions

Principle of Least Privilege

  • Only grant necessary permissions
  • Review requested scopes during OAuth
  • Deny unnecessary access

Regular Permission Audits

Monthly checklist:
- Review connected apps in Google/Microsoft accounts
- Verify CorpusIQ still needs access
- Remove if no longer used

Separate Personal and Work

  • Use work accounts for work data
  • Keep personal accounts separate
  • Don’t mix contexts

Security Best Practices

Development vs. Production

Development Environment

# .env for development
CORPUSIQ_DEBUG_MODE=true
CORPUSIQ_LOG_LEVEL=DEBUG
CORPUSIQ_CORS_ALLOW_ORIGINS_CSV=https://chat.openai.com,http://localhost:3000

Production Environment

# .env for production
CORPUSIQ_DEBUG_MODE=false
CORPUSIQ_LOG_LEVEL=INFO
CORPUSIQ_CORS_ALLOW_ORIGINS_CSV=https://chat.openai.com
# Add OAuth configuration
CORPUSIQ_OAUTH_RESOURCE_URL=https://your-domain.com
# ... other OAuth settings

Environment Variables

✅ DO: Use Environment Variables for Config

# Good
import os

api_key = os.getenv("API_KEY")

❌ DON’T: Hardcode Secrets

# Bad
api_key = "abc123xyz"

✅ DO: Use .env.example for Documentation

# .env.example (commit to repo)
CORPUSIQ_DEBUG_MODE=false
CORPUSIQ_OAUTH_RESOURCE_URL=https://example.com

❌ DON’T: Commit .env Files

# .gitignore (always ignore)
.env
.env.local
.env.production

HTTPS and SSL

✅ DO: Always Use HTTPS in Production

  • Use Let’s Encrypt for free SSL
  • Use tunnels (ngrok/Cloudflare) for testing
  • Configure HSTS headers

❌ DON’T: Use HTTP for Production

  • Never deploy without SSL
  • ChatGPT won’t connect to HTTP

Rate Limiting

Default Settings (Good for most)

CORPUSIQ_RATE_LIMIT_REQUESTS_PER_MINUTE=60

High-Traffic Adjustments

# For busy deployments
CORPUSIQ_RATE_LIMIT_REQUESTS_PER_MINUTE=120

# For very low traffic (stricter)
CORPUSIQ_RATE_LIMIT_REQUESTS_PER_MINUTE=30

Multi-Instance Deployments

  • Use Redis for shared rate limiting
  • Prevents per-instance limits from stacking

Logging

✅ DO: Log Important Events

  • Authentication attempts
  • Tool invocations
  • Errors and warnings
  • Rate limit hits

❌ DON’T: Log Sensitive Data

  • Search queries (PII)
  • OAuth tokens
  • User data
  • API keys
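One way to enforce the rule above mechanically is a logging filter that scrubs secret-shaped strings before they reach any handler. This is a sketch, not CorpusIQ's implementation; the regex patterns are illustrative and should be extended for your own secret formats:

```python
import logging
import re

# Mask anything that looks like a bearer token or API key before it is written.
SECRET_PATTERN = re.compile(r"(Bearer\s+\S+|api[_-]?key=\S+)", re.IGNORECASE)


class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = SECRET_PATTERN.sub("[REDACTED]", str(record.msg))
        return True  # never drop the record, only scrub it
```

Attach it with `logger.addFilter(RedactingFilter())` so every record passes through the scrubber.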

Recommended Log Retention

  • Debug logs: 1-7 days
  • Info logs: 30 days
  • Warning/Error logs: 90 days
  • Audit logs: 1+ years
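The retention windows above can be enforced with the standard library's rotating handler; `backupCount` controls how many rotated files survive. The filename and counts here are illustrative:

```python
import logging
import logging.handlers

# Daily rotation; backupCount=30 keeps roughly 30 days of info-level logs.
handler = logging.handlers.TimedRotatingFileHandler(
    "corpusiq-info.log", when="D", interval=1, backupCount=30, delay=True
)
handler.setLevel(logging.INFO)
logging.getLogger("corpusiq").addHandler(handler)
```

A separate handler with a larger `backupCount` (or shipping to an external log store) would cover the 90-day and audit tiers.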

Performance Optimization

Connector Performance

Parallel Searches

  • CorpusIQ searches connectors in parallel
  • No need to optimize query order
  • Slowest connector determines total time
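The fan-out pattern described above can be sketched with `asyncio.gather`; the connector names and delays below are stand-ins for real connector calls:

```python
import asyncio


async def search_connector(name: str, delay: float) -> str:
    # Stand-in for a real connector call; the sleep simulates network latency.
    await asyncio.sleep(delay)
    return f"{name}: done"


async def search_all() -> list[str]:
    # All connectors are queried concurrently, so total latency tracks the
    # slowest connector rather than the sum of all of them.
    return await asyncio.gather(
        search_connector("gmail", 0.05),
        search_connector("drive", 0.10),
        search_connector("slack", 0.02),
    )
```

`asyncio.gather` preserves argument order in its results, so the list maps cleanly back to the connectors that produced it.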

Minimize Connected Sources

Best: 3-5 frequently-used sources
Good: 6-10 sources
Poor: 15+ sources (slower searches)

Monitor Connector Latency

# Log search time per connector
start = time.perf_counter()
results = await gmail.search(query)
elapsed_ms = (time.perf_counter() - start) * 1000
logger.info(f"Gmail search: {elapsed_ms:.0f}ms")

Caching Strategy

Static Content (cache aggressively)

  • Widget HTML
  • OAuth metadata
  • Configuration

Dynamic Content (cache briefly)

  • Search results: 5-10 minutes
  • Connector status: 1-2 minutes

No Caching

  • Real-time data
  • User-specific content

Resource Management

Memory

# Monitor memory usage
ps aux | grep corpusiq

# Set limits with Docker
docker run --memory="1g" corpusiq

Connection Pooling

  • Reuse HTTP connections
  • Configure connector timeouts
  • Implement connection limits

CPU Usage

  • Use async/await for I/O
  • Avoid blocking operations
  • Profile slow operations

Deployment Best Practices

Pre-Deployment Checklist

  • [ ] Environment variables configured
  • [ ] OAuth provider set up
  • [ ] SSL certificate valid
  • [ ] CORS properly configured
  • [ ] Debug mode disabled
  • [ ] Logging configured
  • [ ] Monitoring set up
  • [ ] Backup strategy in place
  • [ ] Rollback plan ready

Deployment Strategy

Blue-Green Deployment (Recommended)

  1. Deploy new version (green)
  2. Test thoroughly
  3. Switch traffic from old (blue) to new (green)
  4. Keep old version for quick rollback

Rolling Deployment

  1. Deploy to one instance
  2. Monitor for issues
  3. Deploy to remaining instances gradually
  4. Roll back if problems occur

Canary Deployment

  1. Deploy to small percentage of traffic (5%)
  2. Monitor metrics
  3. Gradually increase (10%, 25%, 50%, 100%)
  4. Roll back at any sign of issues

Health Checks

Implement Health Endpoints

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "timestamp": datetime.now().isoformat(),
        "version": "0.1.0"
    }

Deep Health Checks

@app.get("/health/detailed")
async def detailed_health():
    return {
        "status": "healthy",
        "connectors": await check_connectors(),
        "database": await check_database(),
        "redis": await check_redis(),
    }

Configure Load Balancer

  • Check /health every 30 seconds
  • Mark unhealthy after 3 failures
  • Remove from rotation
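The mark-unhealthy-after-3-failures policy is usually the load balancer's job, but the same consecutive-failure logic is handy if the service also self-reports health. A sketch with an illustrative class name and configurable threshold:

```python
class HealthTracker:
    """Marks an instance unhealthy after N consecutive failed checks."""

    def __init__(self, failure_threshold: int = 3):
        self.threshold = failure_threshold
        self.failures = 0

    def record(self, check_passed: bool) -> bool:
        """Returns True while the instance should stay in rotation."""
        self.failures = 0 if check_passed else self.failures + 1
        return self.failures < self.threshold
```

A single passing check resets the counter, so transient blips do not pull a healthy instance out of rotation.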

Monitoring

Essential Metrics

  • Request rate (requests/second)
  • Response time (p50, p95, p99)
  • Error rate (percentage)
  • Connector latency
  • Memory usage
  • CPU usage

Alerting Thresholds

Critical:
- Error rate > 5%
- p99 latency > 5000ms
- Memory > 90%

Warning:
- Error rate > 2%
- p95 latency > 2000ms
- Memory > 75%
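If you compute these percentiles yourself rather than reading them from an APM tool, the standard library suffices. A sketch using `statistics.quantiles` (the function name and dict keys are the only assumptions here):

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    # statistics.quantiles with n=100 returns the 99 percentile cut points,
    # so index 94 is p95 and index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {
        "p50": statistics.median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

An alert then reduces to a comparison, e.g. fire critical when `latency_percentiles(window)["p99"] > 5000`.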

Tools

  • Uptime: UptimeRobot, Pingdom
  • APM: New Relic, Datadog, AppDynamics
  • Errors: Sentry, Rollbar
  • Logs: ELK, Splunk, CloudWatch

Development Best Practices

Code Organization

Follow Existing Structure

src/corpusiq/
├── __init__.py
├── __main__.py          # Entry point
├── app.py               # FastAPI app
├── mcp_server.py        # MCP logic
├── settings.py          # Configuration
└── api/
    ├── __init__.py
    ├── app.py           # API routes
    └── ...

Separation of Concerns

  • MCP protocol logic → mcp_server.py
  • HTTP handling → app.py
  • Business logic → separate modules
  • Configuration → settings.py

Code Quality

Use Type Hints

# Good
from typing import List

def search(query: str, max_results: int) -> List[SearchResult]:
    ...

# Better with Pydantic
from pydantic import BaseModel

class SearchArgs(BaseModel):
    query: str
    max_results: int = 5

def search(args: SearchArgs) -> List[SearchResult]:
    ...

Input Validation

# Always validate inputs
from pydantic import BaseModel, Field, field_validator

class ToolInput(BaseModel):
    query: str = Field(..., max_length=1000)
    max_results: int = Field(default=5, ge=1, le=20)

    @field_validator("query")
    @classmethod
    def validate_query(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Query cannot be empty")
        return v.strip()

Error Handling

# Catch specific exceptions
try:
    result = connector.search(query)
except ConnectorError as e:
    logger.error(f"Connector failed: {e}")
    return error_response("Search failed")
except Exception as e:
    logger.exception("Unexpected error")
    return error_response("Internal error")

Testing

Write Tests for New Features

import pytest
from pydantic import ValidationError

def test_corpus_search_validation():
    """Test input validation."""
    # Valid input
    args = CorpusSearchArgs(query="test", max_results=5)
    assert args.query == "test"

    # Invalid input
    with pytest.raises(ValidationError):
        CorpusSearchArgs(query="", max_results=5)

Test Coverage Goals

  • Core functionality: 90%+
  • API endpoints: 80%+
  • Utilities: 70%+

Run Tests Before Committing

# Run all tests
pytest

# Run with coverage
pytest --cov=corpusiq --cov-report=html

# Run specific tests
pytest tests/test_mcp_server.py -v

Documentation

Document Public APIs

def corpus_search(query: str, max_results: int = 5) -> SearchResult:
    """
    Search across all connected data sources.
    
    Args:
        query: Natural language search query
        max_results: Maximum number of results to return (1-20)
        
    Returns:
        SearchResult containing matched items
        
    Raises:
        ValidationError: If inputs are invalid
        ConnectorError: If search fails
    """
    ...

Keep Docs Updated

  • Update README when adding features
  • Document breaking changes
  • Provide migration guides
  • Update API reference

Team Collaboration

For Teams Using CorpusIQ

Establish Search Conventions

Team agreement:
- Use project codes in queries: "Find docs about [PROJECT-123]"
- Tag important docs: Add "TEAM-SHARED" to file names
- Standard date format: YYYY-MM-DD

Share Useful Queries

Create a team wiki with common searches:
- "Find sprint planning notes from last month"
- "Show me open action items from team meetings"
- "Search for API documentation updates"

Connector Governance

Define which sources to connect:
- Required: Company email, shared drive
- Recommended: Team Slack, project management
- Optional: Personal notebooks, archives

For Development Teams

Code Review Checklist

  • [ ] Follows existing code style
  • [ ] Has tests for new functionality
  • [ ] Documentation updated
  • [ ] No security vulnerabilities
  • [ ] Error handling implemented
  • [ ] Logging added for important events

Branching Strategy

main                # Production code
├── develop        # Integration branch
│   ├── feature/x  # Feature branches
│   └── fix/y      # Bug fixes
└── hotfix/z       # Production hotfixes

Commit Messages

# Good
git commit -m "Add Gmail connector with OAuth support"
git commit -m "Fix rate limiting bug when using Redis"
git commit -m "Update docs for new search syntax"

# Bad
git commit -m "fix stuff"
git commit -m "wip"
git commit -m "updates"

Maintenance & Operations

Regular Maintenance Tasks

Daily

  • [ ] Check error logs
  • [ ] Monitor uptime
  • [ ] Review alerts

Weekly

  • [ ] Review performance metrics
  • [ ] Check for security updates
  • [ ] Test backup/restore process

Monthly

  • [ ] Update dependencies
  • [ ] Review connector permissions
  • [ ] Audit access logs
  • [ ] Update documentation

Quarterly

  • [ ] Security audit
  • [ ] Capacity planning
  • [ ] Disaster recovery test
  • [ ] User feedback review

Upgrade Strategy

Before Upgrading

  1. Read changelog
  2. Review breaking changes
  3. Test in staging
  4. Backup configuration
  5. Plan rollback

Upgrading

# 1. Backup current version
git tag v0.1.0-backup

# 2. Pull latest
git pull origin main

# 3. Update dependencies
pip install -e . --upgrade

# 4. Review configuration changes
diff .env.example .env

# 5. Test locally
python -m corpusiq

# 6. Deploy to staging
# (test thoroughly)

# 7. Deploy to production

After Upgrading

  1. Monitor for issues
  2. Check logs for errors
  3. Test key functionality
  4. Update documentation
  5. Notify team

Troubleshooting Workflow

Step 1: Reproduce the Issue

  • Get exact steps to reproduce
  • Note environment (dev/staging/prod)
  • Capture error messages

Step 2: Check Logs

# Recent errors
grep ERROR server.log | tail -20

# Specific request
grep request-id-123 server.log

# Pattern analysis
grep "rate limit" server.log | wc -l

Step 3: Verify Configuration

# Check environment
cat .env

# Verify settings loaded
python -c "from corpusiq.settings import Settings; print(Settings())"

# Test connectivity
curl http://localhost:8000/health

Step 4: Isolate the Problem

  • Disable features one by one
  • Test with minimal config
  • Use debug mode
  • Review recent changes

Step 5: Fix and Verify

  • Implement fix
  • Add test to prevent regression
  • Deploy and monitor
  • Document the issue

Backup and Recovery

What to Backup

  • Configuration files (.env)
  • Custom code modifications
  • User preferences (if stored)
  • Connector configurations
  • SSL certificates

Backup Schedule

Daily: Configuration files
Weekly: Full application state
Monthly: Archived snapshots
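The daily configuration backup can be scripted in a few lines. A minimal sketch using the standard library; the function name, destination directory, and file list are illustrative and should match your deployment:

```python
import tarfile
import time
from pathlib import Path

# Files worth backing up daily; adjust paths for your deployment.
CONFIG_FILES = [".env"]


def backup_config(dest_dir: str = "backups") -> Path:
    Path(dest_dir).mkdir(exist_ok=True)
    archive = Path(dest_dir) / f"config-{time.strftime('%Y%m%d')}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for name in CONFIG_FILES:
            if Path(name).exists():  # skip files that are absent
                tar.add(name)
    return archive
```

Run it from cron (or a scheduled job) and ship the archives off-host so a disk failure does not take the backups with it.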

Test Restores

# Quarterly disaster recovery test
# 1. Restore from backup
# 2. Verify configuration
# 3. Test functionality
# 4. Document any issues

Common Pitfalls to Avoid

❌ Don’t

  1. Commit secrets to Git

    • Use .env files
    • Add .env to .gitignore
  2. Run production without OAuth

    • Always implement authentication
    • Test OAuth flow thoroughly
  3. Ignore rate limits

    • Configure appropriate limits
    • Monitor for abuse
  4. Skip testing after changes

    • Test locally first
    • Use staging environment
    • Monitor after deployment
  5. Forget about logging

    • Log important events
    • Don’t log sensitive data
    • Monitor logs regularly
  6. Neglect documentation

    • Update docs with code
    • Document breaking changes
    • Keep examples current
  7. Over-optimize prematurely

    • Measure first
    • Optimize bottlenecks
    • Test performance impact
  8. Deploy without monitoring

    • Set up alerts
    • Monitor key metrics
    • Have rollback plan

Success Metrics

User Adoption

  • Number of active users
  • Searches per user per day
  • Connector usage rates
  • User satisfaction scores

Technical Health

  • Uptime percentage (target: 99.9%)
  • Average response time (target: <500ms)
  • Error rate (target: <1%)
  • Search success rate (target: >95%)

Business Impact

  • Time saved per user
  • Productivity improvements
  • Support ticket reduction
  • User retention


Remember: These are guidelines, not rules. Adapt them to your specific context and needs.

Have better practices? Contribute to this document! Open a pull request.