Best Practices for CorpusIQ

Proven patterns and recommendations for getting the most out of CorpusIQ.

Search Best Practices

Writing Effective Queries

✅ DO: Be Specific

"Find the Q4 2024 budget review document from Finance"

❌ DON’T: Be Too Vague

"Find document"

✅ DO: Use Natural Language

"Show me emails from John about the merger discussion"

❌ DON’T: Use Boolean Operators (unless needed)

"from:john AND subject:merger OR body:acquisition"

Note: Natural language works better with AI. Use operators only for very specific searches.


✅ DO: Specify Time Ranges When Relevant

"Find project updates from last week"
"Show me invoices from Q3 2024"

❌ DON’T: Search Without Context If Time Matters

"Find project updates" (might return years of data)

✅ DO: Name Specific Sources When Appropriate

"Search my Gmail for messages from sarah@example.com"
"Find documents in my OneDrive folder about marketing"

❌ DON’T: Assume AI Knows Your Preferred Source

"Find that email" (which email client?)

Query Patterns That Work Well

Finding Recent Communications

"Show me recent Slack messages about the product launch"
"Find emails from this week about the client presentation"

Looking for Specific Documents

"Find the latest version of the sales deck"
"Show me the contract with Acme Corp"

Researching Topics

"Search all sources for information about our pricing strategy"
"Find everything related to the Phoenix project"

Tracking Decisions

"What did we decide about the new feature in the last meeting?"
"Show me discussions about choosing our tech stack"

Finding Precedents

"Find similar proposals we've sent to healthcare clients"
"Show me past quarterly reports for reference"

Refining Search Results

Start Broad, Then Narrow

1. "Find information about Project X"
2. Review results
3. "Find Project X budget documents from Finance"

Adjust Result Count Based on Need

Quick answer: "Find the latest status report (show 3 results)"
Comprehensive: "Find all docs about API design (show 20 results)"

Use Exact Phrases for Precision

"Find documents mentioning 'quarterly business review'"
"Search for 'phase 2 implementation plan'"

Exclude Irrelevant Content

"Find budget docs but not drafts"
"Show me completed projects not in-progress"

Connector Management

Connection Strategy

✅ DO: Connect Sources You Actually Use

  • Reduces search noise
  • Faster search results
  • Easier to manage

❌ DON’T: Connect Everything “Just in Case”

  • Slower searches
  • More irrelevant results
  • Harder to troubleshoot

✅ DO: Regularly Review Connected Sources

  • Monthly review of active connections
  • Disconnect unused sources
  • Update permissions as needed

❌ DON’T: Set and Forget

  • Stale connections fail over time
  • Unused sources waste resources

✅ DO: Test Connections After Setup

"Search my Gmail for messages from today"
"Find a recent document in my Drive"

❌ DON’T: Assume It’s Working

  • Verify with a test search
  • Check connector status

Connector Prioritization

High Priority (Connect First)

  1. Email: Primary communication source
  2. File Storage: Documents you reference often
  3. Chat/IM: Team discussions and decisions

Medium Priority

  4. Business Tools: CRM, project management, etc.
  5. Code/Issues: For development teams
  6. Financial: For accounting/finance needs

Lower Priority

  7. Specialized Tools: Industry-specific platforms
  8. Archive Systems: Historical data


Managing Permissions

Principle of Least Privilege

  • Only grant necessary permissions
  • Review requested scopes during OAuth
  • Deny unnecessary access

Regular Permission Audits

Monthly checklist:
- Review connected apps in Google/Microsoft accounts
- Verify CorpusIQ still needs access
- Remove if no longer used

Separate Personal and Work

  • Use work accounts for work data
  • Keep personal accounts separate
  • Don’t mix contexts

Security Best Practices

Development vs. Production

Development Environment

# .env for development
CORPUSIQ_DEBUG_MODE=true
CORPUSIQ_LOG_LEVEL=DEBUG
CORPUSIQ_CORS_ALLOW_ORIGINS_CSV=https://chat.openai.com,http://localhost:3000

Production Environment

# .env for production
CORPUSIQ_DEBUG_MODE=false
CORPUSIQ_LOG_LEVEL=INFO
CORPUSIQ_CORS_ALLOW_ORIGINS_CSV=https://chat.openai.com
# Add OAuth configuration
CORPUSIQ_OAUTH_RESOURCE_URL=https://your-domain.com
# ... other OAuth settings

Environment Variables

✅ DO: Use Environment Variables for Config

# Good
import os

api_key = os.getenv("API_KEY")

❌ DON’T: Hardcode Secrets

# Bad
api_key = "abc123xyz"

✅ DO: Use .env.example for Documentation

# .env.example (commit to repo)
CORPUSIQ_DEBUG_MODE=false
CORPUSIQ_OAUTH_RESOURCE_URL=https://example.com

❌ DON’T: Commit .env Files

# .gitignore (always ignore)
.env
.env.local
.env.production

HTTPS and SSL

✅ DO: Always Use HTTPS in Production

  • Use Let’s Encrypt for free SSL
  • Use tunnels (ngrok/Cloudflare) for testing
  • Configure HSTS headers

❌ DON’T: Use HTTP for Production

  • Never deploy without SSL
  • ChatGPT won’t connect to HTTP

Rate Limiting

Default Settings (Good for most)

CORPUSIQ_RATE_LIMIT_REQUESTS_PER_MINUTE=60

High-Traffic Adjustments

# For busy deployments
CORPUSIQ_RATE_LIMIT_REQUESTS_PER_MINUTE=120

# For very low traffic (stricter)
CORPUSIQ_RATE_LIMIT_REQUESTS_PER_MINUTE=30

Multi-Instance Deployments

  • Use Redis for shared rate limiting
  • Prevents per-instance limits from stacking

Logging

✅ DO: Log Important Events

  • Authentication attempts
  • Tool invocations
  • Errors and warnings
  • Rate limit hits

❌ DON’T: Log Sensitive Data

  • Search queries (PII)
  • OAuth tokens
  • User data
  • API keys
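One way to enforce the rule above mechanically is a logging filter that scrubs secret-shaped strings before they reach any handler. This is a sketch, not CorpusIQ's implementation; the regex patterns are illustrative and should be extended for your own secret formats:

```python
import logging
import re

# Mask anything that looks like a bearer token or API key before it is written.
SECRET_PATTERN = re.compile(r"(Bearer\s+\S+|api[_-]?key=\S+)", re.IGNORECASE)


class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = SECRET_PATTERN.sub("[REDACTED]", str(record.msg))
        return True  # never drop the record, only scrub it
```

Attach it with `logger.addFilter(RedactingFilter())` so every record passes through the scrubber.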

Recommended Log Retention

  • Debug logs: 1-7 days
  • Info logs: 30 days
  • Warning/Error logs: 90 days
  • Audit logs: 1+ years
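The retention windows above can be enforced with the standard library's rotating handler; `backupCount` controls how many rotated files survive. The filename and counts here are illustrative:

```python
import logging
import logging.handlers

# Daily rotation; backupCount=30 keeps roughly 30 days of info-level logs.
handler = logging.handlers.TimedRotatingFileHandler(
    "corpusiq-info.log", when="D", interval=1, backupCount=30, delay=True
)
handler.setLevel(logging.INFO)
logging.getLogger("corpusiq").addHandler(handler)
```

A separate handler with a larger `backupCount` (or shipping to an external log store) would cover the 90-day and audit tiers.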

Performance Optimization

Connector Performance

Parallel Searches

  • CorpusIQ searches connectors in parallel
  • No need to optimize query order
  • Slowest connector determines total time
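The fan-out pattern described above can be sketched with `asyncio.gather`; the connector names and delays below are stand-ins for real connector calls:

```python
import asyncio


async def search_connector(name: str, delay: float) -> str:
    # Stand-in for a real connector call; the sleep simulates network latency.
    await asyncio.sleep(delay)
    return f"{name}: done"


async def search_all() -> list[str]:
    # All connectors are queried concurrently, so total latency tracks the
    # slowest connector rather than the sum of all of them.
    return await asyncio.gather(
        search_connector("gmail", 0.05),
        search_connector("drive", 0.10),
        search_connector("slack", 0.02),
    )
```

`asyncio.gather` preserves argument order in its results, so the list maps cleanly back to the connectors that produced it.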

Minimize Connected Sources

Best: 3-5 frequently-used sources
Good: 6-10 sources
Poor: 15+ sources (slower searches)

Monitor Connector Latency

# Log search time per connector
start = time.perf_counter()
results = await gmail.search(query)
elapsed_ms = (time.perf_counter() - start) * 1000
logger.info(f"Gmail search: {elapsed_ms:.0f}ms")

Caching Strategy

Static Content (cache aggressively)

  • Widget HTML
  • OAuth metadata
  • Configuration

Dynamic Content (cache briefly)

  • Search results: 5-10 minutes
  • Connector status: 1-2 minutes

No Caching

  • Real-time data
  • User-specific content

Resource Management

Memory

# Monitor memory usage
ps aux | grep corpusiq

# Set limits with Docker
docker run --memory="1g" corpusiq

Connection Pooling

  • Reuse HTTP connections
  • Configure connector timeouts
  • Implement connection limits

CPU Usage

  • Use async/await for I/O
  • Avoid blocking operations
  • Profile slow operations

Deployment Best Practices

Pre-Deployment Checklist

  • [ ] Environment variables configured
  • [ ] OAuth provider set up
  • [ ] SSL certificate valid
  • [ ] CORS properly configured
  • [ ] Debug mode disabled
  • [ ] Logging configured
  • [ ] Monitoring set up
  • [ ] Backup strategy in place
  • [ ] Rollback plan ready

Deployment Strategy

Blue-Green Deployment (Recommended)

  1. Deploy new version (green)
  2. Test thoroughly
  3. Switch traffic from old (blue) to new (green)
  4. Keep old version for quick rollback

Rolling Deployment

  1. Deploy to one instance
  2. Monitor for issues
  3. Deploy to remaining instances gradually
  4. Roll back if problems occur

Canary Deployment

  1. Deploy to small percentage of traffic (5%)
  2. Monitor metrics
  3. Gradually increase (10%, 25%, 50%, 100%)
  4. Roll back at any sign of issues

Health Checks

Implement Health Endpoints

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "timestamp": datetime.now().isoformat(),
        "version": "0.1.0"
    }

Deep Health Checks

@app.get("/health/detailed")
async def detailed_health():
    return {
        "status": "healthy",
        "connectors": await check_connectors(),
        "database": await check_database(),
        "redis": await check_redis(),
    }

Configure Load Balancer

  • Check /health every 30 seconds
  • Mark unhealthy after 3 failures
  • Remove from rotation
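The mark-unhealthy-after-3-failures policy is usually the load balancer's job, but the same consecutive-failure logic is handy if the service also self-reports health. A sketch with an illustrative class name and configurable threshold:

```python
class HealthTracker:
    """Marks an instance unhealthy after N consecutive failed checks."""

    def __init__(self, failure_threshold: int = 3):
        self.threshold = failure_threshold
        self.failures = 0

    def record(self, check_passed: bool) -> bool:
        """Returns True while the instance should stay in rotation."""
        self.failures = 0 if check_passed else self.failures + 1
        return self.failures < self.threshold
```

A single passing check resets the counter, so transient blips do not pull a healthy instance out of rotation.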

Monitoring

Essential Metrics

  • Request rate (requests/second)
  • Response time (p50, p95, p99)
  • Error rate (percentage)
  • Connector latency
  • Memory usage
  • CPU usage

Alerting Thresholds

Critical:
- Error rate > 5%
- p99 latency > 5000ms
- Memory > 90%

Warning:
- Error rate > 2%
- p95 latency > 2000ms
- Memory > 75%
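If you compute these percentiles yourself rather than reading them from an APM tool, the standard library suffices. A sketch using `statistics.quantiles` (the function name and dict keys are the only assumptions here):

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    # statistics.quantiles with n=100 returns the 99 percentile cut points,
    # so index 94 is p95 and index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {
        "p50": statistics.median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

An alert then reduces to a comparison, e.g. fire critical when `latency_percentiles(window)["p99"] > 5000`.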

Tools

  • Uptime: UptimeRobot, Pingdom
  • APM: New Relic, Datadog, AppDynamics
  • Errors: Sentry, Rollbar
  • Logs: ELK, Splunk, CloudWatch

Development Best Practices

Code Organization

Follow Existing Structure

src/corpusiq/
├── __init__.py
├── __main__.py          # Entry point
├── app.py               # FastAPI app
├── mcp_server.py        # MCP logic
├── settings.py          # Configuration
└── api/
    ├── __init__.py
    ├── app.py           # API routes
    └── ...

Separation of Concerns

  • MCP protocol logic → mcp_server.py
  • HTTP handling → app.py
  • Business logic → separate modules
  • Configuration → settings.py

Code Quality

Use Type Hints

# Good
from typing import List

def search(query: str, max_results: int) -> List[SearchResult]:
    ...

# Better with Pydantic
from pydantic import BaseModel

class SearchArgs(BaseModel):
    query: str
    max_results: int = 5

def search(args: SearchArgs) -> List[SearchResult]:
    ...

Input Validation

# Always validate inputs
from pydantic import BaseModel, Field, field_validator

class ToolInput(BaseModel):
    query: str = Field(..., max_length=1000)
    max_results: int = Field(default=5, ge=1, le=20)

    @field_validator("query")
    @classmethod
    def validate_query(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Query cannot be empty")
        return v.strip()

Error Handling

# Catch specific exceptions
try:
    result = connector.search(query)
except ConnectorError as e:
    logger.error(f"Connector failed: {e}")
    return error_response("Search failed")
except Exception as e:
    logger.exception("Unexpected error")
    return error_response("Internal error")

Testing

Write Tests for New Features

import pytest
from pydantic import ValidationError

def test_corpus_search_validation():
    """Test input validation."""
    # Valid input
    args = CorpusSearchArgs(query="test", max_results=5)
    assert args.query == "test"

    # Invalid input
    with pytest.raises(ValidationError):
        CorpusSearchArgs(query="", max_results=5)

Test Coverage Goals

  • Core functionality: 90%+
  • API endpoints: 80%+
  • Utilities: 70%+

Run Tests Before Committing

# Run all tests
pytest

# Run with coverage
pytest --cov=corpusiq --cov-report=html

# Run specific tests
pytest tests/test_mcp_server.py -v

Documentation

Document Public APIs

def corpus_search(query: str, max_results: int = 5) -> SearchResult:
    """
    Search across all connected data sources.
    
    Args:
        query: Natural language search query
        max_results: Maximum number of results to return (1-20)
        
    Returns:
        SearchResult containing matched items
        
    Raises:
        ValidationError: If inputs are invalid
        ConnectorError: If search fails
    """
    ...

Keep Docs Updated

  • Update README when adding features
  • Document breaking changes
  • Provide migration guides
  • Update API reference

Team Collaboration

For Teams Using CorpusIQ

Establish Search Conventions

Team agreement:
- Use project codes in queries: "Find docs about [PROJECT-123]"
- Tag important docs: Add "TEAM-SHARED" to file names
- Standard date format: YYYY-MM-DD

Share Useful Queries

Create a team wiki with common searches:
- "Find sprint planning notes from last month"
- "Show me open action items from team meetings"
- "Search for API documentation updates"

Connector Governance

Define which sources to connect:
- Required: Company email, shared drive
- Recommended: Team Slack, project management
- Optional: Personal notebooks, archives

For Development Teams

Code Review Checklist

  • [ ] Follows existing code style
  • [ ] Has tests for new functionality
  • [ ] Documentation updated
  • [ ] No security vulnerabilities
  • [ ] Error handling implemented
  • [ ] Logging added for important events

Branching Strategy

main                # Production code
├── develop        # Integration branch
│   ├── feature/x  # Feature branches
│   └── fix/y      # Bug fixes
└── hotfix/z       # Production hotfixes

Commit Messages

# Good
git commit -m "Add Gmail connector with OAuth support"
git commit -m "Fix rate limiting bug when using Redis"
git commit -m "Update docs for new search syntax"

# Bad
git commit -m "fix stuff"
git commit -m "wip"
git commit -m "updates"

Maintenance & Operations

Regular Maintenance Tasks

Daily

  • [ ] Check error logs
  • [ ] Monitor uptime
  • [ ] Review alerts

Weekly

  • [ ] Review performance metrics
  • [ ] Check for security updates
  • [ ] Test backup/restore process

Monthly

  • [ ] Update dependencies
  • [ ] Review connector permissions
  • [ ] Audit access logs
  • [ ] Update documentation

Quarterly

  • [ ] Security audit
  • [ ] Capacity planning
  • [ ] Disaster recovery test
  • [ ] User feedback review

Upgrade Strategy

Before Upgrading

  1. Read changelog
  2. Review breaking changes
  3. Test in staging
  4. Backup configuration
  5. Plan rollback

Upgrading

# 1. Backup current version
git tag v0.1.0-backup

# 2. Pull latest
git pull origin main

# 3. Update dependencies
pip install -e . --upgrade

# 4. Review configuration changes
diff .env.example .env

# 5. Test locally
python -m corpusiq

# 6. Deploy to staging
# (test thoroughly)

# 7. Deploy to production

After Upgrading

  1. Monitor for issues
  2. Check logs for errors
  3. Test key functionality
  4. Update documentation
  5. Notify team

Troubleshooting Workflow

Step 1: Reproduce the Issue

  • Get exact steps to reproduce
  • Note environment (dev/staging/prod)
  • Capture error messages

Step 2: Check Logs

# Recent errors
grep ERROR server.log | tail -20

# Specific request
grep request-id-123 server.log

# Pattern analysis
grep "rate limit" server.log | wc -l

Step 3: Verify Configuration

# Check environment
cat .env

# Verify settings loaded
python -c "from corpusiq.settings import Settings; print(Settings())"

# Test connectivity
curl http://localhost:8000/health

Step 4: Isolate the Problem

  • Disable features one by one
  • Test with minimal config
  • Use debug mode
  • Review recent changes

Step 5: Fix and Verify

  • Implement fix
  • Add test to prevent regression
  • Deploy and monitor
  • Document the issue

Backup and Recovery

What to Backup

  • Configuration files (.env)
  • Custom code modifications
  • User preferences (if stored)
  • Connector configurations
  • SSL certificates

Backup Schedule

Daily: Configuration files
Weekly: Full application state
Monthly: Archived snapshots
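The daily configuration backup can be scripted in a few lines. A minimal sketch using the standard library; the function name, destination directory, and file list are illustrative and should match your deployment:

```python
import tarfile
import time
from pathlib import Path

# Files worth backing up daily; adjust paths for your deployment.
CONFIG_FILES = [".env"]


def backup_config(dest_dir: str = "backups") -> Path:
    Path(dest_dir).mkdir(exist_ok=True)
    archive = Path(dest_dir) / f"config-{time.strftime('%Y%m%d')}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for name in CONFIG_FILES:
            if Path(name).exists():  # skip files that are absent
                tar.add(name)
    return archive
```

Run it from cron (or a scheduled job) and ship the archives off-host so a disk failure does not take the backups with it.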

Test Restores

# Quarterly disaster recovery test
# 1. Restore from backup
# 2. Verify configuration
# 3. Test functionality
# 4. Document any issues

Common Pitfalls to Avoid

❌ Don’t

  1. Commit secrets to Git

    • Use .env files
    • Add .env to .gitignore
  2. Run production without OAuth

    • Always implement authentication
    • Test OAuth flow thoroughly
  3. Ignore rate limits

    • Configure appropriate limits
    • Monitor for abuse
  4. Skip testing after changes

    • Test locally first
    • Use staging environment
    • Monitor after deployment
  5. Forget about logging

    • Log important events
    • Don’t log sensitive data
    • Monitor logs regularly
  6. Neglect documentation

    • Update docs with code
    • Document breaking changes
    • Keep examples current
  7. Over-optimize prematurely

    • Measure first
    • Optimize bottlenecks
    • Test performance impact
  8. Deploy without monitoring

    • Set up alerts
    • Monitor key metrics
    • Have rollback plan

Success Metrics

User Adoption

  • Number of active users
  • Searches per user per day
  • Connector usage rates
  • User satisfaction scores

Technical Health

  • Uptime percentage (target: 99.9%)
  • Average response time (target: <500ms)
  • Error rate (target: <1%)
  • Search success rate (target: >95%)

Business Impact

  • Time saved per user
  • Productivity improvements
  • Support ticket reduction
  • User retention


Remember: These are guidelines, not rules. Adapt them to your specific context and needs.

Have better practices? Contribute to this document! Open a pull request.