Here's the complete step-by-step implementation:
composer require pgvector/pgvector
# No LLM SDKs required: the LlmService below calls Groq and Anthropic
# directly through Laravel's built-in Http client
pip install sentence-transformers # on your Python sidecar
# .env
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile
ANTHROPIC_API_KEY=sk-ant-...
EMBEDDING_SERVICE_URL=http://localhost:8001 # your Python sidecar
VECTOR_DIMENSIONS=1024
// database/migrations/xxxx_create_document_chunks_table.php
public function up(): void
{
    DB::statement('CREATE EXTENSION IF NOT EXISTS vector');

    Schema::create('document_chunks', function (Blueprint $table) {
        $table->id();
        $table->foreignId('document_id')->constrained()->cascadeOnDelete();
        $table->string('document_title');
        $table->text('content');
        $table->integer('chunk_index');
        $table->integer('token_count');
        $table->json('metadata')->nullable();
        $table->timestamps();
    });

    // Add the vector column separately (pgvector syntax)
    DB::statement('ALTER TABLE document_chunks ADD COLUMN embedding vector(1024)');

    // HNSW index — best for production (fast approximate search)
    DB::statement('
        CREATE INDEX document_chunks_embedding_hnsw
        ON document_chunks
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    ');
}
bge-m3 runs best as a small Python service. This keeps Laravel clean.
# embedding_service/main.py
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
from typing import List
import uvicorn

app = FastAPI()
model = SentenceTransformer("BAAI/bge-m3")

class EmbedRequest(BaseModel):
    texts: List[str]
    is_query: bool = False

@app.post("/embed")
def embed(req: EmbedRequest):
    # Prefix queries with a retrieval instruction; passages are embedded as-is
    instruction = "Represent this sentence for searching relevant passages: "
    inputs = [instruction + t for t in req.texts] if req.is_query else req.texts
    vectors = model.encode(inputs, normalize_embeddings=True).tolist()
    return {"embeddings": vectors}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8001)
# Run it
pip install fastapi uvicorn sentence-transformers
python embedding_service/main.py
On GPU: pass `device="cuda"` to `SentenceTransformer(...)`. Inference drops from ~800ms to ~40ms per batch.
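The sidecar's wire contract is plain JSON. A minimal payload builder (illustrative sketch only; the field names are the ones defined by `EmbedRequest` above):

```python
import json

def build_embed_payload(texts: list[str], is_query: bool = False) -> str:
    """Body that the Laravel EmbeddingService POSTs to /embed.
    The sidecar replies with {"embeddings": [[...], ...]} — one
    1024-dim vector per input text."""
    return json.dumps({"texts": texts, "is_query": is_query})

# A query gets is_query=True so the service applies the retrieval instruction
body = build_embed_payload(["What is the refund policy?"], is_query=True)
```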
// app/Services/EmbeddingService.php
namespace App\Services;

use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Http;

class EmbeddingService
{
    private string $url;

    public function __construct()
    {
        // env() belongs in config files only; config/services.php supplies the default
        $this->url = config('services.embedding.url');
    }

    public function embedPassages(array $texts): array
    {
        return $this->call($texts, isQuery: false);
    }

    public function embedQuery(string $text): array
    {
        $cacheKey = 'emb_query_' . md5($text);

        return Cache::remember($cacheKey, 3600, fn () =>
            $this->call([$text], isQuery: true)[0]
        );
    }

    private function call(array $texts, bool $isQuery): array
    {
        $response = Http::timeout(30)->post("{$this->url}/embed", [
            'texts' => $texts,
            'is_query' => $isQuery,
        ]);

        // throw() surfaces sidecar failures instead of silently returning null
        return $response->throw()->json('embeddings');
    }
}
// app/Services/DocumentChunker.php
namespace App\Services;

class DocumentChunker
{
    public function __construct(
        private int $chunkSize = 512,
        private int $overlap = 64,
    ) {}

    public function chunk(string $text): array
    {
        // Split on sentence boundaries first, then merge up to chunkSize tokens
        $sentences = preg_split('/(?<=[.!?؟])\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
        $chunks = [];
        $buffer = [];
        $tokenCount = 0;

        foreach ($sentences as $sentence) {
            $tokens = $this->estimateTokens($sentence);

            if ($tokenCount + $tokens > $this->chunkSize && !empty($buffer)) {
                $chunks[] = implode(' ', $buffer);

                // Keep overlap: carry the last N words into the next chunk
                $words = explode(' ', implode(' ', $buffer));
                $buffer = array_slice($words, -$this->overlap);
                $tokenCount = count($buffer);
            }

            $buffer[] = $sentence;
            $tokenCount += $tokens;
        }

        if (!empty($buffer)) {
            $chunks[] = implode(' ', $buffer);
        }

        return $chunks;
    }

    private function estimateTokens(string $text): int
    {
        // Blended estimate (~4 chars/token for Latin text, ~2 for Arabic)
        return (int) (mb_strlen($text) / 3.5);
    }
}
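The chunking logic is easier to sanity-check in isolation. Here is a Python sketch of the same algorithm (same sentence split, same ~3.5 chars-per-token estimate, same word-level overlap) you can run against sample text before wiring it into the job:

```python
import re

def chunk(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Mirror of DocumentChunker::chunk — pack sentences up to chunk_size
    estimated tokens, carrying the last `overlap` words into the next chunk."""
    def estimate_tokens(s: str) -> int:
        return int(len(s) / 3.5)  # blended Latin/Arabic estimate

    sentences = [s for s in re.split(r"(?<=[.!?\u061F])\s+", text.strip()) if s]
    chunks, buffer, token_count = [], [], 0

    for sentence in sentences:
        tokens = estimate_tokens(sentence)
        if token_count + tokens > chunk_size and buffer:
            chunks.append(" ".join(buffer))
            words = " ".join(buffer).split(" ")
            buffer = words[-overlap:]       # overlap: last N words
            token_count = len(buffer)
        buffer.append(sentence)
        token_count += tokens

    if buffer:
        chunks.append(" ".join(buffer))
    return chunks
```

Running it on repetitive sample text confirms each chunk after the first begins with the trailing words of the previous one, so no sentence-boundary context is lost between chunks.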
// app/Services/VectorSearchService.php
namespace App\Services;

use Illuminate\Support\Facades\DB;

class VectorSearchService
{
    public function search(array $queryEmbedding, int $topK = 5, ?int $documentId = null): array
    {
        $vector = '[' . implode(',', $queryEmbedding) . ']';

        $query = DB::table('document_chunks')
            ->select([
                'id', 'document_id', 'document_title',
                'content', 'chunk_index', 'metadata',
                DB::raw("1 - (embedding <=> '{$vector}'::vector) AS similarity"),
            ])
            // whereRaw, not where(): the expression is SQL, not a column name
            ->whereRaw("1 - (embedding <=> '{$vector}'::vector) >= ?", [0.35]);

        if ($documentId) {
            $query->where('document_id', $documentId);
        }

        return $query
            ->orderByDesc('similarity')
            ->limit($topK)
            ->get()
            ->toArray();
    }
}
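A note on the similarity math: pgvector's `<=>` operator returns cosine *distance*, so the query exposes `1 - distance` as similarity and filters at 0.35. Because the sidecar normalizes embeddings (`normalize_embeddings=True`), cosine similarity also reduces to a plain dot product. A small self-contained check of both identities (toy 3-dim vectors, not real embeddings):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def normalize(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

query = normalize([0.2, 0.9, 0.1])
passage = normalize([0.25, 0.85, 0.05])

similarity = 1 - cosine_distance(query, passage)  # what the SQL SELECT exposes
dot = sum(x * y for x, y in zip(query, passage))  # equal, since both are unit-length
keep = similarity >= 0.35                          # the WHERE-clause threshold
```

The 0.35 floor is a tunable guess, not a universal constant; inspect real similarity scores from your corpus before settling on it.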
// app/Services/LlmService.php
namespace App\Services;

use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Log;

class LlmService
{
    public function complete(string $systemPrompt, array $messages, int $maxTokens = 1024): string
    {
        try {
            return $this->groq($systemPrompt, $messages, $maxTokens);
        } catch (\Throwable $e) {
            Log::warning('Groq failed, falling back to Claude', ['error' => $e->getMessage()]);

            return $this->claude($systemPrompt, $messages, $maxTokens);
        }
    }

    private function groq(string $system, array $messages, int $maxTokens): string
    {
        $response = Http::withToken(config('services.groq.key'))
            ->timeout(20)
            ->post('https://api.groq.com/openai/v1/chat/completions', [
                'model' => config('services.groq.model', 'llama-3.3-70b-versatile'),
                'max_tokens' => $maxTokens,
                'temperature' => 0.2,
                'messages' => array_merge(
                    [['role' => 'system', 'content' => $system]],
                    $messages
                ),
            ]);

        // Without throw(), a 4xx/5xx returns null and the fallback never fires
        return $response->throw()->json('choices.0.message.content');
    }

    private function claude(string $system, array $messages, int $maxTokens): string
    {
        $response = Http::withHeaders([
            'x-api-key' => config('services.anthropic.key'),
            'anthropic-version' => '2023-06-01',
        ])->timeout(30)->post('https://api.anthropic.com/v1/messages', [
            'model' => 'claude-haiku-4-5-20251001',
            'max_tokens' => $maxTokens,
            'system' => $system,
            'messages' => $messages,
        ]);

        return $response->throw()->json('content.0.text');
    }
}
// app/Services/PromptBuilder.php
namespace App\Services;

class PromptBuilder
{
    public function system(): string
    {
        return <<<PROMPT
        You are a precise, helpful assistant. Answer ONLY using the provided context chunks.
        If the answer is not in the context, say "I don't have enough information to answer this."
        Always cite which source (document title + chunk) you used.
        Respond in the same language as the user's question (Arabic or English).
        PROMPT;
    }

    public function buildUserMessage(string $question, array $chunks): string
    {
        $context = collect($chunks)
            ->map(fn ($c, $i) =>
                "[{$i}] Source: {$c->document_title}\n{$c->content}\nSimilarity: " . round($c->similarity, 3)
            )
            ->implode("\n\n---\n\n");

        return <<<MSG
        Context:

        {$context}

        ---

        Question: {$question}
        MSG;
    }
}
// app/Jobs/IngestDocumentJob.php
namespace App\Jobs;

use App\Models\Document;
use App\Services\DocumentChunker;
use App\Services\EmbeddingService;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Support\Facades\DB;

class IngestDocumentJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public function __construct(public Document $document) {}

    public function handle(DocumentChunker $chunker, EmbeddingService $embedder): void
    {
        $text = $this->document->raw_text;
        $chunks = $chunker->chunk($text);

        // Batch embed for efficiency
        $embeddings = $embedder->embedPassages($chunks);

        $records = [];
        foreach ($chunks as $i => $chunk) {
            $records[] = [
                'document_id' => $this->document->id,
                'document_title' => $this->document->title,
                'content' => $chunk,
                'chunk_index' => $i,
                'token_count' => (int) (mb_strlen($chunk) / 3.5),
                // pgvector accepts the '[x,y,z]' text form on insert;
                // the pgvector-php package also offers an Eloquent Vector cast
                'embedding' => '[' . implode(',', $embeddings[$i]) . ']',
                'metadata' => json_encode(['lang' => $this->document->language]),
                'created_at' => now(),
                'updated_at' => now(),
            ];
        }

        DB::table('document_chunks')->insert($records);

        $this->document->update(['status' => 'indexed']);
    }
}
// app/Http/Controllers/RagController.php
namespace App\Http\Controllers;

use App\Services\{EmbeddingService, LlmService, PromptBuilder, VectorSearchService};
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Cache;

class RagController extends Controller
{
    public function __construct(
        private EmbeddingService $embedder,
        private VectorSearchService $search,
        private PromptBuilder $prompt,
        private LlmService $llm,
    ) {}

    public function ask(Request $request)
    {
        $request->validate(['question' => 'required|string|max:2000']);
        $question = $request->input('question');

        $cacheKey = 'rag_answer_' . md5($question);
        if ($cached = Cache::get($cacheKey)) {
            return response()->json(array_merge($cached, ['cached' => true]));
        }

        $embedding = $this->embedder->embedQuery($question);
        $chunks = $this->search->search($embedding, topK: 5);

        if (empty($chunks)) {
            return response()->json(['answer' => 'No relevant information found.', 'sources' => []]);
        }

        $userMessage = $this->prompt->buildUserMessage($question, $chunks);
        $answer = $this->llm->complete(
            $this->prompt->system(),
            [['role' => 'user', 'content' => $userMessage]]
        );

        $result = [
            'answer' => $answer,
            'sources' => collect($chunks)->map(fn ($c) => [
                'title' => $c->document_title,
                'chunk' => $c->chunk_index,
                'similarity' => round($c->similarity, 3),
                'excerpt' => mb_substr($c->content, 0, 200) . '…',
            ])->toArray(),
        ];

        Cache::put($cacheKey, $result, 3600);

        return response()->json($result);
    }
}
// routes/api.php
Route::post('/ask', [RagController::class, 'ask'])->middleware('throttle:30,1');
Route::post('/documents', [DocumentController::class, 'store']);
// config/services.php (additions)
'groq' => [
    'key' => env('GROQ_API_KEY'),
    'model' => env('GROQ_MODEL', 'llama-3.3-70b-versatile'),
],

'anthropic' => [
    'key' => env('ANTHROPIC_API_KEY'),
],

'embedding' => [
    'url' => env('EMBEDDING_SERVICE_URL', 'http://localhost:8001'),
],
app/
├── Http/Controllers/
│ ├── RagController.php
│ └── DocumentController.php
├── Jobs/
│ └── IngestDocumentJob.php
├── Models/
│ ├── Document.php
│ └── DocumentChunk.php
├── Services/
│ ├── EmbeddingService.php
│ ├── DocumentChunker.php
│ ├── VectorSearchService.php
│ ├── PromptBuilder.php
│ └── LlmService.php
embedding_service/
└── main.py ← Python FastAPI bge-m3
database/migrations/
└── xxxx_create_document_chunks_table.php
| Model | Cost (input) | Speed | Quality | Best for |
|---|---|---|---|---|
| llama-3.3-70b-versatile (Groq) | $0.00008/1K | ~200 tok/s | Good | Default, high volume |
| claude-haiku-4-5 | $0.001/1K | Fast | Very good | Fallback, complex reasoning |
| gemini-2.0-flash | $0.0001/1K | Fast | Very good | Alternative if on Google Cloud |
The LlmService above already implements the Groq-primary → Claude-fallback pattern. llama-3.3-70b-versatile on Groq offers one of the best cost/speed ratios available right now — roughly an order of magnitude cheaper than GPT-4o-mini and significantly faster. For complex multi-step reasoning, or when Groq is down, Claude Haiku is the right safety net.
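To make the table concrete, here is the cost arithmetic at a hypothetical volume (2K-token prompts, 5,000 requests/day; the per-1K input prices come from the table above and will change over time):

```python
GROQ_PER_1K = 0.00008   # llama-3.3-70b-versatile input, $/1K tokens (from table)
HAIKU_PER_1K = 0.001    # claude-haiku-4-5 input, $/1K tokens (from table)

def monthly_cost(per_1k: float, tokens_per_request: int, requests_per_day: int) -> float:
    """Input-token spend over a 30-day month."""
    return per_1k * (tokens_per_request / 1000) * requests_per_day * 30

groq = monthly_cost(GROQ_PER_1K, 2000, 5000)    # $24/month
haiku = monthly_cost(HAIKU_PER_1K, 2000, 5000)  # $300/month
ratio = haiku / groq                             # 12.5x
```

This only counts input tokens; output tokens are priced separately on both providers, so treat these numbers as a floor, not a budget.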
The next logical steps are: adding conversation history (store messages in a conversations table and pass last 4 turns into LlmService), adding a reranker (a second pass with a cross-encoder model after the vector search), and adding a document management API (DocumentController with file upload → dispatch IngestDocumentJob). Let me know which piece you want to build next.