AI Engineering Basics

What Is an LLM?

At its core, a Large Language Model (LLM) is a type of artificial intelligence trained to understand, generate, and manipulate human language. Think of it as a highly advanced version of "autocomplete" that has read nearly the entire internet to learn the patterns of how humans communicate.

Types of LLMs

Commercial Models vs. Open-Source Models

| Feature | Commercial (GPT-5.5, Claude 4) | Open-Source (Llama 4, Mistral) |
| --- | --- | --- |
| Setup | Immediate (API key) | Complex (hardware/cloud) |
| Privacy | Shared with provider | Complete privacy |
| Customization | Limited (fine-tuning APIs) | Total (full weight access) |
| Cost | Pay-per-token | Upfront hardware/server costs |


History 

ChatGPT launched on 30 November 2022.

AI Engineer vs. Machine Learning Engineer

| Feature | ML Engineer | AI Engineer |
| --- | --- | --- |
| Primary Goal | Improve model accuracy and performance. | Build a functional AI-powered product. |
| Data Type | Mostly structured (tabular, logs, DBs). | Mostly unstructured (text, images, audio). |
| Key Tools | PyTorch, TensorFlow, Scikit-learn, SQL. | LangChain, LlamaIndex, Vector DBs, APIs. |
| Math Depth | High (Linear Algebra, Calculus, Stats). | Moderate (Probabilities, Logic, RAG patterns). |
| Day-to-Day | Feature engineering, training, MLOps. | Prompt engineering, RAG pipelines, Agents. |

How to Utilize AI Skills in Coding

Architectural Guidance: Instead of "write a function," use LLMs to design complex database schemas or multi-tenant architectures for your school management app.

Migration Support: Speed up framework upgrades (like .NET 10 migrations) by using AI to identify breaking changes and automate boilerplate refactoring across large codebases.

Context-Aware Development: Use AI-native IDEs (like Cursor) or terminal agents. These tools "read" your entire repository, allowing you to ask, "Where should I add the parent-teacher notification logic?"

Automated Testing: Feed your business logic to an LLM to generate comprehensive unit and integration tests, ensuring your SaaS remains stable.

Documentation: Generate high-quality README.md or API docs from your code instantly.

The Golden Rule: Treat AI like a fast, tireless Junior Developer. It handles the "how" (syntax), while you maintain the "why" (business logic, security, and performance). Always review its output—AI is great at writing code, but you are responsible for the architecture.
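For instance, the automated-testing idea above boils down to assembling a prompt from your source code. A minimal sketch; the system/user message shape is the common chat convention, and the exact prompt wording here is illustrative, not prescriptive:

```python
def build_test_prompt(source_code: str, framework: str = "pytest") -> list[dict]:
    """Assemble a chat-style prompt asking an LLM to generate unit tests.

    Uses the common system/user message convention; the wording is one
    reasonable pattern, not a fixed API.
    """
    return [
        {"role": "system",
         "content": f"You are a senior engineer. Write thorough {framework} "
                    "unit tests covering edge cases and failure paths."},
        {"role": "user",
         "content": f"Generate tests for this code:\n\n{source_code}"},
    ]

prompt = build_test_prompt("def add(a, b):\n    return a + b")
print(prompt[0]["role"], "->", prompt[1]["role"])  # system -> user
```

You would send this message list to your provider's chat endpoint and review the generated tests before committing them, per the Golden Rule above.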

What Is a Token?

In AI, a token is the fundamental unit of text that a Large Language Model processes. Instead of reading sentences as words or letters, the model breaks text into these manageable chunks.
How It Works

The Breakdown: A token can be a single character, a part of a word (like "ing"), or a whole common word (like "apple"). For example, the word "friendship" might be split into two tokens: friend and ship.

Numerical Conversion: Computers can't understand text directly. Each token is mapped to a specific ID (number). The model then performs complex math on these numbers to predict which token should come next.


Efficiency: On average, 1,000 tokens represent about 750 words.
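The breakdown and numerical-conversion steps above can be sketched with a toy tokenizer. Real tokenizers (e.g. byte-pair encoding) learn their vocabulary from data; the tiny hard-coded vocabulary here is invented purely for illustration:

```python
# Toy tokenizer: maps sub-word chunks to integer IDs by greedily matching
# the longest known chunk first. The vocabulary is made up for this demo.
VOCAB = {"friend": 0, "ship": 1, "ing": 2, "apple": 3, "s": 4}

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for text starting at {text[i:]!r}")
    return ids

print(tokenize("friendship"))  # [0, 1] -> the chunks 'friend' + 'ship'
```

This mirrors the "friendship" example above: the model never sees the word, only the ID sequence `[0, 1]`.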

Why It Matters

Cost: Most commercial APIs (like GPT-4 or Gemini) charge you based on how many tokens you send (input) and receive (output).

Context Window: Every model has a "memory limit" called a context window (e.g., 128k tokens). Once a conversation exceeds this limit, the model starts "forgetting" the earliest parts of the chat.

Language Patterns: Tokenization helps the model recognize patterns across different languages and even within code snippets.
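The cost point above is simple arithmetic over token counts. A sketch using made-up per-token prices; check your provider's current price sheet, as the numbers below are placeholders:

```python
# Estimate an API call's cost from token counts.
# These prices are hypothetical placeholders, not any provider's real rates.
PRICE_PER_1K_INPUT = 0.005   # USD per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 0.015  # USD per 1,000 output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000 * PRICE_PER_1K_INPUT
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT)

# ~1,000 tokens in (~750 words), ~500 tokens out:
# 1.0 * 0.005 + 0.5 * 0.015 = 0.0125
print(f"${estimate_cost(1000, 500):.4f}")  # $0.0125
```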

Context Window 

The Context Window is the "short-term memory" of an AI model. It defines the maximum amount of information (measured in tokens) the model can "see" and process at one single time.

How it works: When you chat with an AI, the window contains your current prompt, the history of that conversation, and any uploaded documents.

The Limit: Once the conversation exceeds this limit, the model "forgets" the earliest parts of the chat to make room for new data.

Impact:
A larger window allows for analyzing entire codebases or long books without losing track of the details.
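The "forgetting" described above is usually implemented by the application, not the model: the client trims the oldest messages until the history fits the window. A minimal sketch that approximates token counts by word count (production code would use the model's real tokenizer):

```python
def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest messages until the rough token count fits the window.

    Token counts are approximated by word count here for simplicity.
    """
    def count(msg: dict) -> int:
        return len(msg["content"].split())

    trimmed = list(messages)
    while trimmed and sum(count(m) for m in trimmed) > max_tokens:
        trimmed.pop(0)  # forget the earliest message first
    return trimmed

history = [
    {"role": "user", "content": "first question about the schema"},
    {"role": "assistant", "content": "a long detailed answer " * 10},
    {"role": "user", "content": "latest follow-up"},
]
# With a 30-token budget, only the newest message survives.
print(len(trim_history(history, 30)))  # 1
```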

Choosing the Right Model

1. Task Complexity (The Reasoning Tier)

Identify where your task sits on the "Reasoning Spectrum":

Low Complexity: For boilerplate, unit tests, or simple CSS, use "Flash" models (e.g., Gemini 3.1 Flash). They are near-instant and cost pennies.

High Complexity: For multi-file refactors or complex database migrations (like .NET 8 to 10), use "Reasoning" models (e.g., Claude Opus 4.7 or GPT-5.5). These "think" before they code, reducing expensive logic errors.

2. Context Window Needs

How much of your codebase does the model need to "see" at once?

Small (up to 128k tokens): Sufficient for single-file debugging or small snippets.

Massive (1M - 10M tokens): Essential for whole-repo analysis. Models like Gemini 3 Pro or Llama 4 Scout allow you to upload your entire SaaS project so the AI understands all cross-file dependencies.

3. Agentic Reliability (Tool Use)

If you want the AI to actually run code, fix bugs in the terminal, or check Jira tickets, you need high Tool-Use accuracy.

Check benchmarks like SWE-bench Verified. In 2026, models optimized for "Agentic Workflows" (like Claude 4.5 Sonnet) lead because they can handle "looping" tasks—trying a fix, running a test, failing, and trying again—without human intervention.
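The "looping" behavior described above is, at its core, a bounded loop around the model and the test runner. A sketch with stubbed-in functions — `propose_fix` and `run_tests` stand in for a real LLM call and a real test command:

```python
def agent_fix_loop(bug_report: str, propose_fix, run_tests, max_attempts: int = 3):
    """Try a fix, run the tests, and retry on failure, up to max_attempts.

    propose_fix(report, last_failure) -> patch string  (stands in for an LLM call)
    run_tests(patch) -> (passed, failure_log)          (stands in for e.g. pytest)
    """
    last_failure = ""
    for attempt in range(1, max_attempts + 1):
        patch = propose_fix(bug_report, last_failure)
        passed, last_failure = run_tests(patch)
        if passed:
            return patch, attempt
    return None, max_attempts  # give up; escalate to a human

# Stub "model" that succeeds on its second try, once it sees the failure log.
def fake_fix(report, failure):
    return "patch-v2" if failure else "patch-v1"

def fake_tests(patch):
    return (True, "") if patch == "patch-v2" else (False, "AssertionError")

patch, attempts = agent_fix_loop("NPE in login", fake_fix, fake_tests)
print(patch, attempts)  # patch-v2 2
```

Feeding the failure log back into the next attempt is what lets agentic models self-correct; the `max_attempts` cap is what keeps them from looping forever.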

4. Privacy vs. Performance

Commercial (API): Best for cutting-edge features and zero server maintenance. However, data privacy is governed by the provider.

Open-Source (Local): Best for total data sovereignty. Running Llama 4 locally ensures your proprietary school management logic never leaves your machine, though it requires powerful GPUs (e.g., NVIDIA H100s).
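The four criteria above collapse naturally into a simple routing function. A sketch — the tier names and the 128k-token threshold are illustrative placeholders, not recommendations:

```python
def choose_model(complexity: str, context_tokens: int, needs_privacy: bool) -> str:
    """Pick a model tier from the criteria above. Tier names are placeholders."""
    if needs_privacy:
        return "local-open-source"      # e.g. a locally hosted Llama
    if context_tokens > 128_000:
        return "long-context-model"     # whole-repo analysis tier
    if complexity == "high":
        return "reasoning-model"        # multi-file refactors, migrations
    return "flash-model"                # boilerplate, tests, simple CSS

print(choose_model("low", 2_000, False))     # flash-model
print(choose_model("high", 500_000, False))  # long-context-model
```

The ordering encodes a policy: privacy requirements trump everything, then context size, then reasoning depth. Your own priorities may differ.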

Important Core Parameters

When making an API call, these parameters act as the "control knobs" for the AI's behavior. Mastering them is essential for balancing creativity with reliability in your software.

model: Specifies which "brain" to use (e.g., gpt-5.5 or claude-4.7). Higher models are smarter but more expensive.

messages / input: The conversation history. Includes roles like system (instructions), user (queries), and assistant (previous replies).

temperature: Controls randomness.
  • 0.0: Deterministic (ideal for coding/math).
  • 0.7+: Creative (ideal for brainstorming/storytelling).
max_tokens / max_output: Sets a hard limit on the response length to control costs and prevent "runaway" text.

top_p (Nucleus Sampling): An alternative to temperature. It limits the model to a percentage of the most likely word choices (e.g., 0.1 means only consider the top 10% of possibilities).

frequency_penalty & presence_penalty: Prevents the model from repeating the same words or topics too often.

stop: Sequences (like "\n" or "End") that tell the model to stop generating immediately.

response_format: Forces the output into a specific structure, such as JSON mode, which is critical for parsing data into your app’s frontend.
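Put together, a request body using these knobs looks like the dictionary below. The field names follow the widely used OpenAI-style chat schema; `gpt-5.5` is the model name used elsewhere in this article, and the values are tuned for a deterministic, JSON-producing task:

```python
import json

# A chat-completion request body exercising the core parameters above.
# Field names follow the common OpenAI-style schema; values are a sketch.
request = {
    "model": "gpt-5.5",                  # which "brain" to use
    "messages": [
        {"role": "system", "content": "You are a precise coding assistant."},
        {"role": "user",
         "content": "Return JSON with a 'slug' field for the title 'Hello World'."},
    ],
    "temperature": 0.0,                  # deterministic: ideal for code
    "max_tokens": 800,                   # hard cap on response length
    "top_p": 1.0,                        # leave nucleus sampling wide open
    "frequency_penalty": 0.0,            # no repetition penalties needed here
    "presence_penalty": 0.0,
    "stop": ["\nEnd"],                   # cut generation at this sequence
    "response_format": {"type": "json_object"},  # force parseable JSON output
}

print(json.dumps(request, indent=2)[:40])
```

Note that temperature and top_p both shape randomness; common practice is to adjust one and leave the other at its default.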
