Bharat MiniGPT 350M: A Custom GPT-Style LLM Built from Scratch in India
The AI industry today is dominated by massive language models from companies like OpenAI, Google, and Meta. Most public AI projects are either fine-tuned versions of existing models or lightweight wrappers built on top of already available architectures.
However, some developers are taking a far more challenging route — building transformer architectures and training pipelines from scratch.
One such project is Bharat MiniGPT 350M, developed by HVM Smart Solutions (Harshvardhan Mishra).
Unlike many “custom AI models” available online, Bharat MiniGPT 350M is not simply a fine-tuned GPT-2 or LLaMA derivative. Its transformer architecture, training logic, attention system, normalization layers, and dataset streaming pipeline were manually implemented in PyTorch before later being integrated into the Hugging Face ecosystem.
What is Bharat MiniGPT 350M?
Bharat MiniGPT 350M is a custom decoder-only Transformer-based causal language model designed for foundational LLM experimentation, architecture research, and large-scale language model training.
The project focuses on understanding and implementing the core mechanics behind modern GPT-style systems, including:
- Transformer architecture engineering
- Attention optimization
- Language model pretraining
- Training stability
- Efficient inference systems
- KV-cache compatible generation
- Gradient checkpointing
- Streaming datasets
- Hugging Face compatibility
Rather than being a production chatbot, the project is currently an evolving base pretrained model focused on foundational AI engineering.
Model Specifications
| Feature | Details |
|---|---|
| Model Name | Bharat MiniGPT 350M |
| Parameters | ~350 Million |
| Architecture | Custom Decoder-only Transformer |
| Training Tokens | 3 Billion |
| Framework | PyTorch |
| HF Compatibility | Added Later |
| Developer | Harshvardhan Mishra |
| Organization | HVM Smart Solutions |
Architecture Overview
The model uses several modern transformer design concepts commonly seen in advanced LLM architectures.
| Component | Details |
|---|---|
| Transformer Layers | 24 |
| Attention Heads | 16 |
| Embedding Size | 1024 |
| Context Length | 768 Tokens |
| Vocabulary Size | 50,257 |
| Positional Encoding | RoPE |
| Normalization | RMSNorm |
| Feed Forward Network | SwiGLU |
| Attention | SDPA / Flash Attention Compatible |
| Weight Tying | Yes |
| Precision | FP16 |
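For reference, these hyperparameters can be collected into a small configuration object. The sketch below simply mirrors the table; the field names are illustrative and are not taken from the project's code.

```python
from dataclasses import dataclass

# Illustrative configuration mirroring the published specs;
# field names are assumptions, not the project's actual code.
@dataclass
class MiniGPTConfig:
    n_layers: int = 24          # transformer decoder blocks
    n_heads: int = 16           # attention heads
    d_model: int = 1024         # embedding / hidden size
    max_seq_len: int = 768      # context length in tokens
    vocab_size: int = 50257     # GPT-2 style vocabulary
    tie_weights: bool = True    # share input embedding and LM head weights
```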
A Truly Custom Transformer Implementation
One of the most important aspects of Bharat MiniGPT 350M is that many core transformer systems were implemented manually instead of relying entirely on prebuilt abstractions.
This includes:
- Custom RMSNorm implementation
- Manual RoPE positional embedding logic
- Custom SwiGLU feed-forward blocks
- Self-written attention modules
- Decoder transformer blocks
- Custom token generation pipeline
- Streaming dataset architecture
- Manual cosine LR scheduler
- Gradient checkpointing integration
The project later added Hugging Face compatibility for easier deployment and ecosystem support.
This distinction is important because many public “custom models” are actually fine-tunes of existing transformer implementations, while Bharat MiniGPT involved architecture-level engineering from the ground up.
Custom RMSNorm Implementation
The project includes a manually implemented RMSNorm layer instead of relying solely on built-in transformer utilities.
RMSNorm has become increasingly popular in modern LLMs because it is computationally lightweight and can improve training stability.
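The project's own layer is not reproduced here, but a typical RMSNorm in PyTorch looks roughly like this minimal sketch:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescales activations by their RMS,
    with a learnable gain but no mean subtraction and no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rms(x) = sqrt(mean(x^2)); normalize, then apply the learned gain
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```

Because there is no mean statistic to track, the layer is cheaper than LayerNorm while behaving similarly in practice.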
Manual RoPE Positional Embeddings
Rotary Position Embedding (RoPE) was also manually implemented inside the project.
RoPE is widely used in modern transformer architectures because it helps models better capture positional relationships within sequences and improves long-context behavior.
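A standard RoPE formulation, shown here as a generic sketch rather than the project's exact code, precomputes per-position rotation angles and applies them to the query and key vectors:

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0):
    """Precompute cos/sin tables for rotary position embeddings."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    freqs = torch.outer(positions, inv_freq)          # (seq_len, head_dim / 2)
    return freqs.cos(), freqs.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of q or k; x has shape (..., seq_len, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```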
SwiGLU Feed Forward Layers
The feed-forward network uses SwiGLU activation logic implemented directly in PyTorch.
SwiGLU-based architectures are commonly used in newer generation language models because they improve expressiveness and learning efficiency.
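A minimal SwiGLU block in PyTorch, using the common gate/up/down projection layout (the hidden width and exact layout in the project may differ), looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), projected back to d_model."""
    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, hidden, bias=False)
        self.w_up = nn.Linear(d_model, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```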
Attention System and Flash Attention Compatibility
The attention module was manually implemented using scaled dot-product attention combined with RoPE integration.
The architecture is also compatible with Flash Attention-style optimizations, which can significantly improve inference and training efficiency.
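PyTorch's `F.scaled_dot_product_attention` dispatches to Flash Attention-style kernels when they are available, which is what makes an SDPA-based module "Flash Attention compatible". A generic causal attention sketch built on it (not the project's module) looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention built on F.scaled_dot_product_attention."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim); RoPE would be applied to q and k here
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))
```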
Hugging Face Compatibility Was Added Later
A key technical detail of the project is that Bharat MiniGPT was initially built as a standalone PyTorch transformer system.
Later, Hugging Face compatibility was integrated to support:
- Easier inference
- Standardized loading
- generate() support
- Deployment workflows
- Community model sharing
- Integration with HF tooling
This means the original focus of the project was architecture engineering and training infrastructure rather than simply wrapping an existing HF model.
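With HF compatibility in place, loading and sampling follow the usual transformers workflow. The repository ID below is a placeholder rather than the confirmed path, and trust_remote_code may or may not be required depending on how the custom architecture is packaged:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo ID; substitute the model ID actually published by the project.
repo_id = "hvmsmartsolutions/bharat-minigpt-350m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("The history of Indian railways", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```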
Training Data
The model was trained using a weighted mixture of large-scale datasets:
| Dataset | Weight |
|---|---|
| HuggingFaceFW/fineweb (sample-10BT) | 40% |
| HuggingFaceFW/fineweb-edu (sample-10BT) | 30% |
| Wikimedia Wikipedia | 30% |
| TinyStories and some book corpus | 5-10% (used only briefly during training) |
The project also includes a custom streaming dataset pipeline for handling large-scale token generation efficiently.
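A common way to build such a pipeline, shown here only as a plausible sketch rather than the project's actual code, is Hugging Face datasets in streaming mode combined with weighted interleaving:

```python
from datasets import load_dataset, interleave_datasets

# Streamed datasets are iterated lazily, so the full corpora never need to fit on disk.
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True)
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

# Weighted mixture roughly matching the published 40/30/30 split.
mixed = interleave_datasets([fineweb, fineweb_edu, wiki], probabilities=[0.4, 0.3, 0.3], seed=42)

for example in mixed.take(3):
    print(example["text"][:200])
```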
Training Configuration
| Setting | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 3e-4 |
| Minimum LR | 3e-5 |
| Warmup Steps | 51,200 |
| LR Scheduler | Cosine Decay |
| Gradient Accumulation | 128 |
| Mixed Precision | FP16 |
| Gradient Clipping | 1.0 |
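The warmup-plus-cosine schedule in the table is simple to implement by hand. In the sketch below, the warmup length and learning-rate bounds follow the table, while the total decay horizon is an assumed placeholder:

```python
import math

def lr_at_step(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
               warmup_steps: int = 51_200, total_steps: int = 500_000) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr.
    total_steps is an illustrative value, not the project's setting."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Applied to an AdamW optimizer before each step, e.g.:
# for group in optimizer.param_groups:
#     group["lr"] = lr_at_step(step)
```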
Engineering Features
The project includes several advanced engineering features:
- Custom GPT architecture
- RoPE positional embeddings
- RMSNorm normalization
- SwiGLU feed-forward layers
- Flash Attention compatible SDPA
- Gradient checkpointing
- Weight tying
- Streaming datasets
- KV-cache compatible generation
- Mixed precision FP16 training
- Manual checkpoint recovery system
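Two of these features are easy to illustrate in isolation: weight tying shares a single matrix between the token embedding and the output head, and gradient checkpointing recomputes block activations during the backward pass instead of storing them. The sketch below uses a generic stand-in block, not the project's custom decoder layer:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size: int = 50257, d_model: int = 1024, n_layers: int = 24):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)  # stand-in for the custom block
            for _ in range(n_layers)
        ])
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # weight tying: one shared matrix

    def forward(self, idx: torch.Tensor, use_checkpointing: bool = True) -> torch.Tensor:
        x = self.tok_emb(idx)
        for block in self.blocks:
            # recompute this block's activations in backward instead of caching them
            x = checkpoint(block, x, use_reentrant=False) if use_checkpointing else block(x)
        return self.lm_head(x)
```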
Current Stage: Base Pretrained Model
It is important to understand that Bharat MiniGPT 350M is currently:
- A base pretrained model
- Trained on 3B tokens
- Not instruction-tuned yet
- Not RLHF aligned
- Still under active experimentation
This means the model is not intended to directly compete with systems like ChatGPT, Gemini, or Claude at its current stage.
The focus right now is foundational language learning and transformer experimentation.
Benchmark Results
The project was evaluated using the EleutherAI LM Evaluation Harness.
| Task | Metric | Score |
|---|---|---|
| ARC Easy | acc | 0.3312 |
| HellaSwag | acc | 0.2650 |
| PIQA | acc | 0.5631 |
These results represent the current 3B-token pretrained checkpoint.
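For context, comparable scores can be reproduced with the harness's Python entry point; the checkpoint path below is a placeholder for the local or hub model:

```python
import lm_eval

# "path/to/bharat-minigpt-350m" is a placeholder, not the published checkpoint path.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/bharat-minigpt-350m",
    tasks=["arc_easy", "hellaswag", "piqa"],
    batch_size=8,
)
print(results["results"])
```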
Why Projects Like This Matter
Building a transformer model from scratch is significantly more difficult than simply fine-tuning an existing model.
It requires solving multiple engineering challenges, including:
- Training stability
- Memory optimization
- Gradient scaling
- Precision handling
- Attention efficiency
- Dataset streaming
- Checkpoint recovery
- Generation stability
- GPU memory management
Independent projects like Bharat MiniGPT help expand practical AI engineering knowledge and experimentation.
Future Improvements Planned
Several future improvements are planned for the project:
Better Tokenizer Strategy
Tokenizer quality directly affects language understanding and output coherence.
Larger Training Token Count
Additional pretraining beyond 3B tokens could significantly improve model capability.
Instruction Tuning
Future conversational fine-tuning may improve assistant-like behavior.
Better Inference Optimization
Future ONNX, quantization, and KV-cache optimizations are possible.
Indian Language Expansion
Support for Hindi and other Indian languages may improve over time.
Lightweight Models Still Matter
While the AI industry is focused heavily on massive multi-billion parameter models, smaller models still offer important advantages:
- Lower hardware requirements
- Faster experimentation
- Easier debugging
- Edge AI deployment potential
- Browser inference possibilities
- Lower inference costs
This is one reason compact transformer research remains valuable.
Building a 350M Parameter LLM on Free Kaggle T4 GPUs
One of the most impressive aspects of Bharat MiniGPT 350M is that the model was trained without expensive AI supercomputers or large enterprise GPU clusters. Instead, the project was developed primarily using the free-tier environment on Kaggle with NVIDIA T4 GPUs. Despite limited hardware resources, the project achieved 3 billion token pretraining through heavy optimization techniques such as gradient accumulation, FP16 mixed precision training, streaming datasets, checkpoint recovery systems, and memory-efficient transformer engineering. This demonstrates that modern LLM experimentation is no longer limited only to large corporations with massive budgets — independent developers can still build meaningful AI systems by combining efficient engineering with persistence and smart optimization strategies.
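To make that optimization recipe concrete, the sketch below shows a generic FP16 training step with 128-step gradient accumulation and gradient clipping at 1.0, matching the published configuration. It uses a toy stand-in model, assumes a CUDA device such as a T4, and is not the project's actual training loop:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in model so the sketch is self-contained; the real project uses its custom GPT.
model = nn.Sequential(nn.Embedding(50257, 256), nn.Linear(256, 50257)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()   # FP16 loss scaling to avoid gradient underflow
accum_steps = 128                      # matches the published gradient accumulation setting

for step in range(accum_steps):
    tokens = torch.randint(0, 50257, (2, 128), device="cuda")   # synthetic micro-batch
    with torch.cuda.amp.autocast(dtype=torch.float16):
        logits = model(tokens)
        loss = F.cross_entropy(logits.view(-1, 50257), tokens.view(-1)) / accum_steps

    scaler.scale(loss).backward()      # gradients accumulate across micro-batches

    if (step + 1) % accum_steps == 0:  # one optimizer step per 128 micro-batches
        scaler.unscale_(optimizer)
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip at 1.0, as configured
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```

Dividing the loss by the accumulation count keeps the effective gradient equivalent to one large batch, which is how a small GPU can emulate the batch sizes normally reserved for much larger hardware.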
Final Thoughts
Bharat MiniGPT 350M represents an interesting independent AI engineering effort coming from India.
Its transformer architecture, RoPE implementation, RMSNorm layers, attention system, and training pipeline were manually developed in PyTorch before later being adapted for Hugging Face compatibility.
Although the model is still in its early pretrained stage and requires further refinement, it demonstrates how independent developers can explore foundational LLM engineering beyond simple fine-tuning workflows.

