NLP Problems: Detailed Standard Solution Procedures

Overview

This document provides detailed, industry-standard procedures for solving different NLP problems. Each problem type has specific challenges and established best practices used in production systems.

1. Text Classification

Problem Description

Classify text into predefined categories (sentiment, topic, spam, etc.)

Standard Solution Procedure

Phase 1: Data Preparation

1. Data Collection:

Collect labeled dataset
Ensure class balance (or handle imbalance)
Split: Train (70%), Validation (15%), Test (15%)

2. Text Preprocessing:

Steps:
1. Lowercasing (usually)
2. Remove special characters (optional, domain-dependent)
3. Remove URLs, emails, phone numbers
4. Handle contractions ("don't" → "do not")
5. Remove stop words (optional, depends on task)
6. Stemming/Lemmatization (optional)
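
The preprocessing steps above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the contraction map, the stop-word set, and the regular expressions are small assumptions chosen for the example, and `preprocess` is a hypothetical helper name.

```python
import re

# Hypothetical, deliberately tiny resources; real ones would be much larger.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOP_WORDS = {"the", "a", "an", "is", "it", "to", "at", "me"}

def preprocess(text, remove_stop_words=False):
    text = text.lower()
    # Strip URLs, e-mail addresses, and phone-like digit runs.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+\.\S+", " ", text)
    text = re.sub(r"\+?\d[\d\s().-]{6,}\d", " ", text)
    # Expand contractions before punctuation is removed.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Drop remaining special characters, keeping letters, digits, and spaces.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens
```

Stemming or lemmatization is left out here because it needs a language-specific resource.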

3. Handle Class Imbalance:

Oversampling: SMOTE, ADASYN (applied to feature vectors such as TF-IDF, not to raw text)
Undersampling: Random undersampling
Class weights: Weight loss by class frequency
Data augmentation: Paraphrasing, back-translation
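
As an illustration of the class-weight option, one common heuristic weights each class by the inverse of its frequency. The sketch below assumes the formula n_samples / (n_classes * class_count); `class_weights` is a hypothetical helper name.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}
```

The resulting weights multiply the per-example loss, so errors on rare classes cost more.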

Phase 2: Feature Extraction

Option A: Traditional ML (TF-IDF, Count Vectors)

1. Create vocabulary from training data
2. Compute TF-IDF for documents
3. Feature matrix: (n_documents, vocab_size)
4. Use: Naive Bayes, SVM, Logistic Regression
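
A minimal pure-Python sketch of steps 1-3, assuming documents are already tokenized. The smoothed IDF formula used here is one common variant, chosen as an assumption; `fit_idf` and `tfidf_vector` are hypothetical names.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Build the vocabulary and smoothed IDF weights from tokenized training docs."""
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf

def tfidf_vector(doc, vocab, idf):
    """One row of the (n_documents, vocab_size) matrix; unseen terms are ignored."""
    counts = Counter(t for t in doc if t in idf)
    return [counts[t] * idf[t] for t in vocab]
```

The vocabulary and IDF weights come from the training split only, so validation and test documents cannot leak into the features.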

Option B: Word Embeddings (Word2Vec, GloVe)

1. Pre-trained embeddings (Word2Vec, GloVe)
2. Average embeddings for document
3. Or use embeddings as features
4. Use: Traditional ML or simple neural networks

Option C: Contextual Embeddings (BERT, etc.)

1. Fine-tune BERT/RoBERTa on task
2. Use [CLS] token embedding
3. Or average all token embeddings
4. Use: Fine-tuned transformer

Phase 3: Model Selection

For Small Datasets (< 10K samples):

TF-IDF + Naive Bayes: Fast, interpretable
TF-IDF + SVM: Good performance
TF-IDF + Logistic Regression: Interpretable, good baseline

For Medium Datasets (10K - 100K):

TF-IDF + XGBoost: Strong performance
Word Embeddings + LSTM/CNN: Neural approach
Fine-tuned BERT: Best performance

For Large Datasets (> 100K):

Fine-tuned BERT/RoBERTa: State-of-the-art
DistilBERT: Faster, smaller
Large language models: GPT-3.5, Claude (few-shot; also an option when labeled data is scarce)

Phase 4: Training

Traditional ML:

1. Train on TF-IDF features
2. Hyperparameter tuning (C, kernel for SVM)
3. Cross-validation
4. Select best model

Neural Networks:

1. Initialize embeddings (pre-trained or random)
2. Train with:
   - Loss: Cross-entropy
   - Optimizer: Adam
   - Learning rate: 1e-3 to 1e-5
   - Batch size: 32-128
3. Early stopping on validation
4. Regularization: Dropout, L2

Fine-tuning BERT:

1. Load pre-trained BERT
2. Add classification head
3. Fine-tune with:
   - Learning rate: 2e-5 to 5e-5
   - Batch size: 16-32
   - Epochs: 3-5
   - Warmup steps: 10% of total
4. Use learning rate scheduling

Phase 5: Evaluation

Metrics:

Accuracy: Overall correctness
Precision, Recall, F1: Per class
Confusion Matrix: Error analysis
ROC-AUC: For binary classification

Multi-class:

Macro F1: Average F1 across classes
Micro F1: F1 over pooled counts (equals accuracy for single-label classification)
Weighted F1: Weighted by class frequency
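
The three averages differ only in how per-class results are combined. A small sketch for single-label predictions; `f1_report` is a hypothetical helper name.

```python
from collections import Counter

def f1_report(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    macro = sum(per_class.values()) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return {"per_class": per_class, "macro": macro, "micro": micro, "weighted": weighted}
```

Macro F1 treats every class equally, so it drops sharply when a rare class is handled badly.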

Phase 6: Deployment

Production Considerations:

Latency: TF-IDF + SVM is fast
Scalability: Batch processing for large volumes
Monitoring: Track accuracy, drift detection
A/B testing: Compare models

Industry Example: Sentiment Analysis

Problem: Classify movie reviews as positive/negative

Solution:

Data: IMDB dataset (50K reviews)
Preprocessing: Lowercase, remove HTML, tokenize
Features: TF-IDF or BERT embeddings
Model: Fine-tuned BERT (accuracy ~95%)
Deployment: API endpoint, batch processing

2. Named Entity Recognition (NER)

Problem Description

Identify and classify entities (person, location, organization, etc.) in text

Standard Solution Procedure

Phase 1: Data Format

BIO Tagging:

Sentence: "John Smith works at Google in California"

Tags:
John → B-PER (Beginning Person)
Smith → I-PER (Inside Person)
works → O (Outside)
at → O
Google → B-ORG (Beginning Organization)
in → O
California → B-LOC (Beginning Location)

Tag Set:

B-{label}: Beginning of entity
I-{label}: Inside entity
O: Outside (not an entity)
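
Evaluation and downstream code usually need entity spans rather than per-token tags. A minimal sketch of the conversion; `bio_to_spans` is a hypothetical helper, and treating a stray I- tag as the start of a new entity is an assumption.

```python
def bio_to_spans(tags):
    """Convert a list of BIO tags to (start, end_exclusive, label) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last entity
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # A stray I- tag without a matching B- is treated as a new entity.
            start, label = i, tag[2:]
    return spans
```

For the example sentence, the tags yield one PER span covering "John Smith", one ORG span, and one LOC span.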

Phase 2: Feature Engineering

Traditional Features:

1. Word features:
   - Current word
   - Previous word
   - Next word
   - Word shape (capitalization pattern)
   - Prefixes/suffixes

2. Context features:
   - Surrounding words
   - Position in sentence
   - Sentence length

3. Lexical features:
   - Is capitalized?
   - Is number?
   - Is punctuation?
   - Contains digits?

Embedding Features:

1. Word embeddings (Word2Vec, GloVe)
2. Character-level embeddings (for OOV words)
3. Context embeddings (ELMo, BERT)

Phase 3: Model Selection

Option A: CRF (Conditional Random Fields)

1. Features: Word + context features
2. Model: Linear chain CRF
3. Training: Maximum likelihood
4. Inference: Viterbi algorithm
5. Use: Traditional approach, interpretable
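
Viterbi inference can be sketched as dynamic programming over per-position tag scores. The dictionaries `emissions` and `transitions` stand in for scores a trained CRF would produce; their layout is an assumption made for this example.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[i][t]   : score of tag t at position i
    transitions[a][b] : score of moving from tag a to tag b
    """
    best = [{t: emissions[0][t] for t in tags}]
    back = []
    for scores in emissions[1:]:
        prev, cur, ptr = best[-1], {}, {}
        for t in tags:
            a = max(tags, key=lambda s: prev[s] + transitions[s][t])
            cur[t] = prev[a] + transitions[a][t] + scores[t]
            ptr[t] = a
        best.append(cur)
        back.append(ptr)
    # Trace the best path backwards from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

A large negative transition score (for example O to I-PER) keeps the decoder from producing invalid tag sequences.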

Option B: BiLSTM-CRF

1. BiLSTM: Captures context (bidirectional)
2. CRF: Ensures valid tag sequences
3. Architecture:
   - Embedding layer
   - BiLSTM layer(s)
   - CRF layer
4. Use: Usually more accurate than a feature-based CRF alone, without hand-crafted features

Option C: Fine-tuned BERT

1. Fine-tune BERT for token classification
2. Add classification head per token
3. Use: State-of-the-art performance
4. Example: spaCy transformers, HuggingFace

Phase 4: Training

CRF Training:

1. Define feature functions
2. Maximum likelihood estimation
3. L-BFGS optimization
4. Regularization (L1/L2)

BiLSTM-CRF Training:

1. Initialize embeddings (pre-trained)
2. Train with:
   - Loss: Negative log-likelihood
   - Optimizer: Adam
   - Learning rate: 0.001
   - Dropout: 0.5
3. Early stopping

BERT Fine-tuning:

1. Load pre-trained BERT
2. Add token classification head
3. Fine-tune with:
   - Learning rate: 3e-5
   - Batch size: 16
   - Epochs: 3-5
4. Use token-level labels

Phase 5: Evaluation

Metrics:

Entity-level F1: Exact match required
Token-level F1: F1 computed over individual token tags (more lenient than entity-level)
Precision, Recall: Per entity type

Evaluation:

Strict: Exact match (boundaries + type)
Partial: Partial overlap counted
Type: Type must match
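
Strict entity-level scoring reduces to set intersection over (start, end, label) tuples. A minimal sketch for a single sentence; for a corpus, a sentence id would be added to each tuple. `entity_f1` is a hypothetical helper name.

```python
def entity_f1(gold_spans, pred_spans):
    """Strict scores: a prediction counts only if boundaries and type both match."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

Predicting only "John" for the gold entity "John Smith" scores as both a false positive and a false negative under this scheme.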

Phase 6: Handling Challenges

Out-of-Vocabulary (OOV) Words:

Solution: Character-level embeddings
Subword tokenization (BPE, WordPiece)
Contextual embeddings (BERT handles OOV through WordPiece subwords)

Nested Entities:

Problem: "New York University" (location + organization)
Solution: Multi-label tagging, span-based models

Ambiguity:

Problem: "Apple" (company vs fruit)
Solution: Use context, larger window

Industry Example: Medical NER

Problem: Extract medical entities from clinical notes

Solution:

Data: Annotated clinical notes
Entities: Disease, Medication, Symptom, etc.
Model: Fine-tuned BioBERT (domain-specific BERT)
Features: Medical terminology, context
Evaluation: Entity-level F1 ~90%

3. Question Answering (QA)

Problem Description

Answer questions based on given context (reading comprehension)

Standard Solution Procedure

Phase 1: Data Format

SQuAD Format:

{
  "context": "The cat sat on the mat.",
  "question": "Where did the cat sit?",
  "answers": [
    {"text": "on the mat", "answer_start": 12}
  ]
}
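
In this format, `answer_start` is a 0-based character offset into the context, and it is easy to get wrong by hand. A small sanity check, sketched with a hypothetical helper `check_answer_offsets`:

```python
def check_answer_offsets(example):
    """Return True if every answer text is found at its answer_start offset."""
    context = example["context"]
    return all(
        context[a["answer_start"]:a["answer_start"] + len(a["text"])] == a["text"]
        for a in example["answers"]
    )

context = "The cat sat on the mat."
answer = "on the mat"
example = {
    "context": context,
    "question": "Where did the cat sit?",
    # Compute the offset rather than typing it by hand.
    "answers": [{"text": answer, "answer_start": context.find(answer)}],
}
```

Running such a check over the whole dataset before training catches misaligned annotations early.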

Types:

Extractive QA: Answer is span in context
Abstractive QA: Generate answer (need not be a verbatim span of the context)
Multiple choice: Select from options
Open-domain: No context provided (retrieval needed)

Phase 2: Model Architecture

Extractive QA (Most Common):

Option A: BERT-based (Standard)

1. Input: [CLS] question [SEP] context [SEP]
2. BERT encoder
3. Two output heads:
   - Start position: Probability for each token being start
   - End position: Probability for each token being end
4. Training: Cross-entropy for start/end positions

Option B: BiDAF (Bidirectional Attention Flow)

1. Context and question encoders
2. Attention flow layer (bidirectional)
3. Modeling layer
4. Output layer (start/end)

Option C: Fine-tuned BERT/RoBERTa

1. Load pre-trained model
2. Add QA head (start/end positions)
3. Fine-tune on QA dataset
4. Use: State-of-the-art

Phase 3: Training

BERT QA Training:

1. Load pre-trained BERT
2. Add QA head:
   - Start logits: Linear(context_hidden_size, 1)
   - End logits: Linear(context_hidden_size, 1)
3. Loss:
   - Start loss: CrossEntropy(start_logits, start_label)
   - End loss: CrossEntropy(end_logits, end_label)
   - Total: start_loss + end_loss
4. Training:
   - Learning rate: 3e-5
   - Batch size: 16-32
   - Max sequence length: 512
   - Epochs: 2-3

Inference:

1. Encode: [CLS] question [SEP] context [SEP]
2. Get start/end logits
3. Find valid span (start <= end, within context)
4. Select span with highest start_score + end_score
5. Extract text from context
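
Span selection (steps 3-4) can be sketched as a search over start/end pairs. The logits are plain lists here. `context_start` and `max_answer_len` are hypothetical parameters, and excluding the final [SEP] token is left to the caller.

```python
def best_span(start_logits, end_logits, context_start, max_answer_len=30):
    """Pick (start, end) maximizing start_logit + end_logit with start <= end.

    context_start is the index of the first context token, so spans that
    begin on [CLS] or inside the question are never returned.
    """
    best, best_score = None, float("-inf")
    for s in range(context_start, len(start_logits)):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score
```

Allowing `e == s` keeps single-token answers valid. Taking the argmax of each head independently can produce an end that precedes the start, which this joint search avoids.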

Phase 4: Handling Long Contexts

Problem: Context longer than model limit (512 tokens)

Solutions:

1. Sliding Window:

- Split context into overlapping windows
- Answer each window
- Aggregate results

2. Hierarchical:

- Split into paragraphs
- Rank paragraphs by relevance
- Answer top-K paragraphs

3. Long-Context Models:

- Use models with larger context (32K, 100K+)
- More expensive, but avoids splitting the context
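
The sliding-window option can be sketched as a generator over token lists. The default sizes are illustrative assumptions, and `stride` here means the number of tokens shared by consecutive windows.

```python
def sliding_windows(tokens, window_size=384, stride=128):
    """Yield (offset, window) pairs; consecutive windows overlap by `stride` tokens."""
    if stride >= window_size:
        raise ValueError("stride (overlap) must be smaller than window_size")
    step = window_size - stride
    start = 0
    while True:
        yield start, tokens[start:start + window_size]
        if start + window_size >= len(tokens):
            break
        start += step
```

The offset lets a span predicted inside a window be mapped back to its position in the full context before results are aggregated.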

Phase 5: Evaluation

Metrics:

Exact Match (EM): Exact string match
F1 Score: Token-level overlap
Per-question-type: Accuracy by question type
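
A sketch of EM and token-level F1 with SQuAD-style answer normalization (lowercasing, removing punctuation and English articles). The function names are hypothetical.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred, ref = normalize(prediction).split(), normalize(gold).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

When a question has several gold answers, the usual convention is to take the maximum score over them.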

SQuAD 2.0:

Also handles unanswerable questions
Model must detect when answer not in context

Phase 6: Production Considerations

Challenges:

Long contexts: Use sliding window or long-context models
Unanswerable: Train to detect unanswerable
Multi-hop: Need reasoning across sentences

Solutions:

Retrieval: For open-domain QA
Re-ranking: Better context selection
Ensemble: Combine multiple models

Industry Example: Customer Support QA

Problem: Answer customer questions from knowledge base

Solution:

Retrieval: Find relevant KB articles (BM25 + Dense)
QA Model: Fine-tuned BERT for extractive QA
Pipeline: Retrieve → Rank → Answer
Fallback: Human agent if confidence low

4. Machine Translation

Problem Description

Translate text from one language to another

Standard Solution Procedure

Phase 1: Data Preparation

Parallel Corpus:

Source-target sentence pairs
Example: English-French pairs
Quality: High-quality translations

Data Requirements:

Size: Millions of sentence pairs
Domain: Match target domain if possible
Quality: Professional translations preferred

Preprocessing:

1. Sentence segmentation
2. Tokenization (language-specific)
3. Subword tokenization (BPE, SentencePiece)
4. Normalization

Phase 2: Subword Tokenization

Why Subword?

Handle rare words
Reduce vocabulary size
Better generalization

BPE (Byte Pair Encoding):

1. Start with character vocabulary
2. Iteratively merge most frequent pairs
3. Create subword vocabulary
4. Example: "unhappiness" → ["un", "happiness"] (illustrative; actual splits depend on the learned merges)
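
The merge loop can be sketched on a toy word-frequency table. This is a simplified illustration: it omits end-of-word markers and breaks ties by first occurrence, both of which are assumptions of this sketch. `learn_bpe` is a hypothetical name.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = new_vocab.get(tuple(merged), 0) + freq
        vocab = new_vocab
    return merges, vocab
```

On the toy table {"low": 5, "lower": 2, "newest": 6, "widest": 3}, the first two merges build the subword "est" from "e"+"s" and then "es"+"t".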

SentencePiece:

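1. Tokenizer library implementing BPE and unigram language model segmentation
2. Works on raw text without pre-tokenization, so it handles multiple languages
3. Used in mT5, XLM-R (mBERT uses WordPiece instead)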

Phase 3: Model Architecture

Option A: Seq2Seq with Attention

Encoder:
- Bidirectional LSTM/GRU
- Encodes source sentence
- Output: Hidden states

Decoder:
- LSTM/GRU with attention
- Attends to encoder states
- Generates target sentence

Option B: Transformer (State-of-the-art)

1. Encoder: Self-attention on source
2. Decoder: Self-attention + cross-attention
3. Multi-head attention
4. Position encoding
5. Use: Best performance

Option C: Pre-trained Models

- mBART: Multilingual BART
- mT5: Multilingual T5
- Fine-tune on translation task

Phase 4: Training

Seq2Seq Training:

1. Teacher forcing: Use ground truth during training
2. Loss: Cross-entropy per token
3. Optimizer: Adam
4. Learning rate: 1e-3 to 1e-4

Transformer Training:

1. Pre-train on large corpus (optional)
2. Fine-tune on translation data
3. Training:
   - Learning rate: 1e-4
   - Warmup steps
   - Label smoothing
   - Dropout: 0.1

Decoding Strategies:

1. Greedy: Always pick highest probability
2. Beam search: Keep top-K candidates
3. Sampling: Sample from distribution
4. Length penalty: Prevent too short/long
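
A toy sketch of beam search with a simple length normalization. The `next_log_probs` callback is a hypothetical model interface, and dividing the score by length**alpha is one simplified form of length penalty, assumed here for illustration.

```python
import math

def beam_search(next_log_probs, bos, eos, beam_size=3, max_len=10, alpha=0.6):
    """Return the best finished sequence under a length-normalized score.

    next_log_probs(prefix) -> {token: log_prob} is a hypothetical model interface.
    """
    beams, finished = [([bos], 0.0)], []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished = finished or beams
    return max(finished, key=lambda c: c[1] / (len(c[0]) - 1) ** alpha)[0]

def toy_model(prefix):
    """Toy distribution in which the greedy first choice leads to a worse sequence."""
    table = {"<s>": {"a": 0.6, "b": 0.4}, "a": {"x": 0.5, "y": 0.5}, "b": {"z": 1.0}}
    probs = table.get(prefix[-1], {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}
```

With `beam_size=1` this reduces to greedy decoding and commits to "a" (sequence probability 0.3), while `beam_size=2` recovers the more probable "b z" path (0.4).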

Phase 5: Evaluation

Metrics:

BLEU: N-gram precision (most common)
METEOR: Considers synonyms
Human evaluation: Best but expensive

BLEU Calculation:

1. Clipped n-gram precision (n=1,2,3,4)
2. Geometric mean of the four precisions
3. Multiply by brevity penalty (penalizes outputs shorter than the reference)
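
A sentence-level sketch of the calculation, for a single reference and without smoothing. Real evaluations usually aggregate counts over the whole corpus; `sentence_bleu` is a hypothetical helper written for this example, not a library function.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU on token lists."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum((cand & ref).values())  # clip counts by the reference
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate that is a correct but truncated prefix of the reference has perfect n-gram precision and is penalized only through the brevity term.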

Phase 6: Production Considerations

Challenges:

Rare words: Use subword tokenization
Long sequences: Hierarchical attention
Low-resource languages: Multilingual models, transfer

Solutions:

Multilingual models: Train on multiple languages
Transfer learning: High-resource → low-resource
Back-translation: Generate synthetic data

Industry Example: Google Translate

Problem: Translate between 100+ languages

Solution:

Model: Large transformer (billions of parameters)
Data: Billions of parallel sentences
Multilingual: Single model for all languages
Zero-shot: Translate between languages not seen together

5. Text Summarization

Problem Description

Generate concise summary of long text

Types

Extractive:

Select important sentences from source
Preserves original wording
Easier, more factual

Abstractive:

Generate new sentences
More flexible, can paraphrase
Harder, risk of hallucination

Standard Solution Procedure

Extractive Summarization

Phase 1: Feature Extraction

Features for each sentence:
1. Position: Early sentences more important
2. Length: Medium-length sentences preferred
3. TF-IDF: High TF-IDF words
4. Sentence similarity: Similar to other sentences
5. Named entities: Contains important entities

Phase 2: Scoring

Score(sentence) = w₁×position + w₂×length + w₃×tfidf + ...

Or use learned weights (supervised)

Phase 3: Selection

1. Score all sentences
2. Select top-K sentences
3. Order by original position
4. Combine into summary
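
The score-select-reorder procedure can be sketched with two simple features. The position and keyword features, their weights, and the `keywords` argument are illustrative stand-ins for the feature set listed earlier, not a tuned system.

```python
def extractive_summary(sentences, k=2, w_position=1.0, w_keyword=1.0, keywords=()):
    """Score sentences, keep the top-k, and restore original order."""
    scored = []
    for i, sent in enumerate(sentences):
        position = 1.0 / (i + 1)  # earlier sentences score higher
        words = sent.lower().split()
        keyword = sum(w.strip(".,") in keywords for w in words) / max(len(words), 1)
        scored.append((w_position * position + w_keyword * keyword, i))
    top = sorted(scored, reverse=True)[:k]
    # Re-sort the selected sentences by their original index.
    return " ".join(sentences[i] for _, i in sorted(top, key=lambda x: x[1]))
```

In a supervised setup the hand-set weights would be replaced by a model trained to predict which sentences belong in the summary.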

Methods:

TextRank: Graph-based (PageRank on sentences)
LSTM-based: Learn to score sentences
BERT-based: Use BERT to score sentences

Abstractive Summarization

Phase 1: Model Architecture

Option A: Seq2Seq

Encoder: Encodes source document
Decoder: Generates summary
Attention: Focuses on relevant parts

Option B: Transformer

Encoder-Decoder transformer
Pre-trained: BART, T5
Fine-tune on summarization

Option C: Pre-trained Models

- BART: Denoising autoencoder
- T5: Text-to-text
- GPT-3.5: Few-shot summarization

Phase 2: Training

BART/T5 Fine-tuning:

1. Load pre-trained model
2. Fine-tune on summarization dataset
3. Training:
   - Loss: Cross-entropy
   - Learning rate: 3e-5
   - Max source: 1024 tokens
   - Max target: 128 tokens
   - Epochs: 3-5

Phase 3: Generation

Decoding:

1. Beam search (usually best)
2. Length penalty
3. Repetition penalty
4. Min/max length constraints

Phase 4: Post-processing

1. Remove repetition
2. Fix grammar
3. Ensure coherence
4. Validate facts (optional)

Phase 5: Evaluation

Metrics:

ROUGE-1/2/L: Recall-oriented
BLEU: Precision-oriented
Human evaluation: Best

ROUGE:

ROUGE-1: Word overlap
ROUGE-2: Bigram overlap
ROUGE-L: Longest common subsequence
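
A sketch of the recall variants of ROUGE-N and ROUGE-L on token lists. Published results often report the F-measure instead, and the function names here are hypothetical.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams found in the candidate."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_recall(candidate, reference):
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0
```

ROUGE-L rewards words that appear in the same order even when they are not adjacent, which ROUGE-2 does not.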

Phase 6: Challenges and Solutions

Long Documents:

Problem: Exceeds model context
Solution: Hierarchical encoding, chunking

Factual Consistency:

Problem: Model may generate incorrect facts
Solution: Fact checking, constrained generation

Repetition:

Problem: Model repeats phrases
Solution: Repetition penalty, coverage mechanism

Industry Example: News Summarization

Problem: Summarize news articles

Solution:

Model: Fine-tuned BART
Data: CNN/DailyMail dataset
Input: Article (up to 1024 tokens)
Output: Summary (3-4 sentences)
Evaluation: ROUGE-L ~40

6. Natural Language to Code (NL2Code)

Problem Description

Generate code from natural language description

Standard Solution Procedure

See nl2code_detailed.py for complete implementation!

Key Challenges:

Large schemas: Schema pruning
Complex queries: Multi-hop reasoning
Code correctness: Syntax validation
Domain-specific: API patterns

Standard Procedure:

Query → Schema Pruning → Schema Encoding → Code Generation → Validation

7. Text Generation

Problem Description

Generate coherent text (story, dialogue, etc.)

Standard Solution Procedure

Phase 1: Model Selection

Option A: Autoregressive Language Models

- GPT-style models
- Predict next token given previous
- Examples: GPT-2, GPT-3, GPT-4

Option B: Encoder-Decoder

- T5, BART
- Encoder: Understand input
- Decoder: Generate output

Phase 2: Training

Language Modeling:

1. Pre-train on large corpus
2. Objective: Next token prediction
3. Loss: Cross-entropy
4. Training: Millions/billions of tokens

Fine-tuning:

1. Load pre-trained model
2. Fine-tune on task-specific data
3. Examples: Story generation, dialogue

Phase 3: Decoding Strategies

Greedy:

Always pick highest probability token
- Fast but repetitive
- Not diverse

Beam Search:

Keep top-K candidates at each step
- Usually higher-likelihood output than greedy
- Less diverse than sampling
- Slower

Sampling:

1. Top-k: Sample from top-k tokens
2. Top-p (nucleus): Sample from the smallest set of tokens whose cumulative probability reaches p
3. Temperature: Control randomness
   - Low temp: More deterministic
   - High temp: More random

Parameters:

- Temperature: 0.7-1.0 (common)
- Top-k: 50-100
- Top-p: 0.9-0.95
- Repetition penalty: 1.0-1.2
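
A sketch of how temperature, top-k, and top-p combine for a single decoding step. The input is a plain `{token_id: logit}` dict, an assumption made for illustration, and the repetition penalty is omitted.

```python
import math
import random

def sample_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Sample a token id after applying temperature, top-k, and top-p."""
    # Temperature: divide logits before the softmax (lower = sharper distribution).
    scaled = {t: l / temperature for t, l in logits.items()}
    peak = max(scaled.values())
    exp = {t: math.exp(l - peak) for t, l in scaled.items()}
    z = sum(exp.values())
    ranked = sorted(((p / z, t) for t, p in exp.items()), reverse=True)
    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, t in ranked:
        kept.append((p, t))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens = [t for _, t in kept]
    weights = [p for p, _ in kept]
    # random.choices renormalizes the remaining weights.
    return rng.choices(tokens, weights=weights, k=1)[0]
```

When one token dominates the distribution, top-p keeps only that token and the step becomes effectively greedy.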

Phase 4: Control and Conditioning

Prompt Engineering:

- System prompts
- Few-shot examples
- Instructions
- Format specifications

Conditional Generation:

- Control length
- Control style
- Control topic
- Control sentiment

Phase 5: Evaluation

Metrics:

BLEU: For translation-like tasks
ROUGE: For summarization
Perplexity: For language modeling
Human evaluation: Coherence, fluency, relevance

Challenges:

No single metric captures quality
Need human evaluation
Task-specific metrics

Industry Example: ChatGPT

Problem: Generate human-like conversations

Solution:

Model: GPT-3.5/GPT-4
Training: Pre-train + fine-tune + RLHF
Decoding: Temperature sampling
Control: System prompts, few-shot examples

8. Sentiment Analysis

Problem Description

Determine sentiment (positive, negative, neutral) of text

Standard Solution Procedure

Phase 1: Problem Types

Binary: Positive vs Negative
Multi-class: Positive, Negative, Neutral
Fine-grained: 1-5 stars, very positive to very negative

Phase 2: Approaches

Lexicon-based:

1. Sentiment dictionaries (positive/negative words)
2. Count positive/negative words
3. Score = positive_count - negative_count
4. Use: Fast, no training needed
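
A minimal sketch of the lexicon approach. The two word sets are tiny illustrative assumptions; real systems use curated sentiment dictionaries.

```python
# Tiny illustrative lexicons, not a real sentiment resource.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "boring"}

def lexicon_sentiment(text):
    """Score = positive_count - negative_count; the sign gives the label."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return score, label
```

Plain counting ignores negation: "Not good." scores as positive, which is one reason trained models usually outperform lexicons.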

ML-based:

1. Features: TF-IDF, embeddings
2. Model: Naive Bayes, SVM, Logistic Regression
3. Training: Supervised learning

Deep Learning:

1. LSTM/CNN with embeddings
2. Fine-tuned BERT
3. Better performance

Phase 3: Challenges

Sarcasm:

Problem: "This movie is so bad it's good"
Solution: Context understanding, BERT helps

Context:

Problem: "This movie is bad" (review vs description)
Solution: Use context, domain adaptation

Domain:

Problem: Sentiment varies by domain
Solution: Domain-specific training, transfer learning

Phase 4: Evaluation

Metrics:

Accuracy: Overall correctness
F1-score: Per class
Confusion matrix: Error analysis

Multi-class:

Macro F1: Average across classes
Weighted F1: Weighted by frequency

Industry Example: Tweet Sentiment Analysis

Problem: Analyze sentiment of tweets

Solution:

Data: Labeled tweets
Preprocessing: Handle hashtags, mentions, URLs
Model: Fine-tuned BERT
Deployment: Real-time API
Monitoring: Track accuracy, handle drift

9. Information Extraction

Problem Description

Extract structured information from unstructured text

Types

1. Named Entity Recognition (NER)

Extract entities (person, location, etc.)
See NER section above

2. Relation Extraction

Extract relationships between entities
Example: "John works at Google" → (John, works_at, Google)

3. Event Extraction

Extract events and participants
Example: "Apple acquired Beats" → (acquire, Apple, Beats)

Standard Solution Procedure

Relation Extraction

Phase 1: Data Format

Sentence: "John works at Google"
Entities: John (PER), Google (ORG)
Relation: works_at

Phase 2: Approaches

Supervised:

1. Labeled data: (sentence, entity1, entity2, relation)
2. Features: Words, POS tags, dependency parse
3. Model: SVM, Neural networks, BERT

Distant Supervision:

1. Use knowledge base (Freebase, Wikidata)
2. Automatically label sentences
3. Train on noisy labels
4. Use: When labeled data scarce
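
Distant supervision can be sketched as string matching against knowledge-base triples. Substring matching is a simplifying assumption (real pipelines match linked entities), and `distant_labels` is a hypothetical helper name.

```python
def distant_labels(sentences, kb_triples):
    """Label a sentence with a relation if it mentions both entities of a KB triple.

    The labels are noisy: co-occurrence does not guarantee that the sentence
    actually expresses the relation.
    """
    labeled = []
    for sent in sentences:
        for head, relation, tail in kb_triples:
            if head in sent and tail in sent:
                labeled.append((sent, head, tail, relation))
    return labeled
```

A sentence such as "John visited Google yesterday" would also receive the works_at label, which is exactly the noise the training step has to tolerate.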

BERT-based:

1. Input: [CLS] entity1 [SEP] entity2 [SEP] sentence [SEP] (one option; entities can also be marked inline with special tokens)
2. Fine-tune for relation classification
3. Use: State-of-the-art

Phase 3: Evaluation

Metrics:

Precision, Recall, F1: Per relation type
Strict: Both entities and relation correct
Partial: Partial credit

Industry Example: Knowledge Graph Construction

Problem: Build knowledge graph from text

Solution:

NER: Extract entities
Relation Extraction: Extract relations
Linking: Link to knowledge base
Validation: Verify facts
Graph: Build knowledge graph

10. Dialogue Systems

Problem Description

Build conversational AI systems (chatbots, assistants)

Types

Task-oriented:

Specific goal (booking, ordering)
Structured, limited domain

Open-domain:

General conversation
No specific goal
More challenging

Standard Solution Procedure

Task-Oriented Dialogue

Components:

1. Natural Language Understanding (NLU)
   - Intent classification
   - Slot filling (entity extraction)
   
2. Dialogue State Tracking
   - Track conversation state
   - Update based on user input
   
3. Dialogue Policy
   - Decide next action
   - Based on current state
   
4. Natural Language Generation (NLG)
   - Generate response
   - Template-based or neural

Pipeline:

User Input → NLU → State Tracking → Policy → NLG → Response
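
The pipeline can be sketched end to end for a toy table-booking domain. Every component below is a deliberately simple stand-in (keyword intents, regex slots, rule-based policy, template NLG) chosen for illustration. All names are hypothetical, and the walrus operator requires Python 3.8+.

```python
import re

def nlu(utterance):
    """Keyword intent classifier plus regex slot filling (toy stand-ins)."""
    text = utterance.lower()
    intent = "book_table" if "book" in text or "table" in text else "unknown"
    slots = {}
    if m := re.search(r"for (\d+)", text):
        slots["party_size"] = int(m.group(1))
    if m := re.search(r"at (\d{1,2}(?::\d{2})?\s?(?:am|pm))", text):
        slots["time"] = m.group(1)
    return intent, slots

def update_state(state, intent, slots):
    """Dialogue state tracking: merge new slots into the running state."""
    new_state = dict(state, **slots)
    if intent != "unknown":
        new_state["intent"] = intent
    return new_state

REQUIRED = ("party_size", "time")

def policy(state):
    """Ask for the first missing slot, otherwise confirm."""
    for slot in REQUIRED:
        if slot not in state:
            return ("request", slot)
    return ("confirm", None)

TEMPLATES = {
    ("request", "party_size"): "How many people?",
    ("request", "time"): "What time would you like?",
}

def nlg(action, state):
    """Template-based generation."""
    if action[0] == "confirm":
        return f"Booking a table for {state['party_size']} at {state['time']}."
    return TEMPLATES[action]

def respond(state, utterance):
    state = update_state(state, *nlu(utterance))
    return state, nlg(policy(state), state)
```

Because the state is carried between turns, a follow-up such as "at 7pm" completes the booking even though it contains no intent keyword.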

Training:

1. Intent classification: Multi-class classification
2. Slot filling: Sequence labeling (NER)
3. State tracking: State update model
4. Policy: Reinforcement learning or supervised
5. NLG: Template or neural generation

Open-Domain Dialogue

Approaches:

Retrieval-based:

1. Store response candidates
2. Match user input to candidates
3. Return best match
4. Use: Simple, controllable

Generation-based:

1. Train language model on dialogues
2. Generate response
3. Use: More flexible, can be inconsistent

Hybrid:

1. Generate multiple candidates
2. Retrieve similar responses
3. Rank and select best
4. Use: Best of both

Modern Approach:

1. Fine-tune large language model (GPT-3.5, Claude)
2. Few-shot learning
3. Instruction tuning
4. RLHF for alignment

Industry Example: Customer Support Chatbot

Problem: Handle customer inquiries

Solution:

NLU: Intent + entities
Knowledge Base: FAQ, documentation
Retrieval: Find relevant answers
Generation: Generate response
Fallback: Human agent if needed

Summary: Standard Procedures

Common Patterns:

Data Preparation: Preprocessing, splitting, handling imbalance
Feature Extraction: Traditional (TF-IDF) or embeddings
Model Selection: Based on data size, task complexity
Training: Supervised learning, fine-tuning
Evaluation: Task-specific metrics
Deployment: API, monitoring, A/B testing

Key Principles:

Start simple, iterate
Use pre-trained models when possible
Evaluate with multiple metrics
Monitor in production
Handle edge cases

Industry Best Practices:

Pre-trained models (BERT, GPT)
Fine-tuning for specific tasks
Hybrid approaches (traditional + neural)
Evaluation and monitoring
Production considerations (latency, cost)

ML & LLM Interview Prep — Deep Dives