Lesson 7: Regular Expressions - Advanced Patterns & Groups
Learning Objectives
After completing this lesson, you will be able to:
- ✅ Use capturing groups
- ✅ Understand backreferences
- ✅ Apply lookahead and lookbehind
- ✅ Use regex flags
- ✅ Build advanced validation patterns
- ✅ Perform text substitution
Capturing Groups
Groups let you extract and reuse parts of the matched text.
Basic Groups
import re

# Parentheses create groups
text = "John Doe, age 30"
pattern = r'(\w+) (\w+), age (\d+)'

match = re.search(pattern, text)
if match:
    print(match.group(0))   # John Doe, age 30 (entire match)
    print(match.group(1))   # John (first group)
    print(match.group(2))   # Doe (second group)
    print(match.group(3))   # 30 (third group)

    # All groups at once
    print(match.groups())   # ('John', 'Doe', '30')

# Extract multiple matches
text2 = "Email: john@example.com, Phone: 123-456-7890"
pattern2 = r'(\w+): ([\w.@-]+)'

matches = re.findall(pattern2, text2)
print(matches)  # [('Email', 'john@example.com'), ('Phone', '123-456-7890')]
Named Groups
import re

# Named groups with (?P<name>...)
text = "2025-10-27"
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

match = re.search(pattern, text)
if match:
    print(match.group('year'))   # 2025
    print(match.group('month'))  # 10
    print(match.group('day'))    # 27

    # As a dictionary
    print(match.groupdict())  # {'year': '2025', 'month': '10', 'day': '27'}

# Parse log entries
log_entry = "2025-10-27 10:30:00 ERROR Database connection failed"
log_pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.*)'

match = re.search(log_pattern, log_entry)
if match:
    log_data = match.groupdict()
    print(log_data)
    # {'date': '2025-10-27', 'time': '10:30:00', 'level': 'ERROR', 'message': 'Database connection failed'}
Non-capturing Groups
import re

# Non-capturing group with (?:...)
# Use it when you need grouping but don't need to extract the group

# Without non-capturing
text = "http://example.com"
pattern1 = r'(http|https)://(\w+)\.(\w+)'
match1 = re.search(pattern1, text)
print(match1.groups())  # ('http', 'example', 'com')

# With a non-capturing group for the protocol
pattern2 = r'(?:http|https)://(\w+)\.(\w+)'
match2 = re.search(pattern2, text)
print(match2.groups())  # ('example', 'com') - protocol not captured

# Example: extract the domain without the protocol
urls = [
    "http://google.com",
    "https://github.com",
    "https://python.org"
]

pattern = r'(?:https?://)(\w+\.\w+)'
for url in urls:
    match = re.search(pattern, url)
    if match:
        print(match.group(1))
# google.com
# github.com
# python.org
Backreferences
Backreferences let you refer back to captured groups, either inside the pattern itself or in the replacement string.
Backreferences in Pattern
import re

# \1, \2, etc. refer to captured groups

# Find repeated words
text = "Hello hello world world"
pattern = r'\b(\w+)\s+\1\b'  # \1 refers to the first group

matches = re.findall(pattern, text, re.IGNORECASE)
print(matches)  # ['Hello', 'world'] (findall returns the captured group, which keeps its original case)

# Find HTML tags with a matching closing tag
html = "<div>content</div><span>text</span><p>invalid</div>"
pattern = r'<(\w+)>.*?</\1>'  # \1 must match the opening tag

tags = re.findall(pattern, html)
print(tags)  # ['div', 'span'] (not 'p' because its closing tag is 'div')

# Find duplicate consecutive characters
text2 = "bookkeeper mississippi"
pattern2 = r'(\w)\1+'  # A character followed by itself

for match in re.finditer(pattern2, text2):
    print(f"Found '{match.group()}' at position {match.start()}")
# Found 'oo' at position 1
# Found 'kk' at position 3
# Found 'ee' at position 5
# Found 'ss' at position 13
# Found 'ss' at position 16
# Found 'pp' at position 19
Backreferences in Replacement
import re

# Use \1, \2 or \g<1>, \g<2> in the replacement
text = "John Doe"
pattern = r'(\w+) (\w+)'
result = re.sub(pattern, r'\2, \1', text)
print(result)  # Doe, John

# Format phone numbers
phones = ["1234567890", "9876543210"]
pattern = r'(\d{3})(\d{3})(\d{4})'
for phone in phones:
    formatted = re.sub(pattern, r'(\1) \2-\3', phone)
    print(formatted)
# (123) 456-7890
# (987) 654-3210

# Named group backreferences with \g<name>
text2 = "2025-10-27"
pattern2 = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
result2 = re.sub(pattern2, r'\g<day>/\g<month>/\g<year>', text2)
print(result2)  # 27/10/2025
Lookahead and Lookbehind
Lookaround assertions match a position in the text without consuming any characters.
Positive Lookahead (?=...)
import re

# Match only if followed by the pattern
text = "Python3 Java11 Go"

# Find language names followed by a version number
# (restrict to letters: \w would also swallow the digits)
pattern = r'[A-Za-z]+(?=\d+)'
matches = re.findall(pattern, text)
print(matches)  # ['Python', 'Java']

# Password must contain a digit (lookahead)
def has_digit(password):
    return bool(re.search(r'(?=.*\d)', password))

print(has_digit('password'))   # False
print(has_digit('password1'))  # True

# A number followed by whitespace, a comma, a period, or end of string
text2 = "Price: $100, Cost: $50, Total: 150"
pattern2 = r'\d+(?=\s|,|\.|$)'
print(re.findall(pattern2, text2))  # ['100', '50', '150']

# Only the dollar amounts, without consuming the $ (lookbehind, covered below)
for match in re.finditer(r'(?<=\$)\d+', text2):
    print(match.group())
# 100
# 50
Negative Lookahead (?!...)
import re

# Match only if NOT followed by the pattern
text = "Python3 Java11 Go Ruby"

# Careful: \w also matches digits, so the greedy \w+ swallows the version
# number and the negative lookahead still succeeds
pattern = r'\w+(?!\d)'
matches = re.findall(pattern, text)
print(matches)  # ['Python3', 'Java11', 'Go', 'Ruby']

# To find names WITHOUT a version, assert "no digit anywhere in this word"
# at the word boundary
pattern = r'\b(?!\w*\d)\w+\b'
matches = re.findall(pattern, text)
print(matches)  # ['Go', 'Ruby']

# Same idea: words that contain no digits
text2 = "test123 hello world456 python"
pattern2 = r'\b(?!\w*\d)\w+\b'
matches2 = re.findall(pattern2, text2)
print(matches2)  # ['hello', 'python']

# Password must NOT contain the username
def password_not_contains_username(username, password):
    pattern = f'(?!.*{re.escape(username)})'
    return bool(re.match(pattern, password, re.IGNORECASE))

print(password_not_contains_username('john', 'password123'))  # True
print(password_not_contains_username('john', 'john123'))      # False
Positive Lookbehind (?<=...)
import re

# Match only if preceded by the pattern
text = "Price: $100, Cost: €50, Value: ¥200"

# Find numbers preceded by $
pattern = r'(?<=\$)\d+'
dollars = re.findall(pattern, text)
print(dollars)  # ['100']

# Find numbers preceded by any currency symbol
pattern2 = r'(?<=[\$€¥])\d+'
amounts = re.findall(pattern2, text)
print(amounts)  # ['100', '50', '200']

# Extract file extensions
files = "image.jpg document.pdf script.py data.csv"
pattern3 = r'(?<=\.)\w+'
extensions = re.findall(pattern3, files)
print(extensions)  # ['jpg', 'pdf', 'py', 'csv']
Negative Lookbehind (?<!...)
import re

# Match only if NOT preceded by the pattern
text = "$100 €50 100 200"

# Find numbers NOT preceded by a currency symbol
pattern = r'(?<![\$€])\b\d+\b'
matches = re.findall(pattern, text)
print(matches)  # ['100', '200']

# Extract words not preceded by a hashtag
text2 = "python #coding #programming language"
pattern2 = r'(?<!#)\b\w+\b'
words = re.findall(pattern2, text2)
print(words)  # ['python', 'language']

# Extract standalone numbers (not part of an identifier)
text3 = "var1 = 10; var2 = 20; result = 30"
pattern3 = r'(?<![a-zA-Z_])\d+(?![a-zA-Z_])'
numbers = re.findall(pattern3, text3)
print(numbers)  # ['10', '20', '30']
Combined Assertions
import re

# Password validation with multiple lookaheads
def validate_strong_password(password):
    """
    Password must:
    - Be 8-20 characters
    - Contain at least one lowercase letter
    - Contain at least one uppercase letter
    - Contain at least one digit
    - Contain at least one special character
    """
    pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,20}$'
    return bool(re.match(pattern, password))

passwords = [
    "weak",             # False - too short, missing requirements
    "WeakPassword",     # False - no digit, no special char
    "Strong1",          # False - no special char
    "Strong1!",         # True - meets all requirements
    "VeryStr0ng!Pass"   # True - meets all requirements
]

for pwd in passwords:
    print(f"{pwd}: {validate_strong_password(pwd)}")
Regex Flags
Flags modify regex behavior.
Common Flags
import re

text = "Python PYTHON python"

# re.IGNORECASE (re.I) - case-insensitive
matches = re.findall(r'python', text, re.IGNORECASE)
print(matches)  # ['Python', 'PYTHON', 'python']

# re.MULTILINE (re.M) - ^ and $ match at line boundaries
text2 = """line 1: start
line 2: middle
line 3: end"""

# Without MULTILINE - ^ matches only the start of the string
matches2 = re.findall(r'^line', text2)
print(matches2)  # ['line']

# With MULTILINE - ^ matches the start of each line
matches3 = re.findall(r'^line', text2, re.MULTILINE)
print(matches3)  # ['line', 'line', 'line']

# re.DOTALL (re.S) - . matches newlines too
text3 = "Hello\nWorld"

# Without DOTALL - . doesn't match a newline
match1 = re.search(r'Hello.World', text3)
print(match1)  # None

# With DOTALL - . matches the newline
match2 = re.search(r'Hello.World', text3, re.DOTALL)
print(match2.group())  # Hello\nWorld

# re.VERBOSE (re.X) - allow comments and whitespace in the pattern
pattern = r'''
    ^                       # Start of string
    (?P<protocol>https?)    # HTTP or HTTPS
    ://                     # Separator
    (?P<domain>[\w.]+)      # Domain name
    (?P<path>/[\w./]*)?     # Optional path
    $                       # End of string
'''
url = "https://example.com/path"
match = re.search(pattern, url, re.VERBOSE)
if match:
    print(match.groupdict())
# {'protocol': 'https', 'domain': 'example.com', 'path': '/path'}
Combining Flags
import re

# Combine flags with | (bitwise OR)
text = """EMAIL: admin@example.com
email: user@example.com
EmAiL: test@example.com"""

# Case-insensitive + multiline
pattern = r'^email:\s*(\S+)$'
matches = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)
print(matches)
# ['admin@example.com', 'user@example.com', 'test@example.com']

# Inline flags with (?imsx)
pattern2 = r'(?im)^email:\s*(\S+)$'  # Same as above
matches2 = re.findall(pattern2, text)
print(matches2)
# ['admin@example.com', 'user@example.com', 'test@example.com']
Advanced Text Substitution
sub() and subn()
import re

# Simple substitution
text = "Hello World"
result = re.sub(r'World', 'Python', text)
print(result)  # Hello Python

# subn() returns a tuple (result, count)
text2 = "cat dog cat bird cat"
result2, count = re.subn(r'cat', 'mouse', text2)
print(result2)  # mouse dog mouse bird mouse
print(count)    # 3

# Substitution with a function
def uppercase_match(match):
    return match.group().upper()

text3 = "hello world python"
result3 = re.sub(r'\w+', uppercase_match, text3)
print(result3)  # HELLO WORLD PYTHON

# Advanced: calculate inside the replacement
text4 = "The price is $50 and $30"

def add_tax(match):
    amount = float(match.group(1))
    with_tax = amount * 1.1  # 10% tax
    return f"${with_tax:.2f}"

result4 = re.sub(r'\$(\d+)', add_tax, text4)
print(result4)  # The price is $55.00 and $33.00
Conditional Replacement
import re

# Replace based on a condition
def smart_replace(text):
    """Replace 'color' with 'colour' only in a British context."""
    def replacer(match):
        word = match.group()
        # Check the surrounding context
        if 'British' in text or 'UK' in text:
            return word.replace('color', 'colour')
        return word

    return re.sub(r'\bcolor\w*\b', replacer, text)

text1 = "I like the color red. British English."
print(smart_replace(text1))  # I like the colour red. British English.

text2 = "I like the color red. American English."
print(smart_replace(text2))  # I like the color red. American English.
Real-world Examples
1. SQL Injection Prevention
import re

def sanitize_sql_input(user_input):
    """Remove potentially dangerous SQL keywords and characters.

    Note: this only demonstrates re.sub; real applications should rely on
    parameterized queries rather than input stripping.
    """
    dangerous_patterns = [
        r'\b(DROP|DELETE|INSERT|UPDATE|EXEC|EXECUTE)\b',
        r'[;\'\"\\]',
        r'--',
        r'/\*.*?\*/',
    ]

    sanitized = user_input
    for pattern in dangerous_patterns:
        sanitized = re.sub(pattern, '', sanitized, flags=re.IGNORECASE)

    return sanitized.strip()

# Test
inputs = [
    "John Doe",
    "Robert'); DROP TABLE users;--",
    "admin' OR '1'='1",
]

for inp in inputs:
    print(f"Input: {inp}")
    print(f"Sanitized: {sanitize_sql_input(inp)}")
    print()
2. Markdown to HTML Converter
import re

def markdown_to_html(markdown):
    """Convert basic Markdown to HTML."""
    html = markdown

    # Headers (# to ######)
    for i in range(6, 0, -1):
        pattern = r'^' + '#' * i + r'\s+(.+)$'
        replacement = rf'<h{i}>\1</h{i}>'
        html = re.sub(pattern, replacement, html, flags=re.MULTILINE)

    # Bold (**text** or __text__)
    html = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', html)
    html = re.sub(r'__(.+?)__', r'<strong>\1</strong>', html)

    # Italic (*text* or _text_)
    html = re.sub(r'\*(.+?)\*', r'<em>\1</em>', html)
    html = re.sub(r'_(.+?)_', r'<em>\1</em>', html)

    # Images ![alt](url) - must run before links so the leading ! is consumed
    html = re.sub(r'!\[([^\]]*)\]\(([^)]+)\)', r'<img src="\2" alt="\1">', html)

    # Links [text](url)
    html = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'<a href="\2">\1</a>', html)

    # Code `code`
    html = re.sub(r'`([^`]+)`', r'<code>\1</code>', html)

    return html

markdown = """# Title
This is **bold** and *italic* text.
Check out [Python](https://python.org)!"""

print(markdown_to_html(markdown))
3. Email Template Engine
import re

def render_template(template, context):
    """Render a template with {{variable}} placeholders."""
    def replacer(match):
        var_name = match.group(1).strip()

        # Support nested access like user.name
        keys = var_name.split('.')
        value = context

        try:
            for key in keys:
                value = value[key]
            return str(value)
        except (KeyError, TypeError):
            return match.group(0)  # Keep the original if not found

    return re.sub(r'\{\{\s*([^}]+)\s*\}\}', replacer, template)

# Test
template = """Hello {{user.name}},

Your order #{{order.id}} for {{order.product}} has been {{order.status}}.

Total: ${{order.total}}

Thank you!"""

context = {
    'user': {'name': 'John Doe'},
    'order': {
        'id': '12345',
        'product': 'Python Book',
        'status': 'shipped',
        'total': '29.99'
    }
}

print(render_template(template, context))
4. URL Slug Generator
import re

def generate_slug(title):
    """Convert a title to a URL-friendly slug."""
    # Convert to lowercase
    slug = title.lower()

    # Drop special characters, then turn spaces/underscores into hyphens
    slug = re.sub(r'[^\w\s-]', '', slug)
    slug = re.sub(r'[\s_]+', '-', slug)

    # Remove leading/trailing hyphens
    slug = re.sub(r'^-+|-+$', '', slug)

    # Collapse duplicate hyphens
    slug = re.sub(r'-+', '-', slug)

    return slug

titles = [
    "Hello World!",
    "Python Programming: A Complete Guide",
    "10 Tips & Tricks for Django",
    " Spaces Everywhere ",
]

for title in titles:
    print(f"{title} -> {generate_slug(title)}")
# Hello World! -> hello-world
# Python Programming: A Complete Guide -> python-programming-a-complete-guide
# 10 Tips & Tricks for Django -> 10-tips-tricks-for-django
#  Spaces Everywhere  -> spaces-everywhere
5. Log Analyzer
import re
from collections import defaultdict

def analyze_logs(log_text):
    """Analyze a log file for errors and statistics."""
    # Pattern for log lines
    pattern = r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.*)'

    stats = {
        'total': 0,
        'by_level': defaultdict(int),
        'errors': [],
    }

    for line in log_text.split('\n'):
        match = re.search(pattern, line)
        if match:
            stats['total'] += 1
            level = match.group('level')
            stats['by_level'][level] += 1

            if level in ['ERROR', 'CRITICAL']:
                stats['errors'].append({
                    'timestamp': match.group('timestamp'),
                    'level': level,
                    'message': match.group('message')
                })

    return stats

log_text = """2025-10-27 10:00:00 INFO Server started
2025-10-27 10:01:00 DEBUG Processing request
2025-10-27 10:02:00 ERROR Database connection failed
2025-10-27 10:03:00 INFO Request completed
2025-10-27 10:04:00 CRITICAL System crash"""

stats = analyze_logs(log_text)
print(f"Total logs: {stats['total']}")
print(f"By level: {dict(stats['by_level'])}")
print(f"Errors found: {len(stats['errors'])}")
for error in stats['errors']:
    print(f"  [{error['timestamp']}] {error['level']}: {error['message']}")
Performance Tips
import re
import time

# 1. Compile patterns for reuse
pattern = re.compile(r'\d+')

text = "Test 123 456 789" * 1000

# Without compilation
start = time.time()
for _ in range(1000):
    re.findall(r'\d+', text)
no_compile_time = time.time() - start

# With compilation
start = time.time()
for _ in range(1000):
    pattern.findall(text)
compile_time = time.time() - start

print(f"Without compile: {no_compile_time:.4f}s")
print(f"With compile: {compile_time:.4f}s")
print(f"Speedup: {no_compile_time/compile_time:.2f}x")
# (re caches recently used patterns internally, so the measured gap is usually small)

# 2. Use non-capturing groups when possible
#    (?:...) is faster than (...)

# 3. Anchor patterns when possible
#    ^pattern$ is faster than pattern

# 4. Use specific patterns instead of a greedy .+
#    \w+ is faster than .+ for word matching

# 5. Avoid catastrophic backtracking
#    Bad:  r'(a+)+b' with input 'aaaaaaaaaa'
#    Good: r'a+b'
Best Practices
import re

# 1. Use raw strings
pattern = r'\d+'     # Good
# pattern = '\\d+'   # Bad - double escaping is harder to read

# 2. Compile patterns you reuse
EMAIL_RE = re.compile(r'^[\w\.-]+@[\w\.-]+\.\w+$')

# 3. Use named groups for clarity
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

# 4. Document complex patterns
PHONE_RE = re.compile(r'''
    ^               # Start
    \(?\d{3}\)?     # Area code (optional parentheses)
    [-.\s]?         # Optional separator
    \d{3}           # Exchange
    [-.\s]?         # Optional separator
    \d{4}           # Number
    $               # End
''', re.VERBOSE)

# 5. Guard against pathological input
#    (the standard re module has no timeout option, so cap the input size
#     and avoid backtracking-prone patterns)
def safe_search(pattern, text, max_length=100_000):
    """Search only a bounded amount of text."""
    return re.search(pattern, text[:max_length])

# 6. Use specific character classes
# Good: r'\w+@\w+\.\w+'
# Bad:  r'.+@.+\..+'

# 7. Test thoroughly
def test_email_validation():
    EMAIL_RE = re.compile(r'^[\w\.-]+@[\w\.-]+\.\w{2,}$')

    valid = [
        'user@example.com',
        'first.last@sub.example.co.uk',
        'admin@my-site.org'
    ]

    invalid = [
        'invalid',
        '@example.com',
        'user@',
        'user@domain',
    ]

    for email in valid:
        assert EMAIL_RE.match(email), f"Should match: {email}"

    for email in invalid:
        assert not EMAIL_RE.match(email), f"Should not match: {email}"

    print("All tests passed!")

test_email_validation()
Practice Exercises
Exercise 1: Advanced Email Validator
Build an email validator (a starter sketch follows the list) that can:
- Support subdomains
- Validate TLD length
- Check for consecutive dots
- Use named groups for the parts
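If you want a starting point, here is a minimal sketch; the exact rules, group names, and sample addresses are assumptions for you to refine:
import re

# Hypothetical starter: named groups for the parts, a 2-6 character TLD,
# and a negative lookahead that rejects consecutive dots.
EMAIL_RE = re.compile(r'''
    ^
    (?!.*\.\.)                       # no consecutive dots anywhere
    (?P<local>[\w.+-]+)              # local part
    @
    (?P<domain>[\w-]+(?:\.[\w-]+)*)  # domain, optional subdomains
    \.
    (?P<tld>[A-Za-z]{2,6})           # top-level domain, length-checked
    $
''', re.VERBOSE)

for candidate in ["user@mail.example.com", "bad..dots@example.com", "user@example.toolongtld"]:
    match = EMAIL_RE.match(candidate)
    print(candidate, "->", match.groupdict() if match else None)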
Exercise 2: XML/HTML Parser
Extract attributes from tags (a starter sketch follows the list):
- Parse <tag attr="value"> syntax
- Handle single and double quotes
- Extract the tag name and all attributes
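A possible starting point, as a minimal sketch (the simplified tag/attribute patterns and the sample HTML are assumptions, not a full parser):
import re

TAG_RE = re.compile(r'<(?P<name>\w+)(?P<attrs>[^>]*)>')
# Attribute values in single or double quotes; \2 makes the closing quote match the opening one
ATTR_RE = re.compile(r'(?P<attr>[\w-]+)\s*=\s*(["\'])(?P<value>.*?)\2')

html = '<input type="text" name=\'user\' id="login">'
tag = TAG_RE.search(html)
if tag:
    print("tag:", tag.group('name'))
    for m in ATTR_RE.finditer(tag.group('attrs')):
        print(" ", m.group('attr'), "=", m.group('value'))
# tag: input
#   type = text
#   name = user
#   id = login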
Exercise 3: Natural Language Parser
Parse dates from text (a starter sketch follows the list):
- "Oct 27, 2025"
- "27/10/2025"
- "2025-10-27"
- "next Monday"
Exercise 4: Code Formatter
Format Python code (a starter sketch follows the list):
- Fix indentation
- Remove trailing spaces
- Normalize line endings
- Add missing spaces around operators
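A minimal starter sketch for the simpler rules (the operator rule is deliberately naive and assumes it never touches strings or comments):
import re

def format_code(source):
    # Normalize line endings
    source = source.replace('\r\n', '\n').replace('\r', '\n')
    # Remove trailing spaces and tabs on each line
    source = re.sub(r'[ \t]+$', '', source, flags=re.MULTILINE)
    # Convert tabs used for indentation into 4 spaces
    source = re.sub(r'^\t+', lambda m: '    ' * len(m.group()), source, flags=re.MULTILINE)
    # Add missing spaces around a few operators (naive: ignores strings/comments)
    source = re.sub(r'\s*(==|!=|<=|>=|[+\-*/=<>])\s*', r' \1 ', source)
    return source

messy = "x=1\t \ny\t=\tx+2   \nif x<=y:\n\tprint(x)"
print(format_code(messy))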
Exercise 5: Data Extractor
Extract structured data from unstructured text (a starter sketch follows the list):
- Names (First Last)
- Addresses (Street, City, ZIP)
- Multiple formats
- Validation
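A minimal starter sketch that reuses lookbehinds for context (the labels, sample text, and US-style ZIP format are assumptions for illustration):
import re

# Names and addresses are anchored to a preceding label via lookbehind
NAME_RE = re.compile(r'(?<=Customer: )(?P<first>[A-Z][a-z]+) (?P<last>[A-Z][a-z]+)')
ADDRESS_RE = re.compile(r'(?<=Address: )(?P<street>[^,]+), (?P<city>[^,]+), (?P<zip>\d{5})')

text = """Customer: Jane Smith
Address: 742 Evergreen Terrace, Springfield, 49007
Customer: John Doe
Address: 12 Main Street, Portland, 97035"""

print("Names:", [m.groupdict() for m in NAME_RE.finditer(text)])
print("Addresses:", [m.groupdict() for m in ADDRESS_RE.finditer(text)])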
Summary
✅ Capturing Groups: (...), named groups (?P<name>...)
✅ Non-capturing: (?:...)
✅ Lookahead: (?=...) positive, (?!...) negative
✅ Lookbehind: (?<=...) positive, (?<!...) negative
✅ Flags: re.I, re.M, re.S, re.X, combine with |
✅ Substitution: re.sub(), re.subn(), function replacers
✅ Named groups: Clean code with groupdict()
✅ Performance: Compile patterns, use specific patterns
Next Lesson
Remember:
- Use named groups for readability
- Test patterns thoroughly
- Compile for performance
- Avoid catastrophic backtracking (see the sketch below)
- Document complex patterns! 🎯
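On the last point, here is a minimal sketch contrasting a backtracking-prone pattern with a flat rewrite; the pathological call is left commented out on purpose:
import re

# Nested quantifiers such as (a+)+ force the engine to try every way of
# splitting a long run of 'a's when the final 'b' is missing.
risky = re.compile(r'(a+)+b')
safe = re.compile(r'a+b')

# Both accept the same well-formed input...
print(bool(risky.fullmatch('aaaab')), bool(safe.fullmatch('aaaab')))  # True True

# ...but only the flat version fails fast on a long non-matching input.
print(bool(safe.fullmatch('a' * 100_000)))  # False, returns almost instantly
# risky.fullmatch('a' * 30)  # avoid: exponential backtracking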