Lesson 7: Regular Expressions - Advanced Patterns & Groups

Learning Objectives

After completing this lesson, you will be able to:

  • ✅ Use capturing groups
  • ✅ Understand backreferences
  • ✅ Apply lookahead and lookbehind
  • ✅ Use regex flags
  • ✅ Build advanced validation patterns
  • ✅ Perform text substitution

Capturing Groups

Groups let you extract and reuse parts of the matched text.

Basic Groups

import re

# Parentheses create groups
text = "John Doe, age 30"
pattern = r'(\w+) (\w+), age (\d+)'

match = re.search(pattern, text)
if match:
    print(match.group(0))  # John Doe, age 30 (entire match)
    print(match.group(1))  # John (first group)
    print(match.group(2))  # Doe (second group)
    print(match.group(3))  # 30 (third group)

    # All groups at once
    print(match.groups())  # ('John', 'Doe', '30')

# Extract multiple matches
text2 = "Email: john@example.com, Phone: 123-456-7890"
pattern2 = r'(\w+): ([\w.@-]+)'

matches = re.findall(pattern2, text2)
print(matches)  # [('Email', 'john@example.com'), ('Phone', '123-456-7890')]

Named Groups

import re

# Named groups with (?P<name>...)
text = "2025-10-27"
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

match = re.search(pattern, text)
if match:
    print(match.group('year'))   # 2025
    print(match.group('month'))  # 10
    print(match.group('day'))    # 27

    # As a dictionary
    print(match.groupdict())
    # {'year': '2025', 'month': '10', 'day': '27'}

# Parse log entries
log_entry = "2025-10-27 10:30:00 ERROR Database connection failed"
log_pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.*)'

match = re.search(log_pattern, log_entry)
if match:
    log_data = match.groupdict()
    print(log_data)
    # {'date': '2025-10-27', 'time': '10:30:00', 'level': 'ERROR', 'message': 'Database connection failed'}

Non-capturing Groups

import re

# Non-capturing group with (?:...)
# Use when you need grouping but don't need to extract

# Without non-capturing
text = "http://example.com"
pattern1 = r'(http|https)://(\w+)\.(\w+)'
match1 = re.search(pattern1, text)
print(match1.groups())  # ('http', 'example', 'com')

# With a non-capturing group for the protocol
pattern2 = r'(?:http|https)://(\w+)\.(\w+)'
match2 = re.search(pattern2, text)
print(match2.groups())  # ('example', 'com') - protocol not captured

# Example: Extract the domain without the protocol
urls = [
    "http://google.com",
    "https://github.com",
    "https://python.org"
]

pattern = r'(?:https?://)(\w+\.\w+)'
for url in urls:
    match = re.search(pattern, url)
    if match:
        print(match.group(1))
# google.com
# github.com
# python.org

Backreferences

Backreferences let you refer to captured groups, either inside the pattern itself or in the replacement string.

Backreferences in Pattern

import re

# \1, \2, etc. refer to captured groups
# Find repeated words
text = "Hello hello world world"
pattern = r'\b(\w+)\s+\1\b'  # \1 refers to the first group

matches = re.findall(pattern, text, re.IGNORECASE)
print(matches)  # ['Hello', 'world']

# Find HTML tags with a matching closing tag
html = "<div>content</div><span>text</span><p>invalid</div>"
pattern = r'<(\w+)>.*?</\1>'  # \1 must match the opening tag

tags = re.findall(pattern, html)
print(tags)  # ['div', 'span'] ('p' is skipped because its closing tag is '</div>')

# Find duplicate consecutive characters
text2 = "bookkeeper mississippi"
pattern2 = r'(\w)\1+'  # A character followed by itself

for match in re.finditer(pattern2, text2):
    print(f"Found '{match.group()}' at position {match.start()}")
# Found 'oo' at position 1
# Found 'kk' at position 3
# Found 'ee' at position 5
# Found 'ss' at position 13
# Found 'ss' at position 16
# Found 'pp' at position 19
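
Named groups can be backreferenced inside the pattern as well, using (?P=name). A small sketch (not part of the original examples) repeating the duplicate-word search with a named group:

import re

# (?P=name) refers back to a named group within the same pattern
text = "Hello hello world world"
pattern = r'\b(?P<word>\w+)\s+(?P=word)\b'

print(re.findall(pattern, text, re.IGNORECASE))  # ['Hello', 'world']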

Backreferences in Replacement

import re

# Use \1, \2 or \g<1>, \g<2> in the replacement
text = "John Doe"
pattern = r'(\w+) (\w+)'
result = re.sub(pattern, r'\2, \1', text)
print(result)  # Doe, John

# Format phone numbers
phones = ["1234567890", "9876543210"]
pattern = r'(\d{3})(\d{3})(\d{4})'
for phone in phones:
    formatted = re.sub(pattern, r'(\1) \2-\3', phone)
    print(formatted)
# (123) 456-7890
# (987) 654-3210

# Named group backreferences with \g<name>
text2 = "2025-10-27"
pattern2 = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
result2 = re.sub(pattern2, r'\g<day>/\g<month>/\g<year>', text2)
print(result2)  # 27/10/2025

Lookahead and Lookbehind

Lookaround assertions match at a position in the text without consuming any characters.

Positive Lookahead (?=...)

import re

# Match only if followed by a pattern
text = "Python3 Java11 Go"

# Find language names that are followed by a version number
pattern = r'[A-Za-z]+(?=\d)'  # letters only, so the digits are not consumed
matches = re.findall(pattern, text)
print(matches)  # ['Python', 'Java']

# Password must contain a digit (lookahead)
def has_digit(password):
    return bool(re.search(r'(?=.*\d)', password))

print(has_digit('password'))   # False
print(has_digit('password1'))  # True

# Numbers followed by a space, comma, period, or end of string
text2 = "Price: $100, Cost: $50, Total: 150"
pattern2 = r'\d+(?=\s|,|\.|$)'
print(re.findall(pattern2, text2))  # ['100', '50', '150']

# Preview of lookbehind: only the amounts preceded by $
for match in re.finditer(r'(?<=\$)\d+', text2):
    print(match.group())
# 100
# 50

Negative Lookahead (?!...)

import re

# Match only if NOT followed by a pattern
text = "Python3 Java11 Go Ruby"

# Find language names NOT followed by a version number
# Note: r'\w+(?!\d)' would not work here, because \w+ also matches the digits
# ("Python3" would still match in full). Restrict the match to letters and
# require a word boundary instead.
pattern = r'\b[A-Za-z]+\b(?!\d)'
matches = re.findall(pattern, text)
print(matches)  # ['Go', 'Ruby']

# Same idea: words not immediately followed by digits
text2 = "test123 hello world456 python"
matches2 = re.findall(r'\b[A-Za-z]+\b(?!\d)', text2)
print(matches2)  # ['hello', 'python']

# Password must NOT contain the username
def password_not_contains_username(username, password):
    pattern = f'(?!.*{re.escape(username)})'
    return bool(re.match(pattern, password, re.IGNORECASE))

print(password_not_contains_username('john', 'password123'))  # True
print(password_not_contains_username('john', 'john123'))      # False

Positive Lookbehind (?<=...)

import re

# Match only if preceded by a pattern
text = "Price: $100, Cost: €50, Value: ¥200"

# Find numbers preceded by $
pattern = r'(?<=\$)\d+'
dollars = re.findall(pattern, text)
print(dollars)  # ['100']

# Find numbers preceded by any currency symbol
pattern2 = r'(?<=[\$€¥])\d+'
amounts = re.findall(pattern2, text)
print(amounts)  # ['100', '50', '200']

# Extract file extensions
files = "image.jpg document.pdf script.py data.csv"
pattern3 = r'(?<=\.)\w+'
extensions = re.findall(pattern3, files)
print(extensions)  # ['jpg', 'pdf', 'py', 'csv']
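
One caveat worth knowing: Python's built-in re module only accepts fixed-width lookbehind patterns, so something like (?<=\d+) is rejected. A minimal sketch of what happens (the exact error message may vary by Python version):

import re

# Variable-width lookbehind is rejected by the standard re module
try:
    re.compile(r'(?<=\d+)kg')  # \d+ does not have a fixed width
except re.error as exc:
    print(exc)  # e.g. "look-behind requires fixed-width pattern"

# A fixed-width lookbehind works fine
print(re.findall(r'(?<=\d{2})kg', "10kg 5kg"))  # ['kg']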

Negative Lookbehind (?<!...)

import re

# Match only if NOT preceded by a pattern
text = "$100 €50 100 200"

# Find numbers NOT preceded by a currency symbol
pattern = r'(?<![\$€])\b\d+\b'
matches = re.findall(pattern, text)
print(matches)  # ['100', '200']

# Extract words not preceded by a hashtag
text2 = "python #coding #programming language"
pattern2 = r'(?<!#)\b\w+\b'
words = re.findall(pattern2, text2)
print(words)  # ['python', 'language']

# Extract standalone numbers (not part of an identifier)
text3 = "var1 = 10; var2 = 20; result = 30"
pattern3 = r'(?<![a-zA-Z_])\d+(?![a-zA-Z_])'
numbers = re.findall(pattern3, text3)
print(numbers)  # ['10', '20', '30']

Combined Assertions

import re

# Password validation with multiple lookaheads
def validate_strong_password(password):
    """
    Password must:
    - Be 8-20 characters
    - Contain at least one lowercase letter
    - Contain at least one uppercase letter
    - Contain at least one digit
    - Contain at least one special character
    """
    pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,20}$'
    return bool(re.match(pattern, password))

passwords = [
    "weak",              # False - too short, missing requirements
    "WeakPassword",      # False - no digit, no special char
    "Strong1",           # False - too short, no special char
    "Strong1!",          # True - meets all requirements
    "VeryStr0ng!Pass"    # True - meets all requirements
]

for pwd in passwords:
    print(f"{pwd}: {validate_strong_password(pwd)}")

Regex Flags

Flags modify regex behavior.

Common Flags

import re

text = "Python PYTHON python"

# re.IGNORECASE (re.I) - case-insensitive
matches = re.findall(r'python', text, re.IGNORECASE)
print(matches)  # ['Python', 'PYTHON', 'python']

# re.MULTILINE (re.M) - ^ and $ match line boundaries
text2 = """line 1: start
line 2: middle
line 3: end"""

# Without MULTILINE - ^ matches only the start of the string
matches2 = re.findall(r'^line', text2)
print(matches2)  # ['line']

# With MULTILINE - ^ matches the start of each line
matches3 = re.findall(r'^line', text2, re.MULTILINE)
print(matches3)  # ['line', 'line', 'line']

# re.DOTALL (re.S) - . matches newlines too
text3 = "Hello\nWorld"

# Without DOTALL - . doesn't match a newline
match1 = re.search(r'Hello.World', text3)
print(match1)  # None

# With DOTALL - . matches the newline
match2 = re.search(r'Hello.World', text3, re.DOTALL)
print(match2.group())  # Hello\nWorld

# re.VERBOSE (re.X) - allow comments and whitespace
pattern = r'''
    ^                     # Start of string
    (?P<protocol>https?)  # HTTP or HTTPS
    ://                   # Separator
    (?P<domain>[\w.]+)    # Domain name
    (?P<path>/[\w./]*)?   # Optional path
    $                     # End of string
'''
url = "https://example.com/path"
match = re.search(pattern, url, re.VERBOSE)
if match:
    print(match.groupdict())
# {'protocol': 'https', 'domain': 'example.com', 'path': '/path'}

Combining Flags

import re

# Combine flags with | (bitwise OR)
text = """EMAIL: john@example.com
email: jane@test.org
EmAiL: admin@site.net"""

# Case-insensitive + multiline
pattern = r'^email:\s*(\S+)$'
matches = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)
print(matches)
# ['john@example.com', 'jane@test.org', 'admin@site.net']

# Inline flags with (?imsx)
pattern2 = r'(?im)^email:\s*(\S+)$'  # Same as above
matches2 = re.findall(pattern2, text)
print(matches2)
# ['john@example.com', 'jane@test.org', 'admin@site.net']
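
Flags can also be attached once at compile time, which keeps repeated calls short. A small sketch reusing the idea above (the addresses are placeholder values):

import re

text = """EMAIL: john@example.com
email: jane@test.org"""

# Pass the flags to re.compile once; every method on the compiled object uses them
EMAIL_LINE_RE = re.compile(r'^email:\s*(\S+)$', re.IGNORECASE | re.MULTILINE)
print(EMAIL_LINE_RE.findall(text))
# ['john@example.com', 'jane@test.org']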

Advanced Text Substitution

sub() and subn()

import re

# Simple substitution
text = "Hello World"
result = re.sub(r'World', 'Python', text)
print(result)  # Hello Python

# subn() returns a tuple (result, count)
text2 = "cat dog cat bird cat"
result2, count = re.subn(r'cat', 'mouse', text2)
print(result2)  # mouse dog mouse bird mouse
print(count)    # 3

# Substitution with a function
def uppercase_match(match):
    return match.group().upper()

text3 = "hello world python"
result3 = re.sub(r'\w+', uppercase_match, text3)
print(result3)  # HELLO WORLD PYTHON

# Advanced: calculate inside the replacement
text4 = "The price is $50 and $30"

def add_tax(match):
    amount = float(match.group(1))
    with_tax = amount * 1.1  # 10% tax
    return f"${with_tax:.2f}"

result4 = re.sub(r'\$(\d+)', add_tax, text4)
print(result4)  # The price is $55.00 and $33.00

Conditional Replacement

import re

# Replace based on a condition
def smart_replace(text):
    """Replace 'color' with 'colour' only in a British context."""
    def replacer(match):
        word = match.group()
        # Check the surrounding context
        if 'British' in text or 'UK' in text:
            return word.replace('color', 'colour')
        return word

    return re.sub(r'\bcolor\w*\b', replacer, text)

text1 = "I like the color red. British English."
print(smart_replace(text1))  # I like the colour red. British English.

text2 = "I like the color red. American English."
print(smart_replace(text2))  # I like the color red. American English.

Real-world Examples

1. SQL Injection Prevention

import re

def sanitize_sql_input(user_input):
    """Remove potentially dangerous SQL keywords and characters.

    Note: this filter is for demonstration only; parameterized queries are
    the proper protection against SQL injection.
    """
    dangerous_patterns = [
        r'\b(DROP|DELETE|INSERT|UPDATE|EXEC|EXECUTE)\b',
        r'[;\'\"\\]',
        r'--',
        r'/\*.*?\*/',
    ]

    sanitized = user_input
    for pattern in dangerous_patterns:
        sanitized = re.sub(pattern, '', sanitized, flags=re.IGNORECASE)

    return sanitized.strip()

# Test
inputs = [
    "John Doe",
    "Robert'); DROP TABLE users;--",
    "admin' OR '1'='1",
]

for inp in inputs:
    print(f"Input: {inp}")
    print(f"Sanitized: {sanitize_sql_input(inp)}")
    print()
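
Filtering input like this is only a demonstration; the dependable defense against SQL injection is parameterized queries, where the driver keeps user data separate from the SQL text. A minimal sketch using the standard sqlite3 module (the users table here is hypothetical):

import sqlite3

# Hypothetical in-memory database just for this demo
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('John Doe')")

user_input = "Robert'); DROP TABLE users;--"

# The ? placeholder sends user_input as data, never as SQL text
rows = conn.execute("SELECT * FROM users WHERE name = ?", (user_input,)).fetchall()
print(rows)  # [] - no rows match, and nothing is injected

conn.close()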

2. Markdown to HTML Converter

import re

def markdown_to_html(markdown):
    """Convert basic Markdown to HTML."""
    html = markdown

    # Headers (# to ######)
    for i in range(6, 0, -1):
        pattern = r'^' + '#' * i + r'\s+(.+)$'
        replacement = rf'<h{i}>\1</h{i}>'
        html = re.sub(pattern, replacement, html, flags=re.MULTILINE)

    # Bold (**text** or __text__)
    html = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', html)
    html = re.sub(r'__(.+?)__', r'<strong>\1</strong>', html)

    # Italic (*text* or _text_)
    html = re.sub(r'\*(.+?)\*', r'<em>\1</em>', html)
    html = re.sub(r'_(.+?)_', r'<em>\1</em>', html)

    # Images ![alt](url) - handled before links so the link rule doesn't consume them
    html = re.sub(r'!\[([^\]]*)\]\(([^)]+)\)', r'<img src="\2" alt="\1">', html)

    # Links [text](url)
    html = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'<a href="\2">\1</a>', html)

    # Code `code`
    html = re.sub(r'`([^`]+)`', r'<code>\1</code>', html)

    return html

markdown = """# Title
This is **bold** and *italic* text.
Check out [Python](https://python.org)!"""

print(markdown_to_html(markdown))

3. Email Template Engine

import re

def render_template(template, context):
    """Render a template with {{variable}} placeholders."""
    def replacer(match):
        var_name = match.group(1).strip()
        # Support nested access like user.name
        keys = var_name.split('.')
        value = context

        try:
            for key in keys:
                value = value[key]
            return str(value)
        except (KeyError, TypeError):
            return match.group(0)  # Keep the original if not found

    return re.sub(r'\{\{\s*([^}]+)\s*\}\}', replacer, template)

# Test
template = """Hello {{user.name}},

Your order #{{order.id}} for {{order.product}} has been {{order.status}}.

Total: ${{order.total}}

Thank you!"""

context = {
    'user': {'name': 'John Doe'},
    'order': {
        'id': '12345',
        'product': 'Python Book',
        'status': 'shipped',
        'total': '29.99'
    }
}

print(render_template(template, context))

4. URL Slug Generator

import re

def generate_slug(title):
    """Convert a title to a URL-friendly slug."""
    # Convert to lowercase
    slug = title.lower()

    # Replace spaces and special chars with hyphens
    slug = re.sub(r'[^\w\s-]', '', slug)
    slug = re.sub(r'[\s_]+', '-', slug)

    # Remove leading/trailing hyphens
    slug = re.sub(r'^-+|-+$', '', slug)

    # Remove duplicate hyphens
    slug = re.sub(r'-+', '-', slug)

    return slug

titles = [
    "Hello World!",
    "Python Programming: A Complete Guide",
    "10 Tips & Tricks for Django",
    "  Spaces   Everywhere  ",
]

for title in titles:
    print(f"{title} -> {generate_slug(title)}")
# Hello World! -> hello-world
# Python Programming: A Complete Guide -> python-programming-a-complete-guide
# 10 Tips & Tricks for Django -> 10-tips-tricks-for-django
#   Spaces   Everywhere   -> spaces-everywhere

5. Log Analyzer

import re
from collections import defaultdict

def analyze_logs(log_text):
    """Analyze a log file for errors and statistics."""
    # Pattern for log lines
    pattern = r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.*)'

    stats = {
        'total': 0,
        'by_level': defaultdict(int),
        'errors': [],
    }

    for line in log_text.split('\n'):
        match = re.search(pattern, line)
        if match:
            stats['total'] += 1
            level = match.group('level')
            stats['by_level'][level] += 1

            if level in ['ERROR', 'CRITICAL']:
                stats['errors'].append({
                    'timestamp': match.group('timestamp'),
                    'level': level,
                    'message': match.group('message')
                })

    return stats

log_text = """2025-10-27 10:00:00 INFO Server started
2025-10-27 10:01:00 DEBUG Processing request
2025-10-27 10:02:00 ERROR Database connection failed
2025-10-27 10:03:00 INFO Request completed
2025-10-27 10:04:00 CRITICAL System crash"""

stats = analyze_logs(log_text)
print(f"Total logs: {stats['total']}")
print(f"By level: {dict(stats['by_level'])}")
print(f"Errors found: {len(stats['errors'])}")
for error in stats['errors']:
    print(f"  [{error['timestamp']}] {error['level']}: {error['message']}")

Performance Tips

import re
import time

# 1. Compile patterns for reuse
pattern = re.compile(r'\d+')

text = "Test 123 456 789" * 1000

# Without compilation
start = time.time()
for _ in range(1000):
    re.findall(r'\d+', text)
no_compile_time = time.time() - start

# With compilation
start = time.time()
for _ in range(1000):
    pattern.findall(text)
compile_time = time.time() - start

print(f"Without compile: {no_compile_time:.4f}s")
print(f"With compile: {compile_time:.4f}s")
print(f"Speedup: {no_compile_time/compile_time:.2f}x")

# 2. Use non-capturing groups when possible
#    (?:...) is faster than (...)

# 3. Anchor patterns when possible
#    ^pattern$ is faster than an unanchored pattern

# 4. Use specific patterns instead of greedy .+
#    \w+ is faster than .+ for word matching

# 5. Avoid catastrophic backtracking
#    Bad:  r'(a+)+b' with input 'aaaaaaaaaa'
#    Good: r'a+b'
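
To see tip 5 in action, the sketch below times the vulnerable pattern against a safe rewrite on inputs that cannot match, which forces maximal backtracking (absolute timings will vary by machine and Python version):

import re
import time

# The nested quantifier in (a+)+ makes a failing match explore an
# exponential number of ways to split the 'a's between repetitions.
for n in (18, 20, 22):
    text = 'a' * n  # no 'b', so the match must fail

    start = time.time()
    re.search(r'(a+)+b', text)   # catastrophic backtracking
    slow = time.time() - start

    start = time.time()
    re.search(r'a+b', text)      # fails in linear time
    fast = time.time() - start

    print(f"n={n}: '(a+)+b' took {slow:.3f}s, 'a+b' took {fast:.6f}s")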

Best Practices

import re

# 1. Use raw strings
pattern = r'\d+'    # Good
# pattern = '\\d+'  # Bad - double escaping is error-prone

# 2. Compile patterns you reuse
EMAIL_RE = re.compile(r'^[\w\.-]+@[\w\.-]+\.\w+$')

# 3. Use named groups for clarity
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

# 4. Document complex patterns
PHONE_RE = re.compile(r'''
    ^                 # Start
    \(?\d{3}\)?       # Area code (optional parens)
    [-.\s]?           # Optional separator
    \d{3}             # Exchange
    [-.\s]?           # Optional separator
    \d{4}             # Number
    $                 # End
''', re.VERBOSE)

# 5. Protect against slow patterns on untrusted input.
#    The standard re module has no timeout parameter; keep patterns simple and
#    bound the input size, or use the third-party 'regex' package, which
#    accepts a timeout argument.
def safe_search(pattern, text, max_length=10_000):
    """Search only if the input is a reasonable size."""
    if len(text) > max_length:
        return None
    return re.search(pattern, text)

# 6. Use specific character classes
# Good: r'\w+@\w+\.\w+'
# Bad:  r'.+@.+\..+'

# 7. Test thoroughly
def test_email_validation():
    EMAIL_RE = re.compile(r'^[\w\.-]+@[\w\.-]+\.\w{2,}$')

    valid = [
        'user@example.com',
        'first.last@company.org',
        'admin@mail.co',
    ]

    invalid = [
        'invalid',
        '@example.com',
        'user@',
        'user@domain',
    ]

    for email in valid:
        assert EMAIL_RE.match(email), f"Should match: {email}"

    for email in invalid:
        assert not EMAIL_RE.match(email), f"Should not match: {email}"

    print("All tests passed!")

test_email_validation()

Practice Exercises

Exercise 1: Advanced Email Validator

Create an email validator (one possible starting point is sketched below) that:

  • Supports subdomains
  • Validates TLD length
  • Checks for consecutive dots
  • Uses named groups for the parts
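
One possible starting point (a sketch only, not a full solution; the exact rules for allowed characters and TLD length are assumptions you should adjust):

import re

EMAIL_RE = re.compile(r'''
    ^
    (?!.*\.\.)                  # no consecutive dots anywhere
    (?P<local>[\w.+-]+)         # local part
    @
    (?P<domain>(?:[\w-]+\.)+)   # domain, subdomains allowed (keeps its trailing dot)
    (?P<tld>[A-Za-z]{2,6})      # TLD of 2-6 letters
    $
''', re.VERBOSE)

for email in ["user@mail.example.com", "bad..dots@example.com", "user@example.toolongtld"]:
    match = EMAIL_RE.match(email)
    print(email, "->", match.groupdict() if match else None)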

Exercise 2: XML/HTML Parser

Extract attributes from tags:

  • Parse <tag attr="value">
  • Handle single/double quotes
  • Extract the tag name and all attributes

Exercise 3: Natural Language Parser

Parse dates from text:

  • "Oct 27, 2025"
  • "27/10/2025"
  • "2025-10-27"
  • "next Monday"

Exercise 4: Code Formatter

Format Python code:

  • Fix indentation
  • Remove trailing spaces
  • Normalize line endings
  • Add missing spaces around operators

Exercise 5: Data Extractor

Extract structured data from unstructured text:

  • Names (First Last)
  • Addresses (Street, City, ZIP)
  • Multiple formats
  • Validation

Summary

  • Capturing groups: (...), (?P<name>...), \1 backreferences
  • Non-capturing groups: (?:...)
  • Lookahead: (?=...) positive, (?!...) negative
  • Lookbehind: (?<=...) positive, (?<!...) negative
  • Flags: re.I, re.M, re.S, re.X; combine with |
  • Substitution: re.sub(), re.subn(), function replacers
  • Named groups: cleaner code with groupdict()
  • Performance: compile patterns, prefer specific patterns

Next Lesson

Lesson 8: Working with JSON! 🚀


Remember:

  • Use named groups for readability
  • Test patterns thoroughly
  • Compile for performance
  • Avoid catastrophic backtracking
  • Document complex patterns! 🎯