diff --git a/docs/18_regular_expressions.md b/docs/18_regular_expressions.md index 4492c91..968f4ed 100644 --- a/docs/18_regular_expressions.md +++ b/docs/18_regular_expressions.md @@ -1,113 +1,72 @@ -
-

30 Days Of Python: Day 18 - Regular Expressions

- - - - - Twitter Follow - - - Author: - Asabeneh Yetayeh
- First Edition: Nov 22 - Dec 22, 2019 -
-
- - -[<< Day 17](../17_Day_Exception_handling/17_exception_handling.md) | [Day 19>>](../19_Day_File_handling/19_file_handling.md) - -![30DaysOfPython](../images/30DaysOfPython_banner3@2x.png) - -- [📘 Day 18](#-day-18) - - [Regular Expressions](#regular-expressions) - - [The *re* Module](#the-re-module) - - [Methods in *re* Module](#methods-in-re-module) - - [Match](#match) - - [Search](#search) - - [Searching for All Matches Using *findall*](#searching-for-all-matches-using-findall) - - [Replacing a Substring](#replacing-a-substring) - - [Splitting Text Using RegEx Split](#splitting-text-using-regex-split) - - [Writing RegEx Patterns](#writing-regex-patterns) - - [Square Bracket](#square-bracket) - - [Escape character(\\) in RegEx](#escape-character-in-regex) - - [One or more times(+)](#one-or-more-times) - - [Period(.)](#period) - - [Zero or more times(\*)](#zero-or-more-times) - - [Zero or one time(?)](#zero-or-one-time) - - [Quantifier in RegEx](#quantifier-in-regex) - - [Cart ^](#cart-) - - [💻 Exercises: Day 18](#-exercises-day-18) - - [Exercises: Level 1](#exercises-level-1) - - [Exercises: Level 2](#exercises-level-2) - - [Exercises: Level 3](#exercises-level-3) - -# 📘 Day 18 - -## Regular Expressions - -A regular expression or RegEx is a special text string that helps to find patterns in data. A RegEx can be used to check if some pattern exists in a different data type. To use RegEx in python first we should import the RegEx module which is called *re*. - -### The *re* Module - -After importing the module we can use it to detect or find patterns. +# 18 正则表达式 + +## 正则表达式 + +正则表达式(regular expression或者RegEx)是一种特殊的文本字符串,用于在数据中查找模式patterns。正则表达式可以用于检查不同数据类型中是否存在某种模式。要在Python中使用正则表达式,首先我们应该导入称为*re*的正则表达式模块。 + +### *re*模块 + +导入模块后,我们可以使用它来检测或查找模式。 ```py import re ``` -### Methods in *re* Module +### *re*模块中的方法 -To find a pattern we use different set of *re* character sets that allows to search for a match in a string. +要查找模式,我们使用不同的*re*字符集,允许在字符串中搜索匹配项。 -- *re.match()*: searches only in the beginning of the first line of the string and returns matched objects if found, else returns None. -- *re.search*: Returns a match object if there is one anywhere in the string, including multiline strings. -- *re.findall*: Returns a list containing all matches -- *re.split*: Takes a string, splits it at the match points, returns a list -- *re.sub*: Replaces one or many matches within a string +- *re.match()*:仅在字符串的第一行开头搜索,并返回匹配的对象(如果找到),否则返回None。 +- *re.search*:如果在字符串中的任何位置都有匹配对象,则返回匹配对象,包括多行字符串。 +- *re.findall*:返回包含所有匹配项的列表 +- *re.split*:获取字符串,将其分割为匹配点,并返回列表 +- *re.sub*:替换字符串中的一个或多个匹配项 -#### Match +#### 匹配 ```py -# syntac +# 语法 re.match(substring, string, re.I) -# substring is a string or a pattern, string is the text we look for a pattern , re.I is case ignore +# substring是一个字符串或模式,string是我们要查找模式的文本,re.I是不区分大小写 ``` +>`re.I` 是 Python 中 `re`(正则表达式)模块的一个常量,代表 "IGNORECASE"。当你在使用正则表达式进行模式匹配时,加上 `re.I` 会让匹配操作忽略大小写。这意味着,无论是大写还是小写的字符都会被视为等效。 + ```py import re txt = 'I love to teach python and javaScript' -# It returns an object with span, and match +# 它返回一个具有span和match的对象 match = re.match('I love to teach', txt, re.I) print(match) # -# We can get the starting and ending position of the match as tuple using span +# 我们可以使用span作为元组获取匹配的起始和结束位置 span = match.span() print(span) # (0, 15) -# Lets find the start and stop position from the span +# 让我们从span中找到开始和结束位置 start, end = span print(start, end) # 0, 15 substring = txt[start:end] print(substring) # I love to teach ``` -As you can see from the example above, the pattern we are looking for (or the substring we are looking for) is *I love to teach*. The match function returns an object **only** if the text starts with the pattern. +正如您从上面的示例中看到的,我们正在寻找的模式(或我们正在寻找的子字符串)是*I love to teach*。只有在文本以该模式开头时,匹配函数才会返回一个对象。 ```py -import re txt = 'I love to teach python and javaScript' match = re.match('I like to teach', txt, re.I) print(match) # None + ``` -The string does not string with *I like to teach*, therefore there was no match and the match method returned None. +字符串不以*I like to teach*开头,因此没有匹配,匹配方法返回None。 -#### Search +#### 搜索 ```py -# syntax +# 语法 re.match(substring, string, re.I) -# substring is a pattern, string is the text we look for a pattern , re.I is case ignore flag +# substring是一个模式,string是我们要查找模式的文本,re.I是不区分大小写标志 ``` ```py @@ -116,48 +75,48 @@ import re txt = '''Python is the most beautiful language that a human being has ever created. I recommend python for a first programming language''' -# It returns an object with span and match +# 它返回一个具有span和match的对象 match = re.search('first', txt, re.I) print(match) # -# We can get the starting and ending position of the match as tuple using span +# 我们可以使用span作为元组获取匹配的起始和结束位置 span = match.span() print(span) # (100, 105) -# Lets find the start and stop position from the span +# 让我们从span中找到开始和停止位置 start, end = span print(start, end) # 100 105 substring = txt[start:end] print(substring) # first ``` -As you can see, search is much better than match because it can look for the pattern throughout the text. Search returns a match object with a first match that was found, otherwise it returns *None*. A much better *re* function is *findall*. This function checks for the pattern through the whole string and returns all the matches as a list. +如您所见,search比match更好,因为它可以在整个文本中查找模式。搜索返回一个匹配对象,其中包含找到的第一个匹配项,否则它将返回*None*。一个更好的*re*函数是*findall*。此函数检查整个字符串的模式并将所有匹配项作为列表返回。 -#### Searching for All Matches Using *findall* +#### 使用*findall*查找所有匹配项 -*findall()* returns all the matches as a list +*findall()*返回所有匹配项的列表 ```py txt = '''Python is the most beautiful language that a human being has ever created. I recommend python for a first programming language''' -# It return a list +# 它返回一个列表 matches = re.findall('language', txt, re.I) print(matches) # ['language', 'language'] ``` -As you can see, the word *language* was found two times in the string. Let us practice some more. -Now we will look for both Python and python words in the string: +如您所见,字符串*language*在字符串中出现了两次。现在我们来练习一下。 +现在,我们将在字符串中查找Python和python两个单词: ```py txt = '''Python is the most beautiful language that a human being has ever created. I recommend python for a first programming language''' -# It returns list +# 它返回一个列表 matches = re.findall('python', txt, re.I) print(matches) # ['Python', 'python'] ``` -Since we are using *re.I* both lowercase and uppercase letters are included. If we do not have the re.I flag, then we will have to write our pattern differently. Let us check it out: +由于我们使用了*re.I*,因此包括了大写和小写字母。如果没有re.I标志,那么我们将不得不以不同的方式编写模式。让我们来看看: ```py txt = '''Python is the most beautiful language that a human being has ever created. @@ -172,7 +131,14 @@ print(matches) # ['Python', 'python'] ``` -#### Replacing a Substring +>在正则表达式中: +> +>1. `'Python|python'` 表示匹配字符串 "Python" 或 "python"。符号 `|` 在正则表达式中用作“或”运算符,意味着它会匹配在 `|` 符号前或后的任何一个模式。所以这个表达式会匹配 "Python" 和 "python" 这两个不同的字符串。 +>2. `'[Pp]ython'` 是一个稍微不同的表达式。在正则表达式中,方括号 `[]` 用来表示一个字符集。在这个特定的字符集 `[Pp]` 中,它表示匹配大写 "P" 或小写 "p"。因此,这个表达式会匹配以 "P" 或 "p" 开始,后面跟着 "ython" 的任何字符串,这意味着它也会匹配 "Python" 和 "python"。 +> +>虽然这两个表达式的写法不同,但它们都能匹配文本中的 "Python" 和 "python"。 + +#### 替换子字符串 ```py txt = '''Python is the most beautiful language that a human being has ever created. @@ -185,10 +151,11 @@ match_replaced = re.sub('[Pp]ython', 'JavaScript', txt, re.I) print(match_replaced) # JavaScript is the most beautiful language that a human being has ever created. ``` -Let us add one more example. The following string is really hard to read unless we remove the % symbol. Replacing the % with an empty string will clean the text. +让我们再添加一个示例。以下字符串非常难以阅读,除非我们删除%符号。用 -```py +空字符串替换%将清除文本。 +```py txt = '''%I a%m te%%a%%che%r% a%n%d %% I l%o%ve te%ach%ing. T%he%re i%s n%o%th%ing as r%ewarding a%s e%duc%at%i%ng a%n%d e%m%p%ow%er%ing p%e%o%ple. I fo%und te%a%ching m%ore i%n%t%er%%es%ting t%h%an any other %jobs. @@ -204,24 +171,24 @@ There is nothing as rewarding as educating and empowering people. I found teaching more interesting than any other jobs. Does this motivate you to be a teacher? ``` -## Splitting Text Using RegEx Split +## 使用RegEx Split拆分文本 ```py txt = '''I am teacher and I love teaching. There is nothing as rewarding as educating and empowering people. I found teaching more interesting than any other jobs. Does this motivate you to be a teacher?''' -print(re.split('\n', txt)) # splitting using \n - end of line symbol +print(re.split('\n', txt)) # 使用\n —— 行尾符号进行拆分 ``` ```sh ['I am teacher and I love teaching.', 'There is nothing as rewarding as educating and empowering people.', 'I found teaching more interesting than any other jobs.', 'Does this motivate you to be a teacher?'] ``` -## Writing RegEx Patterns +## 编写RegEx模式 -To declare a string variable we use a single or double quote. To declare RegEx variable *r''*. -The following pattern only identifies apple with lowercase, to make it case insensitive either we should rewrite our pattern or we should add a flag. +要声明字符串变量,我们使用单引号或双引号。要声明正则表达式变量,使用*r''*。 +以下模式仅识别小写的apple,要使其不区分大小写,要么重新编写模式,要么添加标志。 ```py import re @@ -231,52 +198,50 @@ txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor w matches = re.findall(regex_pattern, txt) print(matches) # ['apple'] -# To make case insensitive adding flag ' +# 要不区分大小写,请添加标志 ' matches = re.findall(regex_pattern, txt, re.I) print(matches) # ['Apple', 'apple'] -# or we can use a set of characters method -regex_pattern = r'[Aa]pple' # this mean the first letter could be Apple or apple +# 或者我们可以使用一组字符方法 +regex_pattern = r'[Aa]pple' # 这意味着第一个字母可以是Apple或apple matches = re.findall(regex_pattern, txt) print(matches) # ['Apple', 'apple'] - ``` -* []: A set of characters - - [a-c] means, a or b or c - - [a-z] means, any letter from a to z - - [A-Z] means, any character from A to Z - - [0-3] means, 0 or 1 or 2 or 3 - - [0-9] means any number from 0 to 9 - - [A-Za-z0-9] any single character, that is a to z, A to Z or 0 to 9 -- \\: uses to escape special characters - - \d means: match where the string contains digits (numbers from 0-9) - - \D means: match where the string does not contain digits -- . : any character except new line character(\n) -- ^: starts with - - r'^substring' eg r'^love', a sentence that starts with a word love - - r'[^abc] means not a, not b, not c. -- $: ends with - - r'substring$' eg r'love$', sentence that ends with a word love -- *: zero or more times - - r'[a]*' means a optional or it can occur many times. -- +: one or more times - - r'[a]+' means at least once (or more) -- ?: zero or one time - - r'[a]?' means zero times or once -- {3}: Exactly 3 characters -- {3,}: At least 3 characters -- {3,8}: 3 to 8 characters -- |: Either or - - r'apple|banana' means either apple or a banana -- (): Capture and group - -![Regular Expression cheat sheet](../images/regex.png) - -Let us use examples to clarify the meta characters above - -### Square Bracket - -Let us use square bracket to include lower and upper case +* `[]`:一组字符 + - `[a-c]`表示a或b或c + - `[a-z]`表示从a到z的任何字母 + - `[A-Z]`表示从A到Z的任何字符 + - `[0-3]`表示0或1或2或3 + - `[0-9]`表示从0到9的任何数字 + - `[A-Za-z0-9]`表示任何单个字符,即a到z、A到Z或0到9 + +- `\`:用于转义特殊字符 + - `\d`表示:匹配包含数字(0-9)的字符串 + - `\D`表示:匹配不包含数字的字符串 +- `.`:任何字符,除了换行符(\n) +- `^`:以...开始 + - `r'^substring'` 例如`r'^love'`,以单词love开头的句子 + - `r'[^abc]`表示不是a、不是b、不是c。 +- `$`:以...结束 + - `r'substring$'` 例如`r'love$'`,以单词love结尾的句子 +- `*`:零次或多次 + - `r'[a]*'`表示a是可选的,或者它可以出现多次。 +- `+`:一次或多次 + - `r'[a]+'`表示至少一次(或更多) +- `?`:零次或一次 + - `r'[a]?'`表示零次或一次 +- `{3}`:恰好3个字符 +- `{3,}`:至少3个字符 +- `{3,8}`:3到8个字符 +- `|`:要么是要么 + - `r'apple|banana'`表示要么是apple要么是banana +- `()`:捕获和分组 + +![正则表达式备忘单](./images/regex.png) + +### 方括号`[]` + +让我们使用方括号包含小写和大写字母 ```py regex_pattern = r'[Aa]pple' # this square bracket mean either A or a @@ -285,115 +250,114 @@ matches = re.findall(regex_pattern, txt) print(matches) # ['Apple', 'apple'] ``` -If we want to look for the banana, we write the pattern as follows: +如果我们想查找香蕉,我们将模式写成以下方式: ```py -regex_pattern = r'[Aa]pple|[Bb]anana' # this square bracket means either A or a +regex_pattern = r'[Aa]pple|[Bb]anana' # 这个方括号表示要么是A要么是a txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.' matches = re.findall(regex_pattern, txt) print(matches) # ['Apple', 'banana', 'apple', 'banana'] ``` -Using the square bracket and or operator , we manage to extract Apple, apple, Banana and banana. +使用方括号和或运算符,我们成功提取了Apple、apple、Banana和banana。 -### Escape character(\\) in RegEx +### 正则表达式中的转义字符`\` ```py -regex_pattern = r'\d' # d is a special character which means digits +regex_pattern = r'\d' # d 是一个特殊字符,表示数字 txt = 'This regular expression example was made on December 6, 2019 and revised on July 8, 2021' matches = re.findall(regex_pattern, txt) -print(matches) # ['6', '2', '0', '1', '9', '8', '2', '0', '2', '1'], this is not what we want +print(matches) # ['6', '2', '0', '1', '9', '8', '2', '0', '2', '1'],这不是我们想要的结果 ``` -### One or more times(+) +### 一次或多次`+` ```py -regex_pattern = r'\d+' # d is a special character which means digits, + mean one or more times +regex_pattern = r'\d+' # d 是一个特殊字符,表示数字,+ 表示一次或多次 txt = 'This regular expression example was made on December 6, 2019 and revised on July 8, 2021' -matches = re.findall(regex_pattern, txt) -print(matches) # ['6', '2019', '8', '2021'] - now, this is better! +print(matches) # ['6', '2019', '8', '2021'] - 现在,这更好! ``` -### Period(.) +### 句点Period`.` ```py -regex_pattern = r'[a].' # this square bracket means a and . means any character except new line +regex_pattern = r'[a].' # 这个方括号表示a,. 表示除换行符以外的任何字符 txt = '''Apple and banana are fruits''' matches = re.findall(regex_pattern, txt) print(matches) # ['an', 'an', 'an', 'a ', 'ar'] -regex_pattern = r'[a].+' # . any character, + any character one or more times +regex_pattern = r'[a].+' # . 任何字符,+ 任何字符一次或多次 matches = re.findall(regex_pattern, txt) print(matches) # ['and banana are fruits'] ``` -### Zero or more times(\*) +### 零次或多次`*` -Zero or many times. The pattern could may not occur or it can occur many times. +零次或多次。模式可以不出现,也可以出现多次。 ```py -regex_pattern = r'[a].*' # . any character, * any character zero or more times +regex_pattern = r'[a].*' # . 任何字符,* 任何字符零次或多次 txt = '''Apple and banana are fruits''' matches = re.findall(regex_pattern, txt) print(matches) # ['and banana are fruits'] ``` -### Zero or one time(?) +### 零次或一次`?` -Zero or one time. The pattern may not occur or it may occur once. +零次或一次。模式可能不出现,也可能出现一次。 ```py txt = '''I am not sure if there is a convention how to write the word e-mail. Some people write it as email others may write it as Email or E-mail.''' -regex_pattern = r'[Ee]-?mail' # ? means here that '-' is optional +regex_pattern = r'[Ee]-?mail' # ? 这里表示'-'是可选的 matches = re.findall(regex_pattern, txt) print(matches) # ['e-mail', 'email', 'Email', 'E-mail'] ``` -### Quantifier in RegEx +### 正则表达式中的量词 -We can specify the length of the substring we are looking for in a text, using a curly bracket. Let us imagine, we are interested in a substring with a length of 4 characters: +我们可以使用花括号来指定我们在文本中正在寻找的子字符串的长度。假设我们对长度为4个字符的子字符串感兴趣: ```py txt = 'This regular expression example was made on December 6, 2019 and revised on July 8, 2021' -regex_pattern = r'\d{4}' # exactly four times +regex_pattern = r'\d{4}' # 确切四次 matches = re.findall(regex_pattern, txt) print(matches) # ['2019', '2021'] txt = 'This regular expression example was made on December 6, 2019 and revised on July 8, 2021' -regex_pattern = r'\d{1, 4}' # 1 to 4 +regex_pattern = r'\d{1, 4}' # 1 到 4 次 matches = re.findall(regex_pattern, txt) print(matches) # ['6', '2019', '8', '2021'] ``` -### Cart ^ +### 脱字符/插入符号caret `^` + +- 以...开始 -- Starts with - ```py txt = 'This regular expression example was made on December 6, 2019 and revised on July 8, 2021' -regex_pattern = r'^This' # ^ means starts with +regex_pattern = r'^This' # ^ 表示以...开始 matches = re.findall(regex_pattern, txt) print(matches) # ['This'] ``` -- Negation +- 否定Negation ```py txt = 'This regular expression example was made on December 6, 2019 and revised on July 8, 2021' -regex_pattern = r'[^A-Za-z ]+' # ^ in set character means negation, not A to Z, not a to z, no space +regex_pattern = r'[^A-Za-z ]+' # ^ 集合字符中的 ^ 表示否定,不包括 A 到 Z,不包括 a 到 z,不包括空格 matches = re.findall(regex_pattern, txt) print(matches) # ['6,', '2019', '8', '2021'] ``` -## 💻 Exercises: Day 18 +## 💻 练习:第18天 -### Exercises: Level 1 +### 练习:级别1 - 1. What is the most frequent word in the following paragraph? +1. 在以下段落中,哪个单词最频繁出现? ```py - paragraph = 'I love teaching. If you do not love teaching what else can you love. I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love. +paragraph = 'I love teaching. If you do not love teaching what else can you love. I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love. ``` ```sh @@ -423,7 +387,7 @@ print(matches) # ['6,', '2019', '8', '2021'] ] ``` -2. The position of some particles on the horizontal x-axis are -12, -4, -3 and -1 in the negative direction, 0 at origin, 4 and 8 in the positive direction. Extract these numbers from this whole text and find the distance between the two furthest particles. +2. 在水平x轴上,一些粒子的位置是-12,-4,-3和-1在负方向,原点在0处,正方向为4和8。从整个文本中提取这些数字,并找到最远粒子之间的距离。 ```py points = ['-12', '-4', '-3', '-1', '0', '4', '8'] @@ -431,29 +395,26 @@ sorted_points = [-12, -4, -3, -1, -1, 0, 2, 4, 8] distance = 8 -(-12) # 20 ``` -### Exercises: Level 2 - -1. Write a pattern which identifies if a string is a valid python variable - - ```sh - is_valid_variable('first_name') # True - is_valid_variable('first-name') # False - is_valid_variable('1first_name') # False - is_valid_variable('firstname') # True - ``` +### 练习:级别2 -### Exercises: Level 3 +1. 编写一个模式,用于识别一个字符串是否是有效的Python变量 -1. Clean the following text. After cleaning, count three most frequent words in the string. + ```sh + is_valid_variable('first_name') # True + is_valid_variable('first-name') # False + is_valid_variable('1first_name') # False + is_valid_variable('firstname') # True + ``` - ```py - sentence = '''%I $am@% a %tea@cher%, &and& I lo%#ve %tea@ching%;. There $is nothing; &as& mo@re rewarding as educa@ting &and& @emp%o@wering peo@ple. ;I found tea@ching m%o@re interesting tha@n any other %jo@bs. %Do@es thi%s mo@tivate yo@u to be a tea@cher!?''' +### 练习:级别3 - print(clean_text(sentence)); - I am a teacher and I love teaching There is nothing as more rewarding as educating and empowering people I found teaching more interesting than any other jobs Does this motivate you to be a teacher - print(most_frequent_words(cleaned_text)) # [(3, 'I'), (2, 'teaching'), (2, 'teacher')] - ``` +1. 清理以下文本。清理后,计算字符串中三个最常出现的单词。 -🎉 CONGRATULATIONS ! 🎉 + ```py + sentence = '''%I $am@% a %tea@cher%, &and& I lo%#ve %tea@ching%;. There $is nothing; &as& mo@re rewarding as educa@ting &and& @emp%o@wering peo@ple. ;I found tea@ching m%o@re interesting tha@n any other %jo@bs. %Do@es thi%s mo@tivate yo@u to be a tea@cher!?''' + + print(clean_text(sentence)); + I am a teacher and I love teaching There is nothing as more rewarding as educating and empowering people I found teaching more interesting than any other jobs Does this motivate you to be a teacher + print(most_frequent_words(cleaned_text)) # [(3, 'I'), (2, 'teaching'), (2, 'teacher')] + ``` -[<< Day 17](../17_Day_Exception_handling/17_exception_handling.md) | [Day 19>>](../19_Day_File_handling/19_file_handling.md)