Collection of Regex for validating, filtering, sanitizing and finding Persian strings

mirhmousavi/Regex.Persian.Language

Regex Persian Language

Regex for Persian (Farsi) Characters

Regular expressions for Persian letters



Introduction

For historical reasons, many Arabic characters found their way into the Persian language and transformed it. In recent years, governmental and non-governmental organizations have made many efforts to restore the standing of the Persian language, and this is one of them (really?!).

✴️ Notes

  • The Persian alphabet consists of 33 characters (including hamzah) and 3 vowel marks.

  • The important part of this effort is the codepoint ranges: you can build your own regex for validating, filtering, and finding strings by putting the desired ranges into it. For example, when a string should contain only Persian words and spaces, concatenate the space codepoints and the Persian alphabet codepoints in the final regex, and so on.

  • Characters in the tables are sorted by codepoint.

  • Don't split a regex across multiple lines, because the whitespace will become part of the pattern.

  • See the tests after reading.
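
The combination described in the notes above can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the helper name `is_persian_text` is hypothetical, and the two range strings are copied from the sections further down.

```python
import re

# Ranges copied from the "Codepoints Range" sections of this README.
space_codepoints = '\u0020\u2000-\u200F\u2028-\u202F'
persian_alpha_codepoints = (
    '\u0621-\u0628\u062A-\u063A\u0641-\u0642\u0644-\u0648'
    '\u064E-\u0651\u0655\u067E\u0686\u0698\u06A9\u06AF\u06BE\u06CC'
)

# Concatenate the ranges inside a single character class. Adjacent string
# literals are joined by Python without adding whitespace to the pattern.
persian_text_re = re.compile(
    '[' + persian_alpha_codepoints + space_codepoints + ']+'
)


def is_persian_text(value):
    """Hypothetical helper: True if value holds only Persian letters and spaces."""
    return persian_text_re.fullmatch(value) is not None


# "salam donya" written with \u escapes so the sample is unambiguous.
print(is_persian_text('\u0633\u0644\u0627\u0645 \u062F\u0646\u06CC\u0627'))
print(is_persian_text('hello'))
```

`fullmatch` is used instead of `^...$` so that a trailing newline is not silently accepted. Because ZERO WIDTH NON-JOINER falls inside the space ranges, words written with a half space also pass.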


🔲 Codepoints Range

🔳 Space

These ranges include all kinds of spaces, especially ZERO WIDTH NON-JOINER, which serves as the half space and is used heavily in Persian texts, and NARROW NO-BREAK SPACE, which is similar to it.

Keyboard shortcuts

ZERO WIDTH NON-JOINER

  • Shift + B in Persian keyboard layout in Linux
  • Ctrl + Shift + 2 in Persian keyboard layout in Windows

NARROW NO-BREAK SPACE

  • Shift + Space in Persian keyboard layout in Linux

🔸 Allowed characters

U+0020
U+2000-U+200F
U+2028-U+202F
code point   character   UTF-8 bytes (hex)   name
U+0020 20 SPACE
U+2000   e2 80 80 EN QUAD
U+2001   e2 80 81 EM QUAD
U+2002   e2 80 82 EN SPACE
U+2003   e2 80 83 EM SPACE
U+2004   e2 80 84 THREE-PER-EM SPACE
U+2005   e2 80 85 FOUR-PER-EM SPACE
U+2006   e2 80 86 SIX-PER-EM SPACE
U+2007   e2 80 87 FIGURE SPACE
U+2008   e2 80 88 PUNCTUATION SPACE
U+2009   e2 80 89 THIN SPACE
U+200A   e2 80 8a HAIR SPACE
U+200B ​ e2 80 8b ZERO WIDTH SPACE
U+200C ‌ e2 80 8c ZERO WIDTH NON-JOINER
U+200D ‍ e2 80 8d ZERO WIDTH JOINER
U+200E ‎ e2 80 8e LEFT-TO-RIGHT MARK
U+200F ‏ e2 80 8f RIGHT-TO-LEFT MARK
U+2028 e2 80 a8 LINE SEPARATOR
U+2029 e2 80 a9 PARAGRAPH SEPARATOR
U+202A ‪ e2 80 aa LEFT-TO-RIGHT EMBEDDING
U+202B ‫ e2 80 ab RIGHT-TO-LEFT EMBEDDING
U+202C ‬ e2 80 ac POP DIRECTIONAL FORMATTING
U+202D ‭ e2 80 ad LEFT-TO-RIGHT OVERRIDE
U+202E ‮ e2 80 ae RIGHT-TO-LEFT OVERRIDE
U+202F   e2 80 af NARROW NO-BREAK SPACE

🔸 Implementation

# python
space_codepoints = '\u0020\u2000-\u200F\u2028-\u202F'
// php
$space_codepoints = '\x{0020}\x{2000}-\x{200F}\x{2028}-\x{202F}';
// javascript
var space_codepoints = '\u0020\u2000-\u200F\u2028-\u202F';
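
Since most of these characters are invisible, it can help to list which ones a string actually contains. The sketch below is an illustration only: `describe_spaces` is a hypothetical helper, and it relies on the standard `unicodedata` module for the character names.

```python
import re
import unicodedata

space_codepoints = '\u0020\u2000-\u200F\u2028-\u202F'
space_re = re.compile('[' + space_codepoints + ']')


def describe_spaces(text):
    """Hypothetical helper: (index, code point, Unicode name) per space-range character."""
    return [
        (m.start(), 'U+%04X' % ord(m.group()), unicodedata.name(m.group(), 'UNKNOWN'))
        for m in space_re.finditer(text)
    ]


# A word written with a half space, followed by a regular space and a Latin letter.
for entry in describe_spaces('\u0645\u06CC\u200C\u0631\u0648\u0645 a'):
    print(entry)
```

This makes it easy to tell a ZERO WIDTH NON-JOINER apart from a regular SPACE before deciding how to sanitize the input.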

🔳 Persian alphabet

🔸 Allowed characters

U+0621-U+0628
U+062A-U+063A
U+0641-U+0642
U+0644-U+0648
U+064E-U+0651
U+0655
U+067E
U+0686
U+0698
U+06A9
U+06AF
U+06BE
U+06CC
code point   character   UTF-8 bytes (hex)   name
U+0621 ء d8 a1 ARABIC LETTER HAMZA
U+0622 آ d8 a2 ARABIC LETTER ALEF WITH MADDA ABOVE
U+0623 أ d8 a3 ARABIC LETTER ALEF WITH HAMZA ABOVE
U+0624 ؤ d8 a4 ARABIC LETTER WAW WITH HAMZA ABOVE
U+0625 إ d8 a5 ARABIC LETTER ALEF WITH HAMZA BELOW
U+0626 ئ d8 a6 ARABIC LETTER YEH WITH HAMZA ABOVE
U+0627 ا d8 a7 ARABIC LETTER ALEF
U+0628 ب d8 a8 ARABIC LETTER BEH
U+062A ت d8 aa ARABIC LETTER TEH
U+062B ث d8 ab ARABIC LETTER THEH
U+062C ج d8 ac ARABIC LETTER JEEM
U+062D ح d8 ad ARABIC LETTER HAH
U+062E خ d8 ae ARABIC LETTER KHAH
U+062F د d8 af ARABIC LETTER DAL
U+0630 ذ d8 b0 ARABIC LETTER THAL
U+0631 ر d8 b1 ARABIC LETTER REH
U+0632 ز d8 b2 ARABIC LETTER ZAIN
U+0633 س d8 b3 ARABIC LETTER SEEN
U+0634 ش d8 b4 ARABIC LETTER SHEEN
U+0635 ص d8 b5 ARABIC LETTER SAD
U+0636 ض d8 b6 ARABIC LETTER DAD
U+0637 ط d8 b7 ARABIC LETTER TAH
U+0638 ظ d8 b8 ARABIC LETTER ZAH
U+0639 ع d8 b9 ARABIC LETTER AIN
U+063A غ d8 ba ARABIC LETTER GHAIN
U+0641 ف d9 81 ARABIC LETTER FEH
U+0642 ق d9 82 ARABIC LETTER QAF
U+0644 ل d9 84 ARABIC LETTER LAM
U+0645 م d9 85 ARABIC LETTER MEEM
U+0646 ن d9 86 ARABIC LETTER NOON
U+0647 ه d9 87 ARABIC LETTER HEH
U+0648 و d9 88 ARABIC LETTER WAW
U+064E َ d9 8e ARABIC FATHA
U+064F ُ d9 8f ARABIC DAMMA
U+0650 ِ d9 90 ARABIC KASRA
U+0651 ّ d9 91 ARABIC SHADDA
U+0655 ٕ d9 95 ARABIC HAMZA BELOW
U+067E پ d9 be ARABIC LETTER PEH
U+0686 چ da 86 ARABIC LETTER TCHEH
U+0698 ژ da 98 ARABIC LETTER JEH
U+06A9 ک da a9 ARABIC LETTER KEHEH
U+06AF گ da af ARABIC LETTER GAF
U+06BE ھ da be ARABIC LETTER HEH DOACHASHMEE
U+06CC ی db 8c ARABIC LETTER FARSI YEH

🔸 Implementation

# python
persian_alpha_codepoints = '\u0621-\u0628\u062A-\u063A\u0641-\u0642\u0644-\u0648\u064E-\u0651\u0655\u067E\u0686\u0698\u06A9\u06AF\u06BE\u06CC'
// php
$persian_alpha_codepoints = '\x{0621}-\x{0628}\x{062A}-\x{063A}\x{0641}-\x{0642}\x{0644}-\x{0648}\x{064E}-\x{0651}\x{0655}\x{067E}\x{0686}\x{0698}\x{06A9}\x{06AF}\x{06BE}\x{06CC}';
// javascript
var persian_alpha_codepoints = '\u0621-\u0628\u062A-\u063A\u0641-\u0642\u0644-\u0648\u064E-\u0651\u0655\u067E\u0686\u0698\u06A9\u06AF\u06BE\u06CC';
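
As a rough example of the "finding" use case, the range above can pull runs of Persian letters out of mixed text. The function name `find_persian_words` is hypothetical and not part of the repository.

```python
import re

persian_alpha_codepoints = (
    '\u0621-\u0628\u062A-\u063A\u0641-\u0642\u0644-\u0648'
    '\u064E-\u0651\u0655\u067E\u0686\u0698\u06A9\u06AF\u06BE\u06CC'
)
persian_word_re = re.compile('[' + persian_alpha_codepoints + ']+')


def find_persian_words(text):
    """Hypothetical helper: every maximal run of Persian letters in text."""
    return persian_word_re.findall(text)


# Latin text and ASCII digits are skipped; only the two Persian runs are returned.
words = find_persian_words('abc \u0633\u0644\u0627\u0645 123 \u062F\u0646\u06CC\u0627')
print(len(words))
```

Note that a word containing a ZERO WIDTH NON-JOINER is split into two runs with this pattern. To keep such words whole, add U+200C to the character class.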

🔳 Persian numbers

🔸 Allowed characters

U+06F0-U+06F9
code point   character   UTF-8 bytes (hex)   name
U+06F0 ۰ db b0 EXTENDED ARABIC-INDIC DIGIT ZERO
U+06F1 ۱ db b1 EXTENDED ARABIC-INDIC DIGIT ONE
U+06F2 ۲ db b2 EXTENDED ARABIC-INDIC DIGIT TWO
U+06F3 ۳ db b3 EXTENDED ARABIC-INDIC DIGIT THREE
U+06F4 ۴ db b4 EXTENDED ARABIC-INDIC DIGIT FOUR
U+06F5 ۵ db b5 EXTENDED ARABIC-INDIC DIGIT FIVE
U+06F6 ۶ db b6 EXTENDED ARABIC-INDIC DIGIT SIX
U+06F7 ۷ db b7 EXTENDED ARABIC-INDIC DIGIT SEVEN
U+06F8 ۸ db b8 EXTENDED ARABIC-INDIC DIGIT EIGHT
U+06F9 ۹ db b9 EXTENDED ARABIC-INDIC DIGIT NINE

🔸 Implementation

# python
persian_num_codepoints = '\u06F0-\u06F9'
// php
$persian_num_codepoints = '\x{06F0}-\x{06F9}';
// javascript
var persian_num_codepoints = '\u06F0-\u06F9';
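
One possible use of this range is validating that a field contains only Persian digits before converting it. In the sketch below, `parse_persian_int` is a hypothetical name; the conversion relies on Python's built-in `int()`, which accepts any Unicode decimal digit.

```python
import re

persian_num_codepoints = '\u06F0-\u06F9'
persian_number_re = re.compile('[' + persian_num_codepoints + ']+')


def parse_persian_int(value):
    """Hypothetical helper: int for an all-Persian-digit string, otherwise None."""
    if persian_number_re.fullmatch(value) is None:
        return None
    # int() understands Unicode decimal digits, including U+06F0-U+06F9.
    return int(value)


print(parse_persian_int('\u06F1\u06F2\u06F3'))  # Persian digits one, two, three
print(parse_persian_int('123'))                 # ASCII digits are rejected here
```

The regex check comes first because `int()` alone would also accept ASCII and Arabic-Indic digits, which this range deliberately excludes.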

🔳 Persian (Arabic) punctuation marks

🔸 Allowed characters

U+060C
U+061B
U+061F
U+0640
U+066A
U+066B
U+066C
code point   character   UTF-8 bytes (hex)   name
U+060C ، d8 8c ARABIC COMMA
U+061B ؛ d8 9b ARABIC SEMICOLON
U+061F ؟ d8 9f ARABIC QUESTION MARK
U+0640 ـ d9 80 ARABIC TATWEEL
U+066A ٪ d9 aa ARABIC PERCENT SIGN
U+066B ٫ d9 ab ARABIC DECIMAL SEPARATOR
U+066C ٬ d9 ac ARABIC THOUSANDS SEPARATOR

✴️ For more common punctuation marks such as ” | « | » | ? | ; | : | ..., see the General Punctuation page in Unicode.

🔸 Implementation

# python
punctuation_marks_codepoints = '\u060C\u061B\u061F\u0640\u066A\u066B\u066C'
// php
$punctuation_marks_codepoints = '\x{060C}\x{061B}\x{061F}\x{0640}\x{066A}\x{066B}\x{066C}';
// javascript
var punctuation_marks_codepoints = '\u060C\u061B\u061F\u0640\u066A\u066B\u066C';
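
For the "sanitizing" use case, these marks can be stripped with a single substitution. This is a sketch under the assumption that the marks should simply be removed; `strip_persian_punctuation` is a hypothetical helper name.

```python
import re

punctuation_marks_codepoints = '\u060C\u061B\u061F\u0640\u066A\u066B\u066C'
punctuation_re = re.compile('[' + punctuation_marks_codepoints + ']')


def strip_persian_punctuation(text):
    """Hypothetical helper: remove the Persian (Arabic) punctuation marks listed above."""
    return punctuation_re.sub('', text)


# ARABIC COMMA and ARABIC QUESTION MARK are removed; letters and the space stay.
cleaned = strip_persian_punctuation('\u0633\u0644\u0627\u0645\u060C \u062E\u0648\u0628\u06CC\u061F')
print(len(cleaned))
```

Because ARABIC TATWEEL (U+0640) is in the list, stretched words are also collapsed back to their plain spelling.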

🔳 Most used Arabic characters in Persian texts.

🔸 Allowed characters

U+0629
U+0643
U+0649-U+064B
U+064D
U+06D5
code point   character   UTF-8 bytes (hex)   name
U+0629 ة d8 a9 ARABIC LETTER TEH MARBUTA
U+0643 ك d9 83 ARABIC LETTER KAF
U+0649 ى d9 89 ARABIC LETTER ALEF MAKSURA
U+064A ي d9 8a ARABIC LETTER YEH
U+064B ً d9 8b ARABIC FATHATAN
U+064D ٍ d9 8d ARABIC KASRATAN
U+06D5 ە db 95 ARABIC LETTER AE

🔸 Implementation

# python
additional_arabic_characters_codepoints = '\u0629\u0643\u0649-\u064B\u064D\u06D5'
// php
$additional_arabic_characters_codepoints = '\x{0629}\x{0643}\x{0649}-\x{064B}\x{064D}\x{06D5}';
// javascript
var additional_arabic_characters_codepoints = '\u0629\u0643\u0649-\u064B\u064D\u06D5';
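
A common follow-up is to detect these characters and replace the Arabic forms with their Persian counterparts. The sketch below is an assumption-laden illustration: the mapping table is not from this repository, it covers only KAF and YEH, and both function names are hypothetical.

```python
import re

additional_arabic_characters_codepoints = '\u0629\u0643\u0649-\u064B\u064D\u06D5'
additional_arabic_re = re.compile('[' + additional_arabic_characters_codepoints + ']')

# Assumed mapping (not from the README): Arabic letter -> usual Persian counterpart.
ARABIC_TO_PERSIAN = str.maketrans({
    '\u0643': '\u06A9',  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    '\u064A': '\u06CC',  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
})


def contains_arabic_variants(text):
    """Hypothetical helper: True if text has any character from the range above."""
    return additional_arabic_re.search(text) is not None


def normalize_arabic_letters(text):
    """Hypothetical helper: swap Arabic KAF and YEH for their Persian forms."""
    return text.translate(ARABIC_TO_PERSIAN)


word = '\u0643\u062A\u0627\u0628'  # "ketab" typed with ARABIC LETTER KAF
print(contains_arabic_variants(word))
print(contains_arabic_variants(normalize_arabic_letters(word)))
```

The other characters in the range (TEH MARBUTA, ALEF MAKSURA, the tanween marks, AE) are left untouched here, since whether and how to replace them depends on the text.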

🔳 Arabic numbers

🔸 Allowed characters

U+0660-U+0669
code point   character   UTF-8 bytes (hex)   name
U+0660 ٠ d9 a0 ARABIC-INDIC DIGIT ZERO
U+0661 ١ d9 a1 ARABIC-INDIC DIGIT ONE
U+0662 ٢ d9 a2 ARABIC-INDIC DIGIT TWO
U+0663 ٣ d9 a3 ARABIC-INDIC DIGIT THREE
U+0664 ٤ d9 a4 ARABIC-INDIC DIGIT FOUR
U+0665 ٥ d9 a5 ARABIC-INDIC DIGIT FIVE
U+0666 ٦ d9 a6 ARABIC-INDIC DIGIT SIX
U+0667 ٧ d9 a7 ARABIC-INDIC DIGIT SEVEN
U+0668 ٨ d9 a8 ARABIC-INDIC DIGIT EIGHT
U+0669 ٩ d9 a9 ARABIC-INDIC DIGIT NINE

🔸 Implementation

# python
arabic_numbers_codepoints = '\u0660-\u0669'
// php
$arabic_numbers_codepoints = '\x{0660}-\x{0669}';
// javascript
var arabic_numbers_codepoints = '\u0660-\u0669';
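
Arabic-Indic digits often appear in Persian text typed with an Arabic keyboard layout. One way to convert them to Persian digits is sketched below; `arabic_to_persian_digits` is a hypothetical helper, and it relies on both digit blocks running from zero to nine in the same order.

```python
import re

arabic_numbers_codepoints = '\u0660-\u0669'
arabic_digit_re = re.compile('[' + arabic_numbers_codepoints + ']')

# Both blocks list the digits zero to nine in order, so a fixed offset
# maps U+0660-U+0669 onto U+06F0-U+06F9.
DIGIT_OFFSET = 0x06F0 - 0x0660


def arabic_to_persian_digits(text):
    """Hypothetical helper: replace Arabic-Indic digits with Persian digits."""
    return arabic_digit_re.sub(lambda m: chr(ord(m.group()) + DIGIT_OFFSET), text)


converted = arabic_to_persian_digits('\u0661\u0662\u0663')
print(converted == '\u06F1\u06F2\u06F3')
```

ASCII digits and all other characters pass through unchanged.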

🏁🏁🏁
