diff --git a/blog/202412-python-strings/index.html b/blog/202412-python-strings/index.html index 5c7cfd3..6db7769 100644 --- a/blog/202412-python-strings/index.html +++ b/blog/202412-python-strings/index.html @@ -1139,8 +1139,15 @@
Of course, bytes objects can be used in other contexts as well. For example, (1).to_bytes(4, byteorder='little')
would return the bytes representation of the
-integer 1
(in little endian).
Bytes do not necessarily have to be associated with individual code points, as is the
+case when using str.encode
. For example, suppose we want to express the
+string "a1b1"
as a byte object, where each pair of characters represents a
+byte in hex (i.e., 0xA1
followed by 0xB1
). In this case, using list("a1b1".encode())
is not appropriate, as it would return [97, 49, 98, 49]
, which
+are the ASCII codes for the characters a
, 1
, b
, and 1
, respectively. Instead, we
+should consider the additional structure and use list(bytes.fromhex("a1b1"))
,
+which results in [161, 177]
.
Bytes objects can also be used in other contexts. For instance, (1).to_bytes(4, byteorder='little')
returns the byte representation of the integer 1
+(in little-endian).
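The byte-level claims above are easy to verify interactively; a minimal sketch (not part of the original post):

```python
# str.encode yields the UTF-8 bytes of the characters themselves ...
assert list("a1b1".encode()) == [97, 49, 98, 49]

# ... while bytes.fromhex parses each pair of hex digits as one byte
assert list(bytes.fromhex("a1b1")) == [161, 177]

# integer-to-bytes conversion, least significant byte first
assert (1).to_bytes(4, byteorder="little") == b"\x01\x00\x00\x00"
```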
The design decision to have immutable strings in python has far-reaching implications related to, e.g., hashing, performance optimizations, garbage collection, and thread safety. diff --git a/search/search_index.json b/search/search_index.json index 2214aa3..40c0c0e 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome to my page!","text":"
I have worked as a senior data scientist at Kelkoo (France) and a data scientist/software engineer at ProbaYes (France). I obtained a Ph.D. degree in 2006 from Tohoku University (Japan) and an M.Sc. diploma in 1999 from the Technical University-Sofia (Bulgaria). I have been a research scientist at INRIA - Grenoble (France), and a research scientist / lecturer at AASS \u00d6rebro University (Sweden).
"},{"location":"#contact","title":"Contact","text":"I was born in Sofia, Bulgaria (20.11.1975)
"},{"location":"cv/#professional-experience","title":"Professional experience","text":"2021.03 - 2024.05: Kelkoosenior data scientist (science team)
researchcodingpython
scala
spark
HDFS
, YARN
data scientist/software engineer
researchcodingmachine learning applied to various domains
causal analysis
python
C++
CMake
based build systemsrust
research scientist (BIPOP team)
researchcodingC++
c++
matlab
research scientist/lecturer (Mobile Robotics and Olfaction Lab at AASS)
researchteachingresearch scientist (BIPOP team)
Link to my publications
"},{"location":"cv/#languages","title":"Languages","text":"manuscript
authorsbibtexpreprint
authorsbibtex@incollection {Wieber.2016,\n author = {Wieber, Pierre-Brice and Escande, Adrien and Dimitrov, Dimitar and Sherikov, Alexander},\n title = {Geometric and numerical aspects of redundancy},\n editor = {J. P. Laumond et al.},\n booktitle = {Geometric and Numerical Foundations of Movements},\n publisher = {Springer-Verlag},\n pages = {67--85},\n year = 2016}\n
Safe navigation strategies for a biped robot walking in a crowd preprint video
authorsbibtex@inproceedings{Bohorquez.2016,\n author = {Bohorquez, Nestor and Sherikov, Alexander and Dimitrov, Dimitar and Wieber, Pierre-Brice},\n title = {Safe navigation strategies for a biped robot walking in a crowd},\n booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids)},\n pages = {379--386},\n year = 2016}\n
A newton method with always feasible iterates for nonlinear model predictive control of walking in a multi-contact situation preprint video
authorsbibtex@inproceedings{Serra.2016,\n author = {Serra, Diana and Brasseur, Camille and Sherikov, Alexander and Dimitrov, Dimitar and Wieber, Pierre-Brice},\n title = {A newton method with always feasible iterates for nonlinear model predictive control of walking in a multi-contact situation},\n booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids)},\n pages = {932--937},\n year = 2016}\n
A hierarchical approach to minimum-time control of industrial robots preprint
authorsbibtex@inproceedings{Holmsi.2016,\n author = {Al~Homsi, Saed and Sherikov, Alexander and Dimitrov, Dimitar and Wieber, Pierre-Brice},\n title = {A hierarchical approach to minimum-time control of industrial robots},\n booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},\n pages = {2368--2374},\n year = 2016}\n
"},{"location":"publications/#2015","title":"2015","text":"Autonomous transport vehicles: where we are and what is missing paper
authorsbibtex@article {Andreasson.2015,\n author = {Andreasson, Henrik and Bouguerra, Abdelbaki and Cirillo, Marcello and Dimitrov, Dimitar and Driankov, Dimiter and Karlsson, Lars and et al.},\n title = {Autonomous transport vehicles: where we are and what is missing},\n journal = {IEEE Robotics and Automation Magazine (IEEE-RAM)},\n volume = {22},\n number = {1},\n pages = {64--75},\n year = 2015}\n
Model predictive motion control based on generalized dynamical movement primitives preprint
authorsbibtex@article {Krug.2015,\n author = {Krug, Robert and Dimitrov, Dimitar},\n title = {Model predictive motion control based on generalized dynamical movement primitives},\n journal = {Journal of Intelligent \\& Robotic Systems},\n volume = {77},\n number = {1},\n pages = {17--35},\n year = 2015}\n
Balancing a humanoid robot with a prioritized contact force distribution preprint video
authorsbibtex@inproceedings{Sherikov.2015,\n author = {Sherikov, Alexander and Dimitrov, Dimitar and Wieber, Pierre-Brice},\n title = {Balancing a humanoid robot with a prioritized contact force distribution},\n booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids)},\n pages = {223--228},\n year = 2015}\n
A robust linear MPC approach to online generation of 3D biped walking motion preprint video
authorsbibtex@inproceedings{Brasseur.2015,\n author = {Brasseur, Camille and Sherikov, Alexander and Collette, Cyrille and Dimitrov, Dimitar and Wieber, Pierre-Brice},\n title = {A robust linear MPC approach to online generation of 3D biped walking motion},\n booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids)},\n pages = {595--601},\n year = 2015}\n
"},{"location":"publications/#2014","title":"2014","text":"Efficiently combining task and motion planning using geometric constraints preprint
authorsbibtex@article {Lagriffoul.2014,\n author = {Lagriffoul, Fabien and Dimitrov, Dimitar and Bidot, Julien and Saffiotti, Alessandro and Karlsson, Lars},\n title = {Efficiently combining task and motion planning using geometric constraints},\n journal = {The International Journal of Robotics Research},\n volume = {33},\n number = {14},\n pages = {1726--1747},\n year = 2014}\n
Whole body motion controller with long-term balance constraints preprint video
authorsbibtex@inproceedings{Sherikov.2014,\n author = {Sherikov, Alexander and Dimitrov, Dimitar and Wieber, Pierre-Brice},\n title = {Whole body motion controller with long-term balance constraints},\n booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids)},\n pages = {444--450},\n year = 2014}\n
Multi-objective control of robots preprint
authorsbibtex@article {Dimitrov.2014,\n author = {Dimitrov, Dimitar and Wieber, Pierre-Brice and Escande, Adrien},\n title = {Multi-objective control of robots},\n journal = {Journal of the Robotics Society of Japan},\n volume = {32},\n number = {6},\n pages = {512--518},\n year = 2014}\n
"},{"location":"publications/#2013","title":"2013","text":"Representing movement primitives as implicit dynamical systems learned from multiple demonstrations preprint
authorsbibtex@inproceedings{Krug.2013,\n author = {Krug, Robert and Dimitrov, Dimitar},\n title = {Representing movement primitives as implicit dynamical systems learned from multiple demonstrations},\n booktitle = {International Conference on Advanced Robotics (ICAR)},\n pages = {1--8},\n year = 2013}\n
"},{"location":"publications/#2012","title":"2012","text":"Constraint propagation on interval bounds for dealing with geometric backtracking preprint
authorsbibtex@inproceedings{Lagriffoul.2012,\n author = {Lagriffoul, Fabien and Dimitrov, Dimitar and Saffiotti, Alessandro and Karlsson, Lars},\n title = {Constraint propagation on interval bounds for dealing with geometric backtracking},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {957--964},\n year = 2012}\n
On mission-dependent coordination of multiple vehicles under spatial and temporal constraints preprint
authorsbibtex@inproceedings{Pecora.2012,\n author = {Pecora, Federico and Cirillo, Marcello and Dimitrov, Dimitar},\n title = {On mission-dependent coordination of multiple vehicles under spatial and temporal constraints},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {5262--5269},\n year = 2012}\n
Independent contact regions based on a patch contact model preprint
authorsbibtex@inproceedings{Charusta.2012b,\n author = {Charusta, Krzysztof and Krug, Robert and Dimitrov, Dimitar and Iliev, Boyko},\n title = {Independent contact regions based on a patch contact model},\n booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},\n pages = {4162--4169},\n year = 2012}\n
Generation of independent contact regions on objects reconstructed from noisy real-world range data preprint
authorsbibtex@inproceedings{Charusta.2012a,\n author = {Charusta, Krzysztof and Krug, Robert and Stoyanov, Todor and Dimitrov, Dimitar and Iliev, Boyko},\n title = {Generation of independent contact regions on objects reconstructed from noisy real-world range data},\n booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},\n pages = {1338--1344},\n year = 2012}\n
Mapping between different kinematic structures without absolute positioning during operation paper
authorsbibtex@article {Berglund.2012,\n author = {Berglund, Erik and Iliev, Boyko and Palm, Rainer and Krug, Robert and Charusta, Krzysztof and Dimitrov, Dimitar},\n title = {Mapping between different kinematic structures without absolute positioning during operation},\n journal = {Electronics Letters},\n volume = {48},\n number = {18},\n pages = {1110--1112},\n year = 2012}\n
"},{"location":"publications/#2011","title":"2011","text":"A sparse model predictive control formulation for walking motion generation preprint presentation errata implementation
authorsbibtex@inproceedings{Dimitrov.2011b,\n author = {Dimitrov, Dimitar and Sherikov, Alexander and Wieber, Pierre-Brice},\n title = {A sparse model predictive control formulation for walking motion generation},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {2292--2299},\n year = 2011}\n
Prioritized independent contact regions for form closure grasps preprint
authorsbibtex@inproceedings{Krug.2011,\n author = {Krug, Robert and Dimitrov, Dimitar and Charusta, Krzysztof and Iliev, Boyko},\n title = {Prioritized independent contact regions for form closure grasps},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {1797--1803},\n year = 2011}\n
Walking motion generation with online foot position adaptation based on \u21131- and \u2113\u221e-norm penalty formulations preprint presentation
authorsbibtex@inproceedings{Dimitrov.2011a,\n author = {Dimitrov, Dimitar and Paolillo, Antonio and Wieber, Pierre-Brice},\n title = {Walking motion generation with online foot position adaptation based on $\\ell_1$- and $\\ell_\\infty$-norm penalty formulations},\n booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},\n pages = {3523--3529},\n year = 2011}\n
"},{"location":"publications/#2010","title":"2010","text":"Online walking motion generation with automatic foot step placement preprint
authorsbibtex@article {Herdt.2010,\n author = {Herdt, Andrei and Diedam, Holger and Wieber, Pierre-Brice and Dimitrov, Dimitar and Mombaur, Katja and Diehl, Moritz},\n title = {Online walking motion generation with automatic foot step placement},\n journal = {Advanced Robotics},\n volume = {24},\n number = {5--6},\n pages = {719--737},\n year = 2010}\n
On the efficient computation of independent contact regions for force closure grasps preprint
authorsbibtex@inproceedings{Krug.2010,\n author = {Krug, Robert and Dimitrov, Dimitar and Charusta, Krzysztof and Iliev, Boyko},\n title = {On the efficient computation of independent contact regions for force closure grasps},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {586--591},\n year = 2010}\n
An optimized linear model predictive control solver paper
authorsbibtex@incollection {Dimitrov.2010,\n author = {Dimitrov, Dimitar and Wieber, Pierre-Brice and Stasse, Olivier and Ferreau, Hans Joachim and Diedam, Holger},\n title = {An optimized linear model predictive control solver},\n editor = {Diehl, Moritz and Glineur, Fran\\c{c}ois and Jarlebring, Elias and Michiels, Wim},\n booktitle = {Recent Advances in Optimization and its Applications in Engineering},\n publisher = {Springer},\n pages = {309--318},\n year = 2010}\n
"},{"location":"publications/#2009","title":"2009","text":"An optimized linear model predictive control solver for online walking motion generation paper
authorsbibtex@inproceedings{Dimitrov.2009,\n author = {Dimitrov, Dimitar and Wieber, Pierre-Brice and Stasse, Olivier and Ferreau, Hans Joachim and Diedam, Holger},\n title = {An optimized linear model predictive control solver for online walking motion generation},\n booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},\n pages = {1171--1176},\n year = 2009}\n
Extraction of grasp related features by human dual-hand object exploration paper
authorsbibtex@inproceedings{Charusta.2009,\n author = {Charusta, Krzysztof and Dimitrov, Dimitar and Lilienthal, Achim J and Iliev, Boyko},\n title = {Extraction of grasp related features by human dual-hand object exploration},\n booktitle = {International Conference on Advanced Robotics (ICAR)},\n pages = {122--127},\n year = 2009}\n
"},{"location":"publications/#2008","title":"2008","text":"Online walking gait generation with adaptive foot positioning through linear model predictive control paper
authorsbibtex@inproceedings{Diedam.2008,\n author = {Diedam, Holger and Dimitrov, Dimitar and Wieber, Pierre-Brice and Mombaur, Katja and Diehl, Moritz},\n title = {Online walking gait generation with adaptive foot positioning through linear model predictive control},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {1121--1126},\n year = 2008}\n
On the implementation of model predictive control for on-line walking pattern generation paper
authorsbibtex@inproceedings{Dimitrov.2008,\n author = {Dimitrov, Dimitar and Wieber, Pierre-Brice and Ferreau, Hans Joachim and Diehl, Moritz},\n title = {On the implementation of model predictive control for on-line walking pattern generation},\n booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},\n pages = {2685--2690},\n year = 2008}\n
"},{"location":"publications/#2006","title":"2006","text":"On the capture of tumbling satellite by a space robot paper
authorsbibtex@inproceedings{Yoshida.2006,\n author = {Yoshida, Kazuya and Dimitrov, Dimitar and Nakanishi, Hiroki},\n title = {On the capture of tumbling satellite by a space robot},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {4127--4132},\n year = 2006}\n
Utilization of holonomic distribution control for reactionless path planning paper
authorsbibtex@inproceedings{Dimitrov.2006,\n author = {Dimitrov, Dimitar and Yoshida, Kazuya},\n title = {Utilization of holonomic distribution control for reactionless path planning},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {3387--3392},\n year = 2006}\n
Dynamics and control of space manipulators during a satellite capturing operation thesis
authorsbibtex@phdthesis {Dimitrov.thesis.2006,\n author = {Dimitrov, Dimitar},\n title = {Dynamics and control of space manipulators during a satellite capturing operation},\n school = {Tohoku University},\n year = 2006}\n
"},{"location":"publications/#2005","title":"2005","text":"Utilization of distributed momentum control for planning approaching trajectories of a space manipulator to a target satellite preprint
authorsbibtex@inproceedings{Dimitrov.2005,\n author = {Dimitrov, Dimitar and Yoshida, Kazuya},\n title = {Utilization of distributed momentum control for planning approaching trajectories of a space manipulator to a target satellite},\n booktitle = {International Symposium on Artificial Intelligence, Robotics and Automation in Space (i-SAIRAS)},\n year = 2005}\n
"},{"location":"publications/#2004","title":"2004","text":"Utilization of the bias momentum approach for capturing a tumbling satellite preprint
authorsbibtex@inproceedings{Dimitrov.2004b,\n author = {Dimitrov, Dimitar and Yoshida, Kazuya},\n title = {Utilization of the bias momentum approach for capturing a tumbling satellite},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {3333--3338},\n year = 2004}\n
Momentum distribution in a space manipulator for facilitating the post-impact control preprint
authorsbibtex@inproceedings{Dimitrov.2004a,\n author = {Dimitrov, Dimitar and Yoshida, Kazuya},\n title = {Momentum distribution in a space manipulator for facilitating the post-impact control},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {3345--3350},\n year = 2004}\n
"},{"location":"blog/","title":"Posts","text":""},{"location":"blog/202411-summer-walking-challenge/","title":"202411-summer-walking-challenge","text":"data/export_clean.csv
is all the data I need (it is a post-processed extract of the data exported from my iPhone).img/2024_summer.png
do not delete (the rest of the figures can be regenerated)No artifacts need to be generated here
emoji.py
: Download and visualize emoji (I just needed to select several nice cases for the post).verify_string_encoding.py
: Numerical verification that I have understood PEP 393 and the CPython code (not nice, but it had to be done!). In one of his lectures, Stephen Boyd said that sometimes we have to test things in a way we are not proud of: we should do it, delete the code, and never admit what happened.Walking has always been an important part of my routine. This summer I decided to collect some data. Here are the results ...
Figure 1. Daily distance (entire challenge)
Figure 2. Daily distance (post-challenge)
Figure 3. Daily distance (last month of challenge)
Figure 4. Daily step count (last month of challenge)
"},{"location":"blog/202411-summer-walking-challenge/#the-challenge","title":"The challenge","text":"On May 21st, I decided to consistently carry my phone with me while walking, not with the intention of walking more than usual, but simply to satisfy my curiosity and track the distance. As it turned out, however, I was on a journey to demonstrate, yet again, the observer effect (observation alters the phenomenon being observed). Before long, I had set targets for myself, and 85 days later, I had walked 1900 km.
Figure 1 above shows the distance walked per day (in km), along with the dates when I bought a new pair of walking shoes and when I had to discard them. Figures 3 & 4 zoom in on the final month of the challenge, during which I averaged 28 km per day (roughly 45K steps). It's interesting to observe my pace in the 85 days following the end of the challenge, as shown in Figure 2 (clearly, I couldn't reduce my walking right away).
By far the most interesting to me is Figure 5, where every point represents the average distance walked (right axis) and the total distance covered (left axis) over the past 31 days. For example, the first point indicates that I walked 500 km between May 21st and June 20th (around 16 km per day on average). The point on August 12th is a summary of what is depicted in Figure 3, and so on. The linear increase in pace is something I didn't aim for. I remember challenging myself to reach, at first, 20 km for the past month; later the target moved to 25 and, towards the end, 28. I briefly considered pushing for an average of 30 km, but after discarding my walking shoes on August 13th (which, by that point, were in terrible condition), I had difficulty adjusting to a new pair. I also felt tired, so I decided to end the challenge. Adding just two more km per day may not seem like a big deal but, trust me, it requires a level of consistency that's tough to maintain, while my kids kept insisting on going camping. Figure 6 is similar to Figure 5, but the rolling period is one week instead of 31 days (as can be seen, I did manage to hit an average of 30 km per day over the course of a week).
Somewhere in the middle of all this, I aimed to cover a marathon distance in a single day, but my daily maximum ended up being around 36 km.
Figure 5. 31-day rolling distance (entire challenge)
Figure 6. 7-day rolling distance (entire challenge)
"},{"location":"blog/202411-summer-walking-challenge/#a-typical-day","title":"A typical day","text":"I wake up around 6:30 and start the day with about half an hour of reading. Then I stretch and take a few moments to plan what I want to accomplish, aside from my walks. I have breakfast at 7:30, then head out for my first walk of the day at 8:00, which typically lasts about an hour and a half. From 10:00 to 11:00, I focus on other tasks, then have an apple and go for another walk, usually lasting an hour. Lunch is around 12:30 and by 14:00 I'm out again until about 15:00. Afterwards, I take a 30-minute nap and work on other tasks until 17:00 when I have another apple and head out for one more hour. Dinner is around 18:30, followed by my longest walk of the day, which typically lasts about two hours.
I mostly walk on flat terrain but occasionally I go hiking. There are three outdoor exercise parks near my place, and I pass by one almost every day. I usually stop to do push-ups and pull-ups, which fit perfectly into my walking routine.
All in all, this adds up to between 4 and 7 hours of walking per day.
"},{"location":"blog/202411-summer-walking-challenge/#lessons-learned","title":"Lessons learned","text":"Here is the csv data used to generate the above figures.
"},{"location":"blog/202412-python-strings/","title":"Anatomy of python strings","text":"From the docs: \"Strings are immutable sequences of Unicode code points\". This requires a bit of unpacking ...
"},{"location":"blog/202412-python-strings/#terminology","title":"Terminology","text":"From the sea of technical lingo, I will mostly use three concepts (and often abuse terminology):
SymbolA symbol is an entity that conveys meaning in a given context. It can be seen as a \"meme\" in that it represents an idea or recognized concept. For example, it can be a single character or unit of text as perceived by a human reader (regardless of the underlying primitive blocks from which it is formed). The digit 1
is a symbol, so is that letter \u00e9
, and so is the emoji \ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66.
A primitive building block for symbols. It is common to refer to a visible (i.e., a user-perceived) character as a grapheme.
Code pointUnicode code points are unsigned integers1 that map one to one with (primitive) characters. That is, to each character in the Unicode character set there is a corresponding integer code point as its index.
For example, the code point 97
corresponds to the grapheme e
. Every (primitive) character can be seen as a symbol, but the opposite is not true because there are many symbols that do not have an assigned code point. That is, some symbols are defined in terms of a sequence of characters (and thus, of code points). Such symbols are commonly referred to as grapheme clusters. An example of a grapheme cluster is \ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66 (as we will see shortly, it consists of 7 characters 4 of which are graphemes).
flowchart LR\n subgraph G0 [symbol]\n symbol{\"\u00e9\"}\n end\n subgraph G1 [as one code point]\n one_code_point[\"\u00e9 (U+00E9)\"]\n end\n subgraph G2 [as two code points]\n dispatch@{ shape: framed-circle }\n dispatch --> two_code_point_1[\"e (U+0065)\"]\n dispatch --> two_code_point_2[\"\u0301 (U+0301)\"]\n end\n symbol --> dispatch\n symbol --> one_code_point\n\n style symbol font-size:20px\n style one_code_point font-size:18px\n style two_code_point_1 font-size:18px\n style two_code_point_2 font-size:18px
In Unicode, the symbol \u00e9 can be encoded in two ways (see Unicode equivalence). First, it has a dedicated code point (which defines it as a \"primitive\" grapheme). Second, it can be represented as a combination of e and an acute accent (which makes it a grapheme cluster as well).
s1 = \"\u00e9\" # using one code point (U+00E9)\ns2 = \"e\u0301\" # using two code points (equivalent to s2 = \"e\\u0301\")\n\nassert s1 != s2\nassert len(s1) == 1\nassert len(s2) == 2\n\nfor char in s2:\n code_point = ord(char)\n print(f\"{code_point} ({hex(code_point)})\")\n
Output: (1)
M-x describe-char
in emacs
gives:
position: 1 of 1 (0%), column: 0\n character: \u00e9 (displayed as \u00e9) (codepoint 233, #o351, #xe9)\n charset: iso-8859-1 (Latin-1 (ISO/IEC 8859-1))\ncode point in charset: 0xE9\n script: latin\n syntax: w which means: word\n category: .:Base, L:Strong L2R, c:Chinese, j:Japanese, l:Latin, v:Viet\n to input: type \"C-x 8 RET e9\" or \"C-x 8 RET LATIN SMALL LETTER E WITH ACUTE\"\n buffer code: #xC3 #xA9\n file code: #xC3 #xA9 (encoded by coding system utf-8-unix)\n display: terminal code #xC3 #xA9\n\nCharacter code properties: customize what to show\n name: LATIN SMALL LETTER E WITH ACUTE\n old-name: LATIN SMALL LETTER E ACUTE\n general-category: Ll (Letter, Lowercase)\n decomposition: (101 769) ('e' ' ')\n
position: 1 of 2 (0%), column: 0\n character: e (displayed as e) (codepoint 101, #o145, #x65)\n charset: ascii (ASCII (ISO646 IRV))\ncode point in charset: 0x65\n script: latin\n syntax: w which means: word\n category: .:Base, L:Strong L2R, a:ASCII, l:Latin, r:Roman\n to input: type \"C-x 8 RET 65\" or \"C-x 8 RET LATIN SMALL LETTER E\"\n buffer code: #x65\n file code: #x65 (encoded by coding system utf-8-unix)\n display: composed to form \"e \" (see below)\n\nComposed with the following character(s) \" \" by these characters:\n e (#x65) LATIN SMALL LETTER E\n (#x301) COMBINING ACUTE ACCENT\n\nCharacter code properties: customize what to show\n name: LATIN SMALL LETTER E\n general-category: Ll (Letter, Lowercase)\n decomposition: (101) ('e')\n
101 (0x65)\n769 (0x301)\n
"},{"location":"blog/202412-python-strings/#example-a-family","title":"Example: a family","text":"flowchart TD\n %%{init: {'themeVariables': {'title': 'My Flowchart Title'}}}%%\n family1[\"\ud83d\udc69\u200d\ud83d\udc67\"]\n family2[\"\ud83d\udc69\u200d\ud83d\udc69\u200d\ud83d\udc67\"]\n family3[\"\ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66\"]\n family4[\"\ud83d\udc6a\ufe0e\"]\n family5[\"\ud83d\udc68\u200d\ud83d\udc66\u200d\ud83d\udc66\"]\n C@{ shape: framed-circle, label: \"Stop\" }\n C --> cp1[\"\ud83d\udc68\"]\n C --> cp2[\"U+200d\"]\n C --> cp3[\"\ud83d\udc69\"]\n C --> cp4[\"U+200d\"]\n C --> cp5[\"\ud83d\udc67\"]\n C --> cp6[\"U+200d\"]\n C --> cp7[\"\ud83d\udc66\"]\n family3 --> C\n\n cp1-.->cp1-hex[\"U+1f468\"]\n cp3-.->cp3-hex[\"U+1f469\"]\n cp5-.->cp5-hex[\"U+1f467\"]\n cp7-.->cp7-hex[\"U+1f466\"]\n\n style family1 font-size:50px\n style family2 font-size:50px\n style family3 font-size:50px\n style family4 font-size:50px\n style family5 font-size:50px\n style cp1 font-size:30px\n style cp2 font-size:30px\n style cp3 font-size:30px\n style cp4 font-size:30px\n style cp5 font-size:30px\n style cp6 font-size:30px\n style cp7 font-size:30px\n style cp1-hex font-size:30px\n style cp3-hex font-size:30px\n style cp5-hex font-size:30px\n style cp7-hex font-size:30px
There are various emoji symbols that portray a family. They have different semantics, which is reflected by the code points used to form them. In the representation of the middle one (depicted on the lower levels), there are 4 primitive graphemes glued together with the zero-width joiner character U+200d
. We can use list(\"\ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66\")
to get a list of characters associated with the code points that form \ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66.
Consider the string sentense = \"This \ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66 is my family!\"
. As python strings are (stored as) sequences of code points, sentense[:6]
would give \"This \ud83d\udc68\"
because \ud83d\udc68 corresponds to the first (also called a base) code point of \ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66. As can be expected sentense[:8]
returns \"This\ud83d\udc68\u200d\ud83d\udc69\"
, where the zero-width joiner is not visible2.
The situation can get tricky with symbols that may have different Unicode representations. For example len(\"L'id\u00e9e a \u00e9t\u00e9 r\u00e9\u00e9valu\u00e9e.\")
is 23, while len(\"L'ide\u0301e a e\u0301te\u0301 re\u0301e\u0301value\u0301e.\")
is 29 because all symbols e\u0301 in the latter string are encoded using two code points. One can imagine strings with a mix of representations for the same symbols which can be difficult to handle in an ad hoc manner.
The Unicode standard defines rules for identifying sequences of code points that are meant to form a particular symbol (i.e., grapheme cluster). Finding symbol boundaries is a common problem e.g., in text editors and terminal emulators. As an example, consider the following functionality from the grapheme
3 package:
import grapheme\n\nsentense = \"This \ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66 is my family!\"\n\nassert len(sentense) == 26\nassert grapheme.length(sentense) == 20\nassert not grapheme.startswith(sentense, sentense[:6])\n
"},{"location":"blog/202412-python-strings/#normalization","title":"Normalization","text":"The unicodedata
package is a part of python's standard library and can be used to normalize a string. That is, to detect symbols for which alternative Unicode encodings exist and to convert them to a given canonical form.
import unicodedata\n\ns1 = \"L'id\u00e9e a \u00e9t\u00e9 r\u00e9\u00e9valu\u00e9e.\"\nassert len(s1) == 23\n\n# each \"\u00e9\" becomes \"e\\u0301\"\ns2 = unicodedata.normalize(\"NFD\", s1) # canonical decomposition\nassert len(s2) == 29 # (1)!\n\ns3 = unicodedata.normalize(\"NFC\", s2) # canonical composition\nassert len(s3) == 23\nassert s1 == s3\nassert s1 != s2\n
NDF
canonical decomposition may contain more code points, it allows for greater flexibility of text processing in many contexts, e.g., string pattern matching.The above discussion is mostly abstract in that it makes no assumptions on how code points (ranging from 0
to 1114111
) are to be stored in memory. Starting from PEP 393, python addresses the memory storage problem in a pragmatic way by handling four cases which depend only on one parameter: the largest code point occurring in the string.
import sys\nimport unicodedata\n\ns1 = \"L'id\u00e9e a \u00e9t\u00e9 r\u00e9\u00e9valu\u00e9e.\"\ns2 = unicodedata.normalize(\"NFD\", s1)\n\nm1, m2 = max(s1), max(s2)\nprint(f\"[s1]: {ord(m1)} ( {m1} ) #bytes = {sys.getsizeof(s1)}\")\nprint(f\"[s2]: {ord(m2)} ( {m2} ) #bytes = {sys.getsizeof(s2)}\")\n
Output:
[s1]: 233 ( \u00e9 ) #bytes = 80\n[s2]: 769 ( \u0301 ) #bytes = 116\n
The largest code point for the s2
string corresponds to the combining acute accent, while for the s1
string it corresponds to \u00e9
.
The four cases are: code_point_bytes = 1 if m < 2**7 (the pure-ASCII case), 1 if m < 2**8, 2 if m < 2**16, and 4 otherwise, where m denotes the largest code point in the string s. The memory required to store s is struct_bytes + (n + 1) * code_point_bytes (the + 1 accounts for zero termination), where n is the number of code points in s and struct_bytes, the size of the C-struct that holds the data, is given by4 40 bytes in the pure-ASCII case and 56 bytes in the other three cases.
The above logic is implemented in the string_bytes
function below5.
def string_bytes(s):\n numb_code_points, max_code_points = len(s), ord(max(s))\n\n # C-structs in cpython/Objects/unicodeobject.c\n # ----------------------------------------------\n # ASCII (use PyASCIIObject):\n # 2 x ssize_t = 16\n # 6 x unsigned int = 24\n # otherwise (use PyCompactUnicodeObject):\n # 1 x PyASCIIObject = 40\n # 1 x ssize_t = 8\n # 1 x char * = 8\n # assuming a x86_64 architecture\n struct_bytes = 56\n if max_code_points < 2**7:\n code_point_bytes = 1\n struct_bytes = 40\n elif max_code_points < 2**8:\n code_point_bytes = 1\n elif max_code_points < 2**16:\n code_point_bytes = 2\n else:\n code_point_bytes = 4\n\n # `+ 1` for zero termination\n # the result is identical with sys.getsizeof(s)\n return struct_bytes + (numb_code_points + 1) * code_point_bytes\n
For the above example, s1
is 56 + (23 + 1) * 1 = 80
bytes because it falls in the second case as its largest code point is 233. The string s2
, on the other hand, falls in the third case because the acute accent has a code point above 255 (so its size is 56 + (29 + 1) * 2 = 116
bytes).
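These two sizes can be cross-checked with sys.getsizeof. A minimal sketch, assuming a 64-bit CPython; only the size difference is asserted, since it does not depend on the header size (both strings share the same non-ASCII C-struct), while the absolute values of 80 and 116 bytes assume CPython 3.12 on x86_64:

```python
import sys
import unicodedata

s1 = "L'id\u00e9e a \u00e9t\u00e9 r\u00e9\u00e9valu\u00e9e."  # 23 code points, largest = 233
s2 = unicodedata.normalize('NFD', s1)       # 29 code points, largest = 769

# Both strings use the same (non-ASCII) C-struct header, so the size
# difference depends only on the payloads: (29 + 1) * 2 vs (23 + 1) * 1 bytes.
assert sys.getsizeof(s2) - sys.getsizeof(s1) == 30 * 2 - 24 * 1  # 36 bytes
```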
Three clear advantages of the PEP 393 approach:
On the flip-side, concatenating a single emoji to an ASCII string increases the size x 4.
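This blow-up is easy to observe (a sketch, assuming CPython; exact header sizes vary across versions, so only coarse bounds are asserted):

```python
import sys

ascii_str = 'x' * 1000                  # every code point fits in 1 byte
emoji_str = ascii_str + '\U0001F600'    # one emoji forces 4 bytes per code point

# roughly 1 byte per character ...
assert sys.getsizeof(ascii_str) < 2 * len(ascii_str)
# ... versus at least 4 bytes per character after appending a single emoji
assert sys.getsizeof(emoji_str) >= 4 * len(emoji_str)
```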
"},{"location":"blog/202412-python-strings/#code-units","title":"Code units","text":"The building block used to actually store a code point in memory is often called a code unit. For example, consider the acute accent (U+0301
):
flowchart TD\n %%{init: {'themeVariables': {'title': 'My Flowchart Title'}}}%%\n\n s[\"0x301\"]\n s --> utf8[\"UTF-8\"]\n s --> utf16[\"UTF-16\"]\n s --> utf32[\"UTF-32\"]\n\n C@{ shape: framed-circle, label: \"Stop\" }\n C -.-> utf8-1[\"0xCC\"]\n C -.-> utf8-2[\"0x81\"]\n\n utf8 -.-> C\n utf16 -.-> utf16-1[\"0x0103\"]\n utf32 -.-> utf16-2[\"0x01030000\"]\n\n style utf8 stroke-width:2px,stroke-dasharray: 5 5\n style utf16 stroke-width:2px,stroke-dasharray: 5 5\n style utf32 stroke-width:2px,stroke-dasharray: 5 5
utf-8
encoding there are two 8-bit code units (0xCC
and 0x81
)utf-16
encoding there is one 16-bit code unit (0x0103
)utf-32
encoding there is one 32-bit code unit (0x01030000
).Python uses a different encoding in each of the four cases discussed above.
For example, the string mess
in the snippet below has 8 code points and a largest code point of 65039 (255 < 65039 < 65536), hence we are in case 3 in which UTF-16 encoding should be used. At the end, the encoding computed manually is compared7 with the actual memory occupied by our string.
mess = \"I\u2665\ufe0f\u65e5\u672c\u0413\u041e\u00a9\"\n\nassert len(mess) == 8\nassert ord(max(mess)) == 65039 # case 3: 255 < 65039 < 65536\n\n# [2:] removes the Byte Order Mark (little-endian)\nencoding = b''.join([char.encode(\"utf-16\")[2:] for char in mess]).hex()\n\nassert string_bytes(mess) == 74 # 56 + (8 + 1) * 2\nassert len(encoding) == 32 # i.e., 16 bytes as it is in hex\nassert encoding == \"490065260ffee5652c6713041e04a900\"\n\n# --------------------------------------------------------------------------\n# compare to groundtruth (this is a hack!)\n# --------------------------------------------------------------------------\nimport ctypes\nimport sys\n\ndef memory_dump(string):\n address = id(string) # assuming CPython\n buffer = (ctypes.c_char * sys.getsizeof(string)).from_address(address)\n return bytes(buffer)\n\n# [56:] removes what we called struct_bytes above (in CPython they come first)\n# [:-2] removes the zero termination bytes\nassert memory_dump(mess)[56:-2].hex() == encoding\n# --------------------------------------------------------------------------\n
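The code-unit picture for U+0301 sketched earlier can be double-checked the same way with str.encode; a small sketch (the '-le' codec variants are used so that no byte order mark is prepended):

```python
accent = '\u0301'  # combining acute accent

# UTF-8: two 8-bit code units
assert accent.encode('utf-8').hex() == 'cc81'
# UTF-16: one 16-bit code unit (0x0301), laid out little-endian as bytes 01 03
assert accent.encode('utf-16-le').hex() == '0103'
# UTF-32: one 32-bit code unit, laid out little-endian as bytes 01 03 00 00
assert accent.encode('utf-32-le').hex() == '01030000'
```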
"},{"location":"blog/202412-python-strings/#bytes-objects","title":"Bytes objects","text":"As we have seen, the code units used to store a python string in memory depend on the string itself and are abstracted away from the user. While this is a good thing in many cases, sometimes we need more fine-grained control. To this end, python provides the \"bytes\" object (an immutable sequences of single bytes). Actually we already used it in the previous example as it is the return type of str.encode
.
Let us consider the string a_man = \"a\ud83d\udc68\"
. By now we know that it is stored using 4 bytes per code point. Using a_man.encode(\"utf-32\")
we obtain:
\"a\"
: 97, 0, 0, 0
\"\ud83d\udc68\"
: 104, 244, 1, 0
.If we relax the constraint of constant number of bytes per code point, we can dedicate less space to our string. Using a_man.encode(\"utf-16\")
we obtain:
\"a\"
: 97, 0
\"\ud83d\udc68\"
: 61, 216, 104, 220
or using a_man.encode(\"utf-8\")
:
\"a\"
: 97
\"\ud83d\udc68\"
: 240, 159, 145, 168
.All above representations have their applications. For example UTF-8 provides compatibility with ASCII and efficient data storage, while UTF-16 and UTF-32 allow for faster processing of a larger range of characters. Having the possibility to easily/efficiently change representations is convenient.
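These byte values can be verified directly; a small sanity check (the BOM-free '-le' codec variants are used where byte order matters):

```python
a_man = 'a\U0001F468'  # 'a' followed by the 'man' emoji

# UTF-32 (little-endian): a fixed 4 bytes per code point
assert list(a_man.encode('utf-32-le')) == [97, 0, 0, 0, 104, 244, 1, 0]

# UTF-16 (little-endian): 2 bytes for 'a', a 4-byte surrogate pair for the emoji
assert list(a_man.encode('utf-16-le')) == [97, 0, 61, 216, 104, 220]

# UTF-8: 1 byte for 'a', 4 bytes for the emoji
assert list(a_man.encode('utf-8')) == [97, 240, 159, 145, 168]
```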
Bytes do not necessarily have to be associated with individual code points, as is the case when using str.encode. For example, suppose we want to express the string \"a1b1\" as a bytes object, where each pair of characters represents a byte in hex (i.e., 0xA1 followed by 0xB1). In this case, using list(\"a1b1\".encode()) is not appropriate, as it would return [97, 49, 98, 49], which are the ASCII codes for the characters a, 1, b, and 1, respectively. Instead, we should consider the additional structure and use list(bytes.fromhex(\"a1b1\")), which results in [161, 177]. Bytes objects can also be used in other contexts. For instance, (1).to_bytes(4, byteorder='little') returns the byte representation of the integer 1 (in little-endian).
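A quick sketch of the round trip; int.from_bytes is the inverse of int.to_bytes:

```python
payload = (1).to_bytes(4, byteorder='little')
assert payload == b'\x01\x00\x00\x00'  # least significant byte first

# the inverse operation recovers the integer
assert int.from_bytes(payload, byteorder='little') == 1

# byte order matters: the same bytes read as big-endian give a different value
assert int.from_bytes(payload, byteorder='big') == 0x01000000
```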
The design decision to have immutable strings in python has far-reaching implications related to, e.g., hashing, performance optimizations, garbage collection, thread safety, etc. In addition to all this, having immutable strings was a prerequisite for the approach in PEP 393.
Often expressed as a hexadecimal number.\u00a0\u21a9
The string might be rendered as \"This \ud83d\udc68\\u200d\ud83d\udc69\"
.\u00a0\u21a9
pip install grapheme
\u21a9
Assuming an x86_64
architecture (see the string_bytes
function for more details).\u00a0\u21a9
Based on PyObject * PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
in unicodeobject.c
.\u00a0\u21a9
The smallest possible is always chosen.\u00a0\u21a9
We used a CPython
implementation of python 3.12
.\u00a0\u21a9
I have worked as a senior data scientist at Kelkoo (France) and a data scientist/software engineer at ProbaYes (France). I obtained a Ph. D. degree in 2006 from Tohoku University (Japan) and a M. Sc. diploma in 1999 from the Technical University-Sofia (Bulgaria). I have been a research scientist at INRIA - Grenoble (France), and a research scientist / lecturer at AASS \u00d6rebro University (Sweden).
"},{"location":"#contact","title":"Contact","text":"I was born in Sofia, Bulgaria (20.11.1975)
"},{"location":"cv/#professional-experience","title":"Professional experience","text":"2021.03 - 2024.05: Kelkoosenior data scientist (science team)
researchcodingpython
scala
spark
HDFS
, YARN
data scientist/software engineer
researchcodingmachine learning applied to various domains
causal analysis
python
C++
CMake
based build systemsrust
research scientist (BIPOP team)
researchcodingC++
c++
matlab
research scientist/lecturer (Mobile Robotics and Olfaction Lab at AASS)
researchteachingresearch scientist (BIPOP team)
Link to my publications
"},{"location":"cv/#languages","title":"Languages","text":"manuscript
authorsbibtexpreprint
authorsbibtex@incollection {Wieber.2016,\n author = {Wieber, Pierre-Brice and Escande, Adrien and Dimitrov, Dimitar and Sherikov, Alexander},\n title = {Geometric and numerical aspects of redundancy},\n editor = {J. P. Laumond et al.},\n booktitle = {Geometric and Numerical Foundations of Movements},\n publisher = {Springer-Verlag},\n pages = {67--85},\n year = 2016}\n
Safe navigation strategies for a biped robot walking in a crowd preprint video
authorsbibtex@inproceedings{Bohorquez.2016,\n author = {Bohorquez, Nestor and Sherikov, Alexander and Dimitrov, Dimitar and Wieber, Pierre-Brice},\n title = {Safe navigation strategies for a biped robot walking in a crowd},\n booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids)},\n pages = {379--386},\n year = 2016}\n
A newton method with always feasible iterates for nonlinear model predictive control of walking in a multi-contact situation preprint video
authorsbibtex@inproceedings{Serra.2016,\n author = {Serra, Diana and Brasseur, Camille and Sherikov, Alexander and Dimitrov, Dimitar and Wieber, Pierre-Brice},\n title = {A newton method with always feasible iterates for nonlinear model predictive control of walking in a multi-contact situation},\n booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids)},\n pages = {932--937},\n year = 2016}\n
A hierarchical approach to minimum-time control of industrial robots preprint
authorsbibtex@inproceedings{Holmsi.2016,\n author = {Al~Homsi, Saed and Sherikov, Alexander and Dimitrov, Dimitar and Wieber, Pierre-Brice},\n title = {A hierarchical approach to minimum-time control of industrial robots},\n booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},\n pages = {2368--2374},\n year = 2016}\n
"},{"location":"publications/#2015","title":"2015","text":"Autonomous transport vehicles: where we are and what is missing paper
authorsbibtex@article {Andreasson.2015,\n author = {Andreasson, Henrik and Bouguerra, Abdelbaki and Cirillo, Marcello and Dimitrov, Dimitar and Driankov, Dimiter and Karlsson, Lars and et al.},\n title = {Autonomous transport vehicles: where we are and what is missing},\n journal = {IEEE Robotics and Automation Magazine (IEEE-RAM)},\n volume = {22},\n number = {1},\n pages = {64--75},\n year = 2015}\n
Model predictive motion control based on generalized dynamical movement primitives preprint
authorsbibtex@article {Krug.2015,\n author = {Krug, Robert and Dimitrov, Dimitar},\n title = {Model predictive motion control based on generalized dynamical movement primitives},\n journal = {Journal of Intelligent \\& Robotic Systems},\n volume = {77},\n number = {1},\n pages = {17--35},\n year = 2015}\n
Balancing a humanoid robot with a prioritized contact force distribution preprint video
authorsbibtex@inproceedings{Sherikov.2015,\n author = {Sherikov, Alexander and Dimitrov, Dimitar and Wieber, Pierre-Brice},\n title = {Balancing a humanoid robot with a prioritized contact force distribution},\n booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids)},\n pages = {223--228},\n year = 2015}\n
A robust linear MPC approach to online generation of 3D biped walking motion preprint video
authorsbibtex@inproceedings{Brasseur.2015,\n author = {Brasseur, Camille and Sherikov, Alexander and Collette, Cyrille and Dimitrov, Dimitar and Wieber, Pierre-Brice},\n title = {A robust linear MPC approach to online generation of 3D biped walking motion},\n booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids)},\n pages = {595--601},\n year = 2015}\n
"},{"location":"publications/#2014","title":"2014","text":"Efficiently combining task and motion planning using geometric constraints preprint
authorsbibtex@article {Lagriffoul.2014,\n author = {Lagriffoul, Fabien and Dimitrov, Dimitar and Bidot, Julien and Saffiotti, Alessandro and Karlsson, Lars},\n title = {Efficiently combining task and motion planning using geometric constraints},\n journal = {The International Journal of Robotics Research},\n volume = {33},\n number = {14},\n pages = {1726--1747},\n year = 2014}\n
Whole body motion controller with long-term balance constraints preprint video
authorsbibtex@inproceedings{Sherikov.2014,\n author = {Sherikov, Alexander and Dimitrov, Dimitar and Wieber, Pierre-Brice},\n title = {Whole body motion controller with long-term balance constraints},\n booktitle = {IEEE-RAS International Conference on Humanoid Robots (Humanoids)},\n pages = {444--450},\n year = 2014}\n
Multi-objective control of robots preprint
authorsbibtex@article {Dimitrov.2014,\n author = {Dimitrov, Dimitar and Wieber, Pierre-Brice and Escande, Adrien},\n title = {Multi-objective control of robots},\n journal = {Journal of the Robotics Society of Japan},\n volume = {32},\n number = {6},\n pages = {512--518},\n year = 2014}\n
"},{"location":"publications/#2013","title":"2013","text":"Representing movement primitives as implicit dynamical systems learned from multiple demonstrations preprint
authorsbibtex@inproceedings{Krug.2013,\n author = {Krug, Robert and Dimitrov, Dimitar},\n title = {Representing movement primitives as implicit dynamical systems learned from multiple demonstrations},\n booktitle = {International Conference on Advanced Robotics (ICAR)},\n pages = {1--8},\n year = 2013}\n
"},{"location":"publications/#2012","title":"2012","text":"Constraint propagation on interval bounds for dealing with geometric backtracking preprint
authorsbibtex@inproceedings{Lagriffoul.2012,\n author = {Lagriffoul, Fabien and Dimitrov, Dimitar and Saffiotti, Alessandro and Karlsson, Lars},\n title = {Constraint propagation on interval bounds for dealing with geometric backtracking},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {957--964},\n year = 2012}\n
On mission-dependent coordination of multiple vehicles under spatial and temporal constraints preprint
authorsbibtex@inproceedings{Pecora.2012,\n author = {Pecora, Federico and Cirillo, Marcello and Dimitrov, Dimitar},\n title = {On mission-dependent coordination of multiple vehicles under spatial and temporal constraints},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {5262--5269},\n year = 2012}\n
Independent contact regions based on a patch contact model preprint
authorsbibtex@inproceedings{Charusta.2012b,\n author = {Charusta, Krzysztof and Krug, Robert and Dimitrov, Dimitar and Iliev, Boyko},\n title = {Independent contact regions based on a patch contact model},\n booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},\n pages = {4162--4169},\n year = 2012}\n
Generation of independent contact regions on objects reconstructed from noisy real-world range data preprint
authorsbibtex@inproceedings{Charusta.2012a,\n author = {Charusta, Krzysztof and Krug, Robert and Stoyanov, Todor and Dimitrov, Dimitar and Iliev, Boyko},\n title = {Generation of independent contact regions on objects reconstructed from noisy real-world range data},\n booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},\n pages = {1338--1344},\n year = 2012}\n
Mapping between different kinematic structures without absolute positioning during operation paper
authorsbibtex@article {Berglund.2012,\n author = {Berglund, Erik and Iliev, Boyko and Palm, Rainer and Krug, Robert and Charusta, Krzysztof and Dimitrov, Dimitar},\n title = {Mapping between different kinematic structures without absolute positioning during operation},\n journal = {Electronics Letters},\n volume = {48},\n number = {18},\n pages = {1110--1112},\n year = 2012}\n
"},{"location":"publications/#2011","title":"2011","text":"A sparse model predictive control formulation for walking motion generation preprint presentation errata implementation
authorsbibtex@inproceedings{Dimitrov.2011b,\n author = {Dimitrov, Dimitar and Sherikov, Alexander and Wieber, Pierre-Brice},\n title = {A sparse model predictive control formulation for walking motion generation},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {2292--2299},\n year = 2011}\n
Prioritized independent contact regions for form closure grasps preprint
authorsbibtex@inproceedings{Krug.2011,\n author = {Krug, Robert and Dimitrov, Dimitar and Charusta, Krzysztof and Iliev, Boyko},\n title = {Prioritized independent contact regions for form closure grasps},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {1797--1803},\n year = 2011}\n
Walking motion generation with online foot position adaptation based on - and -norm penalty formulations preprint presentation
authorsbibtex@inproceedings{Dimitrov.2011a,\n author = {Dimitrov, Dimitar and Paolillo, Antonio and Wieber, Pierre-Brice},\n title = {Walking motion generation with online foot position adaptation based on $\\ell_1$- and $\\ell_\\infty$-norm penalty formulations},\n booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},\n pages = {3523--3529},\n year = 2011}\n
"},{"location":"publications/#2010","title":"2010","text":"Online walking motion generation with automatic foot step placement preprint
authorsbibtex@article {Herdt.2010,\n author = {Herdt, Andrei and Diedam, Holger and Wieber, Pierre-Brice and Dimitrov, Dimitar and Mombaur, Katja and Diehl, Moritz},\n title = {Online walking motion generation with automatic foot step placement},\n journal = {Advanced Robotics},\n volume = {24},\n number = {5--6},\n pages = {719--737},\n year = 2010}\n
On the efficient computation of independent contact regions for force closure grasps preprint
authorsbibtex@inproceedings{Krug.2010,\n author = {Krug, Robert and Dimitrov, Dimitar and Charusta, Krzysztof and Iliev, Boyko},\n title = {On the efficient computation of independent contact regions for force closure grasps},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {586--591},\n year = 2010}\n
An optimized linear model predictive control solver paper
authorsbibtex@incollection {Dimitrov.2010,\n author = {Dimitrov, Dimitar and Wieber, Pierre-Brice and Stasse, Olivier and Ferreau, Hans Joachim and Diedam, Holger},\n title = {An optimized linear model predictive control solver},\n editor = {Diehl, Moritz and Glineur, Fran\\c{c}ois and Jarlebring, Elias and Michiels, Wim},\n booktitle = {Recent Advances in Optimization and its Applications in Engineering},\n publisher = {Springer},\n pages = {309--318},\n year = 2010}\n
"},{"location":"publications/#2009","title":"2009","text":"An optimized linear model predictive control solver for online walking motion generation paper
authorsbibtex@inproceedings{Dimitrov.2009,\n author = {Dimitrov, Dimitar and Wieber, Pierre-Brice and Stasse, Olivier and Ferreau, Hans Joachim and Diedam, Holger},\n title = {An optimized linear model predictive control solver for online walking motion generation},\n booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},\n pages = {1171--1176},\n year = 2009}\n
Extraction of grasp related features by human dual-hand object exploration paper
authorsbibtex@inproceedings{Charusta.2009,\n author = {Charusta, Krzysztof and Dimitrov, Dimitar and Lilienthal, Achim J and Iliev, Boyko},\n title = {Extraction of grasp related features by human dual-hand object exploration},\n booktitle = {International Conference on Advanced Robotics (ICAR)},\n pages = {122--127},\n year = 2009}\n
"},{"location":"publications/#2008","title":"2008","text":"Online walking gait generation with adaptive foot positioning through linear model predictive control paper
authorsbibtex@inproceedings{Diedam.2008,\n author = {Diedam, Holger and Dimitrov, Dimitar and Wieber, Pierre-Brice and Mombaur, Katja and Diehl, Moritz},\n title = {Online walking gait generation with adaptive foot positioning through linear model predictive control},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {1121--1126},\n year = 2008}\n
On the implementation of model predictive control for on-line walking pattern generation paper
authorsbibtex@inproceedings{Dimitrov.2008,\n author = {Dimitrov, Dimitar and Wieber, Pierre-Brice and Ferreau, Hans Joachim and Diehl, Moritz},\n title = {On the implementation of model predictive control for on-line walking pattern generation},\n booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},\n pages = {2685--2690},\n year = 2008}\n
"},{"location":"publications/#2006","title":"2006","text":"On the capture of tumbling satellite by a space robot paper
authorsbibtex@inproceedings{Yoshida.2006,\n author = {Yoshida, Kazuya and Dimitrov, Dimitar and Nakanishi, Hiroki},\n title = {On the capture of tumbling satellite by a space robot},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {4127--4132},\n year = 2006}\n
Utilization of holonomic distribution control for reactionless path planning paper
authorsbibtex@inproceedings{Dimitrov.2006,\n author = {Dimitrov, Dimitar and Yoshida, Kazuya},\n title = {Utilization of holonomic distribution control for reactionless path planning},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {3387--3392},\n year = 2006}\n
Dynamics and control of space manipulators during a satellite capturing operation thesis
authorsbibtex@phdthesis {Dimitrov.thesis.2006,\n author = {Dimitrov, Dimitar},\n title = {Dynamics and control of space manipulators during a satellite capturing operation},\n school = {Tohoku University},\n year = 2006}\n
"},{"location":"publications/#2005","title":"2005","text":"Utilization of distributed momentum control for planning approaching trajectories of a space manipulator to a target satellite preprint
authorsbibtex@inproceedings{Dimitrov.2005,\n author = {Dimitrov, Dimitar and Yoshida, Kazuya},\n title = {Utilization of distributed momentum control for planning approaching trajectories of a space manipulator to a target satellite},\n booktitle = {International Symposium on Artificial Intelligence, Robotics and Automation in Space (i-SAIRAS)},\n year = 2005}\n
"},{"location":"publications/#2004","title":"2004","text":"Utilization of the bias momentum approach for capturing a tumbling satellite preprint
authorsbibtex@inproceedings{Dimitrov.2004b,\n author = {Dimitrov, Dimitar and Yoshida, Kazuya},\n title = {Utilization of the bias momentum approach for capturing a tumbling satellite},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {3333--3338},\n year = 2004}\n
Momentum distribution in a space manipulator for facilitating the post-impact control preprint
authorsbibtex@inproceedings{Dimitrov.2004a,\n author = {Dimitrov, Dimitar and Yoshida, Kazuya},\n title = {Momentum distribution in a space manipulator for facilitating the post-impact control},\n booktitle = {IEEE/RSJ International Conference on Intelligent Robots and System (IROS)},\n pages = {3345--3350},\n year = 2004}\n
"},{"location":"blog/","title":"Posts","text":""},{"location":"blog/202411-summer-walking-challenge/","title":"202411-summer-walking-challenge","text":"data/export_clean.csv
is all the data I need (it is a post-processed extract of the data exported from my iphone).img/2024_summer.png
do not delete (the rest of the figures can be regenerated)No artifacts need to be generated here
emoji.py
: Download and visualize emoji (I just needed to select several nice cases for the post).verify_string_encoding.py
: Numerical verification that I have understood PEP393 and the CPython code (not nice, but had to be done!). In one of his lectures Stephen Boyd said that sometimes we have to test things in a way we are not prowd of - we should do it, delete the code and never admit what happened.Walking has always been an important part of my routine. This summer I decided to collect some data. Here are the results ...
Figure 1. Daily distance (entire challenge)
Figure 2. Daily distance (post-challenge)
Figure 3. Daily distance (last month of challenge)
Figure 4. Daily step count (last month of challenge)
"},{"location":"blog/202411-summer-walking-challenge/#the-challenge","title":"The challenge","text":"On May 21st, I decided to consistently carry my phone with me while walking, not with the intention of walking more than usual, but simply to satisfy my curiosity and track the distance. As it turned out, however, I was on a journey to prove, yet again, the Heisenberg Uncertainty Principle (that observation alters the phenomenon being observed). Before long, I had set targets for myself, and 85 days later, I had walked 1900 km.
Figure 1 above shows the distance walked per day (in km), along with the dates when I bought a new pair of walking shoes and when I had to discard them. Figures 3 & 4 zoom in on the final month of the challenge, during which I averaged 28 km per day (roughly 45K steps). It's interesting to observe my pace in the 85 days following the end of the challenge, as shown in Figure 2 (clearly, I couldn't reduce my walking right away).
By far the most interesting for me is Figure 5, every point on which represents average distance walked (right axis) and total distance covered (left axis) in the past 31 days. For example, the first point indicates that I have walked 500 km between May 21st and June 20th (which is around 16 km per day on average). The point on August 12th is a summary of what is depicted in Figure 3 and so on. The linear increase in pace is something I didn't aim for. I remember challenging myself to reach, at first, 20 km for the past month, later the target moved to 25 and towards the end 28. I briefly considered pushing for an average of 30 km, but after discarding my walking shoes on August 13th (which, by that point, were in terrible condition), I had difficulty adjusting to a new pair. I also felt tired, so I decided to end the challenge. Adding just two more km per day may not seem like a big deal but, trust me, it requires a level of consistency that's tough to maintain, while my kids kept insisting to go camping. Figure 6 is similar to Figure 5, but the rolling period is one week instead of 31 days (as can be seen, I did manage to hit an average of 30 km per day over the course of a week).
Somewhere in the middle of all this, I aimed to cover a marathon distance in a single day, but my daily maximum ended up being around 36 km.
Figure 5. 31-day rolling distance (entire challenge)
Figure 6. 7-day rolling distance (entire challenge)
"},{"location":"blog/202411-summer-walking-challenge/#a-typical-day","title":"A typical day","text":"I wake up around 6:30 and start the day with about half an hour of reading. Then I stretch and take a few moments to plan what I want to accomplish, aside from my walks. I have breakfast at 7:30, then head out for my first walk of the day at 8:00, which typically lasts about an hour and a half. From 10:00 to 11:00, I focus on other tasks, then have an apple and go for another walk, usually lasting an hour. Lunch is around 12:30 and by 14:00 I'm out again until about 15:00. Afterwards, I take a 30-minute nap and work on other tasks until 17:00 when I have another apple and head out for one more hour. Dinner is around 18:30, followed by my longest walk of the day, which typically lasts about two hours.
I mostly walk on flat terrain but occasionally I go hiking. There are three outdoor exercise parks near my place, and I pass by one almost every day. I usually stop to do push-ups and pull-ups, which fit perfectly into my walking routine.
All in all, this adds up to between 4 and 7 hours of walking per day.
"},{"location":"blog/202411-summer-walking-challenge/#lessons-learned","title":"Lessons learned","text":"Here is the csv data used to generate the above figures.
"},{"location":"blog/202412-python-strings/","title":"Anatomy of python strings","text":"From the docs: \"Strings are immutable sequences of Unicode code points\". This requires a bit of unpacking ...
"},{"location":"blog/202412-python-strings/#terminology","title":"Terminology","text":"From the sea of technical lingo, I will mostly use three concepts (and often abuse terminology):
SymbolA symbol is an entity that conveys meaning in a given context. It can be seen as a \"meme\" in that it represents an idea or recognized concept. For example, it can be a single character or unit of text as perceived by a human reader (regardless of the underlying primitive blocks from which it is formed). The digit 1
is a symbol, so is the letter é
, and so is the emoji \ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66.
A primitive building block for symbols. It is common to refer to a visible (i.e., a user-perceived) character as a grapheme.
Code pointUnicode code points are unsigned integers1 that map one to one with (primitive) characters. That is, to each character in the Unicode character set there is a corresponding integer code point as its index.
For example, the code point 97
corresponds to the grapheme e
. Every (primitive) character can be seen as a symbol, but the opposite is not true because there are many symbols that do not have an assigned code point. That is, some symbols are defined in terms of a sequence of characters (and thus, of code points). Such symbols are commonly referred to as grapheme clusters. An example of a grapheme cluster is \ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66 (as we will see shortly, it consists of 7 characters 4 of which are graphemes).
flowchart LR\n subgraph G0 [symbol]\n symbol{\"\u00e9\"}\n end\n subgraph G1 [as one code point]\n one_code_point[\"\u00e9 (U+00E9)\"]\n end\n subgraph G2 [as two code points]\n dispatch@{ shape: framed-circle }\n dispatch --> two_code_point_1[\"e (U+0065)\"]\n dispatch --> two_code_point_2[\"\u0301 (U+0301)\"]\n end\n symbol --> dispatch\n symbol --> one_code_point\n\n style symbol font-size:20px\n style one_code_point font-size:18px\n style two_code_point_1 font-size:18px\n style two_code_point_2 font-size:18px
In Unicode, the symbol \u00e9 can be encoded in two ways (see Unicode equivalence). First, it has a dedicated code point (which defines it as a \"primitive\" grapheme). Second, it can be represented as a combination of e and an acute accent (which makes it a grapheme cluster as well).
s1 = \"\u00e9\" # using one code point (U+00E9)\ns2 = \"e\u0301\" # using two code points (equivalent to s2 = \"e\\u0301\")\n\nassert s1 != s2\nassert len(s1) == 1\nassert len(s2) == 2\n\nfor char in s2:\n code_point = ord(char)\n print(f\"{code_point} ({hex(code_point)})\")\n
Output: (1)
M-x describe-char
in emacs
gives:
position: 1 of 1 (0%), column: 0\n character: \u00e9 (displayed as \u00e9) (codepoint 233, #o351, #xe9)\n charset: iso-8859-1 (Latin-1 (ISO/IEC 8859-1))\ncode point in charset: 0xE9\n script: latin\n syntax: w which means: word\n category: .:Base, L:Strong L2R, c:Chinese, j:Japanese, l:Latin, v:Viet\n to input: type \"C-x 8 RET e9\" or \"C-x 8 RET LATIN SMALL LETTER E WITH ACUTE\"\n buffer code: #xC3 #xA9\n file code: #xC3 #xA9 (encoded by coding system utf-8-unix)\n display: terminal code #xC3 #xA9\n\nCharacter code properties: customize what to show\n name: LATIN SMALL LETTER E WITH ACUTE\n old-name: LATIN SMALL LETTER E ACUTE\n general-category: Ll (Letter, Lowercase)\n decomposition: (101 769) ('e' ' ')\n
position: 1 of 2 (0%), column: 0\n character: e (displayed as e) (codepoint 101, #o145, #x65)\n charset: ascii (ASCII (ISO646 IRV))\ncode point in charset: 0x65\n script: latin\n syntax: w which means: word\n category: .:Base, L:Strong L2R, a:ASCII, l:Latin, r:Roman\n to input: type \"C-x 8 RET 65\" or \"C-x 8 RET LATIN SMALL LETTER E\"\n buffer code: #x65\n file code: #x65 (encoded by coding system utf-8-unix)\n display: composed to form \"e \" (see below)\n\nComposed with the following character(s) \" \" by these characters:\n e (#x65) LATIN SMALL LETTER E\n (#x301) COMBINING ACUTE ACCENT\n\nCharacter code properties: customize what to show\n name: LATIN SMALL LETTER E\n general-category: Ll (Letter, Lowercase)\n decomposition: (101) ('e')\n
101 (0x65)\n769 (0x301)\n
"},{"location":"blog/202412-python-strings/#example-a-family","title":"Example: a family","text":"flowchart TD\n %%{init: {'themeVariables': {'title': 'My Flowchart Title'}}}%%\n family1[\"\ud83d\udc69\u200d\ud83d\udc67\"]\n family2[\"\ud83d\udc69\u200d\ud83d\udc69\u200d\ud83d\udc67\"]\n family3[\"\ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66\"]\n family4[\"\ud83d\udc6a\ufe0e\"]\n family5[\"\ud83d\udc68\u200d\ud83d\udc66\u200d\ud83d\udc66\"]\n C@{ shape: framed-circle, label: \"Stop\" }\n C --> cp1[\"\ud83d\udc68\"]\n C --> cp2[\"U+200d\"]\n C --> cp3[\"\ud83d\udc69\"]\n C --> cp4[\"U+200d\"]\n C --> cp5[\"\ud83d\udc67\"]\n C --> cp6[\"U+200d\"]\n C --> cp7[\"\ud83d\udc66\"]\n family3 --> C\n\n cp1-.->cp1-hex[\"U+1f468\"]\n cp3-.->cp3-hex[\"U+1f469\"]\n cp5-.->cp5-hex[\"U+1f467\"]\n cp7-.->cp7-hex[\"U+1f466\"]\n\n style family1 font-size:50px\n style family2 font-size:50px\n style family3 font-size:50px\n style family4 font-size:50px\n style family5 font-size:50px\n style cp1 font-size:30px\n style cp2 font-size:30px\n style cp3 font-size:30px\n style cp4 font-size:30px\n style cp5 font-size:30px\n style cp6 font-size:30px\n style cp7 font-size:30px\n style cp1-hex font-size:30px\n style cp3-hex font-size:30px\n style cp5-hex font-size:30px\n style cp7-hex font-size:30px
There are various emoji symbols that portray a family. They have different semantics, which is reflected by the code points used to form them. In the representation of the middle one (depicted on the lower levels), there are 4 primitive graphemes glued together with the zero-width joiner character U+200d
. We can use list(\"\ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66\")
to get a list of characters associated with the code points that form \ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66.
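This decomposition can be checked directly (a small sketch; the emoji are written with their Unicode escape sequences so the individual code points are visible):

```python
# the family emoji: man + ZWJ + woman + ZWJ + girl + ZWJ + boy
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# one "character" per code point, the zero-width joiner included
print([hex(ord(c)) for c in family])
# ['0x1f468', '0x200d', '0x1f469', '0x200d', '0x1f467', '0x200d', '0x1f466']
```

Note that `len(family)` is 7, even though a single symbol is rendered.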
Consider the string sentense = \"This \ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66 is my family!\"
. As python strings are (stored as) sequences of code points, sentense[:6]
would give \"This \ud83d\udc68\"
because \ud83d\udc68 corresponds to the first (also called a base) code point of \ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66. As can be expected, sentense[:8]
returns \"This \ud83d\udc68\u200d\ud83d\udc69\"
, where the zero-width joiner is not visible2.
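The slicing behaviour described above can be condensed into a few assertions (a small sketch using escape sequences for the emoji):

```python
sentense = "This \U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466 is my family!"

# slicing stops at the base code point of the family emoji
assert sentense[:6] == "This \U0001F468"

# two more code points: the zero-width joiner and the woman emoji
assert sentense[:8] == "This \U0001F468\u200D\U0001F469"
```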
The situation can get tricky with symbols that may have different Unicode representations. For example, len(\"L'id\u00e9e a \u00e9t\u00e9 r\u00e9\u00e9valu\u00e9e.\")
is 23, while len(\"L'ide\u0301e a e\u0301te\u0301 re\u0301e\u0301value\u0301e.\")
is 29 because all symbols e\u0301 in the latter string are encoded using two code points. One can imagine strings with a mix of representations for the same symbols, which can be difficult to handle in an ad hoc manner.
The Unicode standard defines rules for identifying sequences of code points that are meant to form a particular symbol (i.e., a grapheme cluster). Finding symbol boundaries is a common problem, e.g., in text editors and terminal emulators. As an example, consider the following functionality from the grapheme
3 package:
import grapheme\n\nsentense = \"This \ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67\u200d\ud83d\udc66 is my family!\"\n\nassert len(sentense) == 26\nassert grapheme.length(sentense) == 20\nassert not grapheme.startswith(sentense, sentense[:6])\n
"},{"location":"blog/202412-python-strings/#normalization","title":"Normalization","text":"The unicodedata
package is part of python's standard library and can be used to normalize a string, that is, to detect symbols for which alternative Unicode encodings exist and to convert them to a given canonical form.
import unicodedata\n\ns1 = \"L'id\u00e9e a \u00e9t\u00e9 r\u00e9\u00e9valu\u00e9e.\"\nassert len(s1) == 23\n\n# each \"\u00e9\" becomes \"e\\u0301\"\ns2 = unicodedata.normalize(\"NFD\", s1) # canonical decomposition\nassert len(s2) == 29 # (1)!\n\ns3 = unicodedata.normalize(\"NFC\", s2) # canonical composition\nassert len(s3) == 23\nassert s1 == s3\nassert s1 != s2\n
While the NFD
canonical decomposition may contain more code points, it allows for greater flexibility of text processing in many contexts, e.g., string pattern matching.The above discussion is mostly abstract in that it makes no assumptions on how code points (ranging from 0
to 1114111
) are to be stored in memory. Starting from PEP 393, python addresses the memory storage problem in a pragmatic way by handling four cases which depend only on one parameter: the largest code point occurring in the string.
import sys\nimport unicodedata\n\ns1 = \"L'id\u00e9e a \u00e9t\u00e9 r\u00e9\u00e9valu\u00e9e.\"\ns2 = unicodedata.normalize(\"NFD\", s1)\n\nm1, m2 = max(s1), max(s2)\nprint(f\"[s1]: {ord(m1)} ( {m1} ) #bytes = {sys.getsizeof(s1)}\")\nprint(f\"[s2]: {ord(m2)} ( {m2} ) #bytes = {sys.getsizeof(s2)}\")\n
Output:
[s1]: 233 ( \u00e9 ) #bytes = 80\n[s2]: 769 ( \u0301 ) #bytes = 116\n
The largest code point for the s2
string corresponds to the combining acute accent, while for the s1
string it corresponds to \u00e9
.
The four cases are:
m(s) < 2^7: one byte per code point (pure ASCII)
2^7 <= m(s) < 2^8: one byte per code point
2^8 <= m(s) < 2^16: two bytes per code point
2^16 <= m(s): four bytes per code point
where m(s) denotes the largest code point in the string s. The memory required to store s is
struct_bytes + (n(s) + 1) * code_point_bytes
where n(s) is the number of code points in s and struct_bytes, the size of the C-struct
that holds the data, is given by4 40 bytes in the first (ASCII) case and 56 bytes in the other three.
The above logic is implemented in the string_bytes
function below5.
def string_bytes(s):\n numb_code_points, max_code_points = len(s), ord(max(s))\n\n # C-structs in cpython/Objects/unicodeobject.c\n # ----------------------------------------------\n # ASCII (use PyASCIIObject):\n # 2 x ssize_t = 16\n # 6 x unsigned int = 24\n # otherwise (use PyCompactUnicodeObject):\n # 1 x PyASCIIObject = 40\n # 1 x ssize_t = 8\n # 1 x char * = 8\n # assuming a x86_64 architecture\n struct_bytes = 56\n if max_code_points < 2**7:\n code_point_bytes = 1\n struct_bytes = 40\n elif max_code_points < 2**8:\n code_point_bytes = 1\n elif max_code_points < 2**16:\n code_point_bytes = 2\n else:\n code_point_bytes = 4\n\n # `+ 1` for zero termination\n # the result is identical with sys.getsizeof(s)\n return struct_bytes + (numb_code_points + 1) * code_point_bytes\n
For the above example, s1
is 56 + (23 + 1) * 1 = 80
bytes because it falls in the second case as its largest code point is 233. The string s2
, on the other hand, falls in the third case because the acute accent has a code point above 255 (so its size is 56 + (29 + 1) * 2 = 116
bytes).
Three clear advantages of the PEP 393 approach:
strings containing only ASCII or Latin-1 code points are stored compactly, using one byte per code point
indexing by code point (e.g., s[k]) remains a constant-time operation, since all code points of a given string occupy the same number of bytes
the distinction between narrow and wide python builds (and the related surrogate issues) disappears
On the flip-side, concatenating a single emoji to an ASCII string increases the storage per code point x 4, as the whole string is re-encoded using the widest representation.
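This effect can be observed with sys.getsizeof (a sketch; exact byte counts vary across python versions, so only the relative sizes are compared):

```python
import sys

ascii_only = "a" * 100                 # 1 byte per code point
with_emoji = "\U0001F468" + "a" * 99   # also 100 code points, but 4 bytes each

assert len(ascii_only) == len(with_emoji) == 100

# the single emoji forces the widest (4-bytes-per-code-point) representation
assert sys.getsizeof(with_emoji) > sys.getsizeof(ascii_only)
```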
"},{"location":"blog/202412-python-strings/#code-units","title":"Code units","text":"The building block used to actually store a code point in memory is often called a code unit. For example, consider the acute accent (U+0301
):
flowchart TD\n %%{init: {'themeVariables': {'title': 'My Flowchart Title'}}}%%\n\n s[\"0x301\"]\n s --> utf8[\"UTF-8\"]\n s --> utf16[\"UTF-16\"]\n s --> utf32[\"UTF-32\"]\n\n C@{ shape: framed-circle, label: \"Stop\" }\n C -.-> utf8-1[\"0xCC\"]\n C -.-> utf8-2[\"0x81\"]\n\n utf8 -.-> C\n utf16 -.-> utf16-1[\"0x0301\"]\n utf32 -.-> utf32-1[\"0x00000301\"]\n\n style utf8 stroke-width:2px,stroke-dasharray: 5 5\n style utf16 stroke-width:2px,stroke-dasharray: 5 5\n style utf32 stroke-width:2px,stroke-dasharray: 5 5
with the utf-8 encoding there are two 8-bit code units (0xCC and 0x81)
with the utf-16 encoding there is one 16-bit code unit whose value is 0x0301 (stored as the bytes 0x01 0x03 on a little-endian machine)
with the utf-32 encoding there is one 32-bit code unit whose value is 0x00000301 (stored as the bytes 0x01 0x03 0x00 0x00 on a little-endian machine).
Python uses a different encoding in each of the four cases discussed above.
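These code units can be inspected directly with str.encode (a small sketch; the -le codec variants fix the byte order to little-endian and, unlike "utf-16"/"utf-32", do not prepend a byte order mark):

```python
accent = "\u0301"  # COMBINING ACUTE ACCENT

# two 8-bit code units
assert accent.encode("utf-8").hex() == "cc81"

# one 16-bit code unit with value 0x0301, laid out little-endian
assert accent.encode("utf-16-le").hex() == "0103"

# one 32-bit code unit with value 0x00000301, laid out little-endian
assert accent.encode("utf-32-le").hex() == "01030000"
```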
For example, the string mess
in the snippet below has 8 code points, the largest of which is 65039, hence we are in case 3, in which the UTF-16 encoding should be used. At the end, the encoding computed manually is compared7 with the actual memory occupied by our string.
mess = \"I\u2665\ufe0f\u65e5\u672c\u0413\u041e\u00a9\"\n\nassert len(mess) == 8\nassert ord(max(mess)) == 65039 # case 3: 255 < 65039 < 65536\n\n# [2:] removes the Byte Order Mark (little-endian)\nencoding = b''.join([char.encode(\"utf-16\")[2:] for char in mess]).hex()\n\nassert string_bytes(mess) == 74 # 56 + (8 + 1) * 2\nassert len(encoding) == 32 # i.e., 16 bytes as it is in hex\nassert encoding == \"490065260ffee5652c6713041e04a900\"\n\n# --------------------------------------------------------------------------\n# compare to groundtruth (this is a hack!)\n# --------------------------------------------------------------------------\nimport ctypes\nimport sys\n\ndef memory_dump(string):\n address = id(string) # assuming CPython\n buffer = (ctypes.c_char * sys.getsizeof(string)).from_address(address)\n return bytes(buffer)\n\n# [56:] removes what we called struct_bytes above (in CPython they come first)\n# [:-2] removes the zero termination bytes\nassert memory_dump(mess)[56:-2].hex() == encoding\n# --------------------------------------------------------------------------\n
"},{"location":"blog/202412-python-strings/#bytes-objects","title":"Bytes objects","text":"As we have seen, the code units used to store a python string in memory depend on the string itself and are abstracted away from the user. While this is a good thing in many cases, sometimes we need more fine-grained control. To this end, python provides the \"bytes\" object (an immutable sequence of single bytes). Actually, we already used it in the previous example, as it is the return type of str.encode
.
Let us consider the string a_man = \"a\ud83d\udc68\"
. By now we know that it is stored using 4 bytes per code point. Using a_man.encode(\"utf-32\")
we obtain:
\"a\"
: 97, 0, 0, 0
\"\ud83d\udc68\"
: 104, 244, 1, 0
.If we relax the constraint of constant number of bytes per code point, we can dedicate less space to our string. Using a_man.encode(\"utf-16\")
we obtain:
\"a\"
: 97, 0
\"\ud83d\udc68\"
: 61, 216, 104, 220
or using a_man.encode(\"utf-8\")
:
\"a\"
: 97
\"\ud83d\udc68\"
: 240, 159, 145, 168
.All the above representations have their applications. For example, UTF-8 provides compatibility with ASCII and efficient data storage, while UTF-16 and UTF-32 allow for faster processing of a larger range of characters. Having the possibility to easily/efficiently change representations is convenient.
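The byte values listed above can be reproduced with str.encode (a sketch; the -le codecs are used so that no byte order mark is prepended):

```python
a_man = "a\U0001F468"  # "a" followed by the man emoji (U+1F468)

# four bytes per code point
assert list(a_man.encode("utf-32-le")) == [97, 0, 0, 0, 104, 244, 1, 0]

# in utf-16 the emoji needs a surrogate pair (two 16-bit code units)
assert list(a_man.encode("utf-16-le")) == [97, 0, 61, 216, 104, 220]

# in utf-8 "a" takes one byte and the emoji four
assert list(a_man.encode("utf-8")) == [97, 240, 159, 145, 168]
```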
Bytes do not necessarily have to be associated with individual code points, as is the case when using str.encode
. For example, suppose we want to express the string \"a1b1\"
as a bytes object, where each pair of characters represents a byte in hex (i.e., 0xA1
followed by 0xB1
). In this case, using list(\"a1b1\".encode())
is not appropriate, as it would return [97, 49, 98, 49]
, which are the ASCII codes for the characters a
, 1
, b
, and 1
, respectively. Instead, we should consider the additional structure and use list(bytes.fromhex(\"a1b1\"))
, which results in [161, 177]
.
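The contrast between the two interpretations, together with the inverse operation bytes.hex, can be checked directly:

```python
# ASCII codes of the four characters 'a', '1', 'b', '1'
assert list("a1b1".encode()) == [97, 49, 98, 49]

# the two bytes 0xA1 and 0xB1
raw = bytes.fromhex("a1b1")
assert list(raw) == [161, 177]

# bytes.hex is the inverse of bytes.fromhex
assert raw.hex() == "a1b1"
```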
Bytes objects can also be used in other contexts. For instance, (1).to_bytes(4, byteorder='little')
returns the byte representation of the integer 1 (in little-endian).
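A short sketch of the round trip between integers and bytes objects, using the built-in int.to_bytes and int.from_bytes:

```python
# the integer 1 in four little-endian bytes
raw = (1).to_bytes(4, byteorder="little")
assert raw == b"\x01\x00\x00\x00"

# int.from_bytes is the inverse operation
assert int.from_bytes(raw, byteorder="little") == 1

# the same value in big-endian byte order
assert (1).to_bytes(4, byteorder="big") == b"\x00\x00\x00\x01"
```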
The design decision to have immutable strings in python has far-reaching implications related to, e.g., hashing, performance optimizations, garbage collection, thread safety, etc. In addition to all this, having immutable strings was a prerequisite for the approach in PEP 393.
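A minimal illustration of this immutability:

```python
s = "immutable"

# in-place modification is not supported
try:
    s[0] = "I"
except TypeError:
    pass  # expected: 'str' object does not support item assignment

# "modifying" a string always produces a new object
t = s.replace("i", "I", 1)
assert t == "Immutable"
assert s == "immutable"  # the original is unchanged
```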
Often expressed as a hexadecimal number.\u00a0\u21a9
The string might be rendered as \"This \ud83d\udc68\\u200d\ud83d\udc69\"
.\u00a0\u21a9
pip install grapheme
\u21a9
Assuming an x86_64
architecture (see the string_bytes
function for more details).\u00a0\u21a9
Based on PyObject * PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
in unicodeobject.c
.\u00a0\u21a9
The smallest possible is always chosen.\u00a0\u21a9
We used a CPython
implementation of python 3.12
.\u00a0\u21a9