Arduino to MicroPython Transcompiler

Abstract

The Arduino ecosystem has 2700+ libraries written in the 'Arduino dialect' of C++. For the Portenta to gain widespread MicroPython usage it is required that:

A large number of these libraries be ported to MicroPython.
This porting must be efficient to do, to avoid time delays and uneccessary effort.

The project idea aims to make a source to source compiler (known as a Transpiler) which can:

'Translate' a given example program of a library (in Arduino) to MicroPython automatically.
Highlight any errors and parts of the code that it cannot translate effectively.
Be easy to maintain as support for more library code is added down the road.

I. Technical Details

1.1 Rationale behind approach

Writing a compiler is significantly more work than writing a simple script to mechanically convert one language syntax to another. This particular approach was chosen because of the following reasons:

It is far more flexible: An equivalent Python (or any other language script) would be long, hard to maintain and difficult to modify without affecting the rest of the program. Moreover, a compiler is designed with a seperate front end that is language agnostic, so the AST (Abstract Syntax Tree) generated by the front end can be used to generate code for target languages other than MicroPython (such as JerryScript).
A compiler handles ambiguity better: C++ can get ambiguous (an input can be interpreted multiple ways) and context sensitive (the meaning of a particular statement depends on the words before it and after it).

For example: The * operator can mean multiplication as in a*b or a pointer in int *a. If we write a C or Python script, we will have to include special cases for all such instances such as *, unary minus etc. Compilers can be designed with lookahead, so they can take into account the next few characters of an input and assign meaning accordingly. There are many such methods such as LALR parsing, GLR parsing etc.

MicroPython is constantly evolving: MicroPython is still growing and adding new features. The latest release was less than 60 days ago. This makes it essential for our project to be amenable to change, which is easier with a Compiler than a handwritten script. To add a new keyword, we just need to make an entry to the symbol table and the compiler will do the rest. Since our implementation language is C, which all Arduino developers are familiar with, this also makes the compiler easier to maintain.
Better target code optimisation: The equivalent MicroPython code is much more compact than the given Arduino code for most applications. Compilers are better suited to implement optimisation algorithms and processes like peephole optimisation.

1.2 Feasibility & Scope

Given the large scope of the project it becomes necessary to define boundaries of what is doable within the GSoC time period, while keeping in mind to have the project be useful enough to be helpful to the Arduino community long after the GSoC coding period is over.

Working with the Arduino Reference: Unlike the libraries which have a lot more C++ going on, the examples for each library itself are given in a lot more concise, easy to parse and understand 'Arduino language'. This is not by accident, since the language after all is designed to be easy to understand for beginners. The Arduino Reference defines around 200+ keywords which is a good starting point to tokenize the program input, perform semantic analysis, generate the AST which can then be transformed into the AST of the target language, and then the target language (MicroPython) itself. The MicroPython website also lists equivalent libraries, so finding equivalent keywords and functions is not difficult.
Building a generic C++ parser is very difficult: It would ideally be a lot better if we could parse the whole C++ language instead of just Arduino, but there has been research that states that "C++ grammar is ambiguous, context-dependent and potentially requires infinite lookahead to resolve some ambiguities". It makes far more sense to limit ourselves to Arduino.
Successful transpiler projects exist C++ has strict typing and has a lot more details about the size of memory allocated and resource management. So when we translate C++ code to MicroPython, no assumptions have to made. This is not true when we go from Python --> C++. Moreover, a number of successful transpilers such as Coffeescript, TypeScript, Babel are in widespread use. The Haxe project compiles to multiple target languages. Even a tool to convert C++ code to Python exists (see: pybee/seasnake).
The goal is to translate as many library examples as possible: Even if 80% or only 60% of a given example program can be translated, it still increases the speed of porting libraries over writing everything by hand.

1.3 Implementation Details

A Transpiler is basically a compiler in which the source language is converted into equivalent high level (human readable) code in another language instead of generating low level byte code or machine code. Our Transpiler consists of two parts:

The Front end: This stage has to split the input into tokens, analyse the syntax, generate a parse tree and ultimately the Abstract Syntax Tree (AST).
The Back end: To generate the target language (MicroPython).

1.3.1 Overall Compiler Design

I used Practical Approach to Compiler Construction as a reference while outlining the transpiler design.

Source Code: The Arduino sample code. It may be from inbuilt libraries, external libraries or even self written code.
Lexical Analyser: The lexical analyser is a program that takes a file (or program) as input and gives the tokens (basic units) of the given code as output. The lexical analyser must be able to identify numbers, identifiers, keywords, special characters etc. Ordinarily comments are ignored, because they are not significant to the program itself. In our program we will choose to keep the multiline comments** /* ....*/** because they usually have important details pertaining to the program (such as the creator's name and program description) while removing the single line // comments as they wont make much sense after the program is translated. I will be using Flex to create a lexical analyser as it is popular, well documented and open source.
Syntax Analysis: The syntax analysis phase also groups together the tokens generated by the above lexical analyser according to the syntax rules of the source language (Arduino). The syntax tree is constructed in this phase. For this phase I use Bison for the same reason I used Flex above, in addition to Bison having very good compatibility with Flex.
Semantic Analysis and AST creation: We need to perform type checking, traversing the tree and adding the right information to the abstract syntax tree is done at this stage.
Code Generation: We traverse through the AST created in the above steps, ordinarily most compilers create an 'Intermediate Representation' (IR) language at this stage which can be converted to byte code or machine code, but since we don't need machine code, we'll be using MicroPython as our IR language directly and converting the AST to MicroPython statement by statement linearly. Since this compiler produces source code which will be read by a human, we need to worry about readability and 'correctness' as much as the transpiler can manage.

1.3.2 Lexical Analysis

Let us take a snippet from an Arduino program, I chose the 'BarGraph' example from File--> Examples --> Display --> BarGraph in the Arduino IDE which has this typical setup() :

void setup() {
  // loop over the pin array and set them all to output:
  for (int thisLed = 0; thisLed < ledCount; thisLed++) {
    pinMode(ledPins[thisLed], OUTPUT); 
  }
}

ideally we want our lexical analyser (lexer) to give the output as something like this:

void (keyword), setup (keyword), for (keyword) , ( , ) , { , } , 0 (integer constant) , // (comment line) , thisLed (identifier) , = , < , ++ , pinMode (Keyword)

From the above we can see that the Lexical Analyser should be able to differentiate between variable names that the programmer gives (like thisLed) and words that have special meaning to the compiler like **setup() **, for, pinMode etc. We do this by listing a set of reserved words in the language specification, telling the transpiler to take special care of them. Luckily, Arduino provides us a good list of the keywords we need to have in mind in their language reference. In our transpiler, we will be keeping the following words as keywords:

		Arduino
digitalRead()	isAlpha()	Serial (All)	int	# define
digitalWrite()	isALphaNumeric()	Stream (All)	long	# include
pinMode()	isAscii()	Keyboard (All)	short	block comment
analogRead()	isControl()	Mouse (All)	size_t	single line cmt
analogReference()	isDigit()	Floating pt const (All)	string	semicolon
analogWrite()	isGraph()	Integer Const (All)	unsigned char	curly braces
analogReadResolution()	isHexadecimalDigit()	HIGH/LOW	unsigned int	remainder
analogWriteResolution()	isLowerCase()	INPUT/OUTPUT/PULLUP	unsigned long	multiplication
noTone()	isPrintable()	LED_BUILTIN	void	addition
pulseIn()	isPunct()	true	false	word
pulseInLong()	isSpace()	(unsigned int)	const	division
shiftIn()	isUpperCase()	(unsigned long)	scope	assignment
shiftOut()	isWhitespace()	byte()	static	not equal to
tone()	random()	char()	volatile	less than
delay()	randomSeed()	float()	PROGMEM	less than equal to
delayMicroseconds()	bit()	int()	sizeof()	equal to
micros()	bitClear()	long()	loop()	greater than
millis()	bitRead()	word()	setup()	greater than equal to
abs()	bitSet()	String()	break	logical not
constrain()	bitWrite()	array	continue	logical and
map()	highByte()	bool	do..while	logical or
max()	lowByte()	boolean	else	reference operator
min()	attachInterrupt()	byte	goto	dereference
pow()	detachInterrupt()	char	if	bitwise and
sq()	interrupts()	double	return	bitwise left
sqrt()	noInterrupts()	float	switch..case	bitwise right

The above is our keyword table. Please note that words with (All) suffixed have multiple functions inside them, like the Keyboard and Mouse. All of the functions in these will be implemented.
Since there are tokens such ++ (increment) and >= (greater than or equal to) we have to take special care to not interpret these as + + or as < followed by = respectively. For this we will have to implement *lookahead *in our lexer.
We need to implement the lexer to identify the following tokens: Identifiers, character constants, floating point nubers, operators, strings, keywords, even comments have to be parsed so that they can be ignored (In our case, mainly the single line comments).
We need to keep track of the indentation in the code to get a properly indented output, to do this we must note the beginning and end of the indented block and generate "start block" and "end block" tokens.
Finally, when the flex scanner returns a stream of tokens, there are two part in the output: the token itself, and a corresponding value. eg: we can set *add *to mean 258, multiply to mean 259 etc. This will help us keep track of the tokens later on.

1.3.3 Using Flex lexical analyser

Flex is used for tokenizing especially because it is much more powerful than writing an implementation C or any other language. For instance, to write a basic lexical analyser in C (which can identify identifiers, operators and about 10 keywords) can be found on this site . This tokenizer is over 160 lines of code, is inflexible and has no lookahead (unless we manually implement it through loops).

In contrast, here is a tokenizer that does the same thing:

letter	 [a-zA-Z]
digit	 [0-9]
letter_or_digit [a-zA-Z0-9]
white_space	 [ \t\n]
other 		.
%%
{white_space}+ 		;
{letter}{letter_or_digit}* 	return 1;
{digit}+					 return 2;
{other}					 return 3;


%%
int main() {
 int lextoken;
 while (lextoken = yylex())
	printf("%d - %s\n",lextoken, yytext);
}
int yywrap()
{
return 1;
}

Flex is much more compact because it uses 'regular expressions' (popularly shortened to regex). Using the combination of rules given to generate regex in flex, we can define even complex tokens such as floating point numbers in the IEEE standard, which can be represented in the following way:

/* float exponent */ EXP ([Ee][-+]?[0-9]+)

//and then using the definition of EXP

/* decimal float */ ([0-9]*.[0-9]+|[0-9]+.){EXP}?[flFL]? [0-9]+{EXP}[flFL]?

The above expression can handle all kinds of float input, like 12 (no digits after decimal), 5.1 , 6.23142 etc.
As we mentioned above, we have to parse comments so that we can choose to keep them or ignore them. Even this is possible to do in one line:

/\ *([^**]|*+[^/ \*])*\ *+/

Fortunately for us our source language is mostly C/C++ code so we do not need to worry about such special inputs, they have multiple implementations and we need to just re-implement the ones we need.

Moreover, Flex allows us to use our own custom C code inside the program, so it offers all the customisability of C without any downsides. The generated executable C program is stored in lex.yy.c .

The program can be executed with instructions:

 lex filename.l
 gcc lex.yy.c
 ./a.out

As mentioned before, we cannot just take the tokens and store them, there is information associated with tokens such as:
- Data types and names, for variables.
- Declaration procedures.
- Details about parameters (call by value and call by reference).
- Number and type of arguments passed to functions.
- All the other keywords that we mentioned in the table above.
To store all this information, we need to have a symbol table. A symbol table can be implemented using list, linked list, binary search tree or hash table. For its speed and effectiveness, and because we will have a rather large number of entries in the table, we use the** hash table**. The symbol table is called using the symbol table routine in the code section of the above lex program.

We can use any Hash function with the Symbol table, usually the source symbol is itself used as a key for the hash function.

Any hash function can be used, so we use the one from Flex and Bison O'Reilly book:

Quoting the author:

The hash function is also quite simple: For each character, multiply the previous hash by 9 and then xor the character, doing all the arithmetic as unsigned, which ignores overflows. The lookup routine computes the symbol table entry index as the hash value modulo the size of the symbol table, which was chosen as a number with no even factors, again to mix the hash bits up.

Here is the Hash function:

/* hash a symbol */
static unsigned
symhash(char *sym)
{
unsigned int hash = 0;
unsigned c;
while(c = *sym++) hash = hash*9 ^ c;
}
return hash;
struct symbol *
lookup(char* sym)
{
struct symbol *sp = &symtab[symhash(sym)%NHASH];
int scount = NHASH;
/* how many have we looked at */
while(--scount >= 0) {
if(sp->name && !strcmp(sp->name, sym)) return sp;
if(!sp->name) {
sp->name = strdup(sym);
sp->reflist = 0;
return sp;
}
/* new entry */
if(++sp >= symtab+NHASH) sp = symtab; /* try the next entry */
}
fputs("symbol table overflow\n", stderr);
abort(); /* tried them all, table is full */
}

The** lookup** routine takes a string and returns the address of the table entry corresponding to that name, and it creates one if it doesn't exist already. Since the hashing function is being applied on the symbol itself, strdup is called to save a copy of the input string (symbol) itself so that it can be entered in the right place in the symbol table entry.

Once we have defined the expressions in flex to handle all possible input and populated the symbol table with all the keywords and tokens, and verified that the Lexer is giving the right output, the Lexical analysis part is done.

We can and should point out some errors at this stage itself. flex will throw an error message along with the line number if the input does not make sense (such as entering a number like 12.3.1) . However we are limited to only reporting basic grammatical error at this stage because the lexical analyser can only handle 1 token at a time. So if we have an input like:

for if 1

or

if while int

will still evaluate to true because we are still not capable of capturing context or the true 'meaning' of these tokens in a statement.

Moreover something like:

while (digitalRead(buttonPin) == HIGH) {
    calibrate(); 
  }
  // signal the end of the calibration period
  digitalWrite(indicatorLedPin, LOW);

Cannot be evaluated properly because regular expressions cannot handle balancing tokens like parenthesis properly. For that, we will have to analyse the syntax, so we move on to the syntax analysis phase.

1.3.4 Syntax Analysis

The basic purpose of parsing is to group the tokens we generated in the lexical analysis phase and then checking them against the rules of the language. things like defining the scope, typing and the way the tokens are put together are done here. Ultimately, the goal of the syntax analysis phase is to generate a parse tree.

To illustrate this point, we take a snippet from ForLoopIteration example. It can be found in File-->Examples-->Control-->ForLoopIteration

for (int thisPin = 2; thisPin < 8; thisPin++) { 
    // turn the pin on:
    digitalWrite(thisPin, HIGH);   
    delay(timer);                  
    // turn the pin off:
    digitalWrite(thisPin, LOW);    
  }

A psuedocode representation of the output of parsing the** for loop** should look something like this:

STATEMENTLIST - Statement list, this statement: | KEYWORD - Assign to word, keyword is: | | ASSIGN - Assign to variable at offset 2, expression is: | | | LESSTHAN - lesser than, lhs is: thisPin | | INCREMENT - Increment operator, lhs is: thisPin STATEMENTLIST - Statement list, continued. . .

The graphical representation of the syntax tree generated by the above code block would be:

A few points have to be kept in mind while designing our syntax analyser to generate a parse tree.

The rules of the grammar must be such that only one such parse tree is possible for any given input. If the maker of the original Arduino code does not take care while writing the code, the translation may result in unpredictable behaviour. When this happens we say the grammar is ambiguous. As mentioned before, some of the ambiguity can be removed by adding one character lookahead so that we don't get errors. Without lookahead it is possible that:

1234+ 12 may result in an error because integer tokens should have no signs like +, - after the token, but if our parser can 'look ahead' by 2 tokens, it will realise that the term is just a part of the larger expression.

It is possible to parse both from the right and left. Since Left to Right parsers are more common, and when combined with Lookahead are sufficiently powerful for our purposes, we will use a LALR parser. Generalised Look Ahead (GLR) parsers are even more powerful, but they increase code complexity by quite a bit and LALR will suffice for our needs.
The parsing can be done from the root node onwards, considering the alternatives until it reaches the child nodes, or we could start 'bottom up'. Although top-down parsers have their own advantages when we right a parser by hand, we are using Bison, which has support for bottom - up parsing. Bottom Up parsing also has the advantage of being more powerful than top down parsers. So we use a bottom up approach.

Generating good error messages

Atleast in the initial stages of transpiler development, the support for various library functions will not be complete. Even after all of the core Arduino keywords are integrated, all of the sensors have their own libraries and each of those libraries have their own functions.

Here is a sample code for DHT11 sensor, which is a common temperature and humidity sensor used with the Arduino:

#include **<dht.h>

dht DHT;

#define DHT11_PIN 7

void setup(){
  Serial.begin(9600);
}

void loop()
{
  int chk = DHT.read11(DHT11_PIN);
  Serial.print("Temperature = ");
  Serial.println(DHT.temperature);
  Serial.print("Humidity = ");
  Serial.println(DHT.humidity);
  delay(1000);
}

In the given example, dht.h , DHT.temperature, DHT.humidity etc will not be recognised properly by the transpiler because these are not a part of the core language reference. We need to parse through these examples and point out all the errors at once, so that the programmer can edit those parts with the corresponding MicroPython library and generate a working program.
```
  for (int thisPin = 2; thisPin < 8; thisPin++) 
  	    // turn the pin on:
  	    digitalWrite(thisPin, HIGH);   
  	    delay(timer);                  
  	    // turn the pin off:
  	    digitalWrite(thisPin, LOW);    
  	  }
```
In this code snippet given above, the ** { ** is absent on the first line. But the error may not be reported until the last line. It is tempting to try and code a fix for this into the transpiler, but we should avoid designing a compiler to fix such bugs as it may introduce new problems into the code. Our best bet is to report the line number accurately and hope that the programmer can fix it themselves.
```
  void yyerror(char *s)
  {
  printf("%s on line %d\n",s,yylineno);
  }
```
Fortunately, Bison comes inbuilt with a yyerror function and yylineno to handle errors and to note the line number. Otherwise we would have to include code to count line numbers and report errors ourselves. However we still have the option of generating custom error messages for particular kinds of errors.

1.3.5 Using Bison for Parsing

Just like with Flex, it is possible to write the whole parser in C or any other programming language. But writing Bottom up LALR(1) parser is far from trivial. Given that we are trying to parse a big chunk of C/C++, it makes sense to use Bison. Handwritten parsers may give ambiguous output if not designed properly, this is not a problem with Bison since it simply will not parse an ambiguous grammar. This saves us a lot of pain while debugging our transpiler later.

A Bison program usually has 3 parts, just like a lex program.

%{ C Declarations %} Bison Declarations %% Grammar Rules %% Additional C Code

In the C declarations part, we can call the yylex() function that we generated earlier. we can call yyerror() also. So Bison part of the transpiler will get tokenized input, this approach makes our transpiler modular. It is also possible to write the yylex() function inside the bison program itself, but this will make our code harder to debug. So we call yylex() from the Bison program.

The GNU Bison manual is a great resource and I will be making use of it while writing the parser.

We will name terminal symbols with C like Identifiers to make them easier to remember. symbols like + - * / ; , etc. The Syntax for this is:
```
  stmt: RETURN expr ';' ;
  //for the return character to punctuate rules
```
statements like x+1, x+4211 etc are equivalent, but we need to take care that constants are recognised as such, so tokens have two components:
- Type: INTEGER, IDENTIFIER etc.
- Semantic value: the value, name etc.
Finally, we rules. Just parsing the input is not enough. It has to produce output based on input if we have defined rules for that particular input. If we allow two floating point numbers to be added in Arduino for example, we can use:

expr: expr '+" expr ($$ = $1 +$3;} ;

The Appel Modern Compiler in C book, popularly known as the tiger book has a great set of examples on Flex and Bison based calculators.

The error recovery problem that we mentioned earlier is easy to resolve. The Bison language reserves the word error. yyerror() can be used to make the program wait, but to make sure that it continues parsing (so that we get a list of all potential errors at once like normal compilers).
```
  line:
    '\n'
  | exp '\n'   { printf ("\t%.10g\n", $1); }
  | error '\n' { yyerrok;                  }
  ;
```

The code snippet above does this. yyerror() is still called eventually, so that we can display our relevant error messages;

Finally when we have defined all the rules for our language and the parser is built, we can turn our attention to the output of the program. Just like how the lexer gave us the yylex.c file as an output, Bison gives a parser implementation file that can be used to implement the output. This file is called yyparse().

Generation of a Parse tree

After the language has been parsed, we need to construct a parse tree like the one we saw in figure 3. This is not very difficult to do. We need to judge the number of unique 'nodes' in our expression and just modify the code to incorporate the tree datastructure by defining the nodes. Then the expressions, instead of being evaluated will generate a tree.

To generate the tree, first we have to design it We need to specify all the different node types and what each node should contain. The node for int will have size, type, name while a token like semicolon (;) does not have that many details.

It might be possible to perform some simplification here. For instance the MicroPython code for BMP180 (Temperature and Barometric pressure sensor) on an ESP32 is:

  from bmp180 import BMP180
  from machine import I2C, Pin                        \# create an I2C bus object accordingly to the port you are using
  \#bus = I2C(1, baudrate=100000)           \# on pyboard
  bus =  I2C(scl=Pin(22), sda=Pin(21), freq=100000)   # on esp8266
  bmp180 = BMP180(bus)
  bmp180.oversample_sett = 2
  bmp180.baseline = 101325
   
  temp = bmp180.temperature
  p = bmp180.pressure
  altitude = bmp180.altitude
  print("Temperature: ",temp)
  print("Pressure: ",p)
  print("Altitude: ",altitude)

The corresponding Arduino code for BMP180 to measure temperature and pressure is:

\#include <Wire.h> //Including wire library

\#include <SFE_BMP180.h> //Including BMP180 library

\#define ALTITUDE 35.6 //Altitude where I live (change this to your altitude)

SFE_BMP180 pressure; //Creating an object

void setup() {
  Serial.begin(9600); //Starting serial communication

  Serial.println("Program started");

  if (pressure.begin()) //If initialization was successful, continue
    Serial.println("BMP180 init success");
  else //Else, stop code forever
  {
    Serial.println("BMP180 init fail");
    while (1);
  }
}

void loop() {
  char status;
  double T, P, p0; //Creating variables for temp, pressure and relative pressure

  Serial.print("You provided altitude: ");
  Serial.print(ALTITUDE, 0);
  Serial.println(" meters");

  status = pressure.startTemperature();
  if (status != 0) {
    delay(status);

    status = pressure.getTemperature(T);
    if (status != 0) {
      Serial.print("Temp: ");
      Serial.print(T, 1);
      Serial.println(" deg C");

      status = pressure.startPressure(3);

      if (status != 0) {
        delay(status);

        status = pressure.getPressure(P, T);
        if (status != 0) {
          Serial.print("Pressure measurement: ");
          Serial.print(P);
          Serial.println(" hPa (Pressure measured using temperature)");

          p0 = pressure.sealevel(P, ALTITUDE);
          Serial.print("Relative (sea-level) pressure: ");
          Serial.print(p0);
          Serial.println("hPa");
        }
      }
    }
  }  
  delay(1000);
}

The first thing we can see is even after removing a lot of the comments and syntactic sugar is that the MicroPython code is much shorter. Infact it does not even have a setup() or loop() method unlike all Arduino programs where it is mandatory. As we work with the Portenta board running MicroPython, we can conclude what does not need to be translated and just ignore those parts of the input file. This will simplify the abstract syntax tree and make code generation easier.
After we produce the abstract syntax tree, we are near the end. The AST contains only the important parts of the source language that we need to translate into the target language. Unlike parse trees of which there can be only one for any given grammar, we can have different types of ASTs for the same grammar depending upon the level of optimisation we are doing to the source code. A sample AST for the Syntax tree we generated in section 1.3.5 is given below:

1.3.6 Semantic Analysis and Target Code

The semantic analysis phase is the last phase that is concerned with the source language (Arduino). We now have an Abstract Syntax tree produced by the Syntax analyser, converting the AST into the source language is a simple matter of 'flattening' the tree and translating the Arduino code into MicroPython statement by statement. Since we are compiling from source to source, our back end is rather simple compared to LLVM which has millions of lines of code, we don't really have to worry about machine specific details either, because we are not generating machine (or in Arduino's case, Intel Hex format code).

In the semantic analysis phase we have to take care of context sensitive analysis. Modern high level languages (and Arduino is no exception) have statements that are ambiguous on their own, but are perfectly clear once the program is read as a whole. I gave an example of this in 1.1 with C++ having potentially ambiguous syntax. There is even a Wikipedia article about the Most Vexing Parse in C++ .

The Abstract Syntax Tree generated in the previous phase is missing some key details that are necessary for it to translate well to MicroPython. Most of these details concern the management of names and types. A lot of the details are stored in the symbol table we created earlier and we can use that symbol table here to annotate the AST. We will consider them one by one. After all these steps are performed, we have an annotated syntax tree which can then be 'walked' to get the language.

Type and Type Checking: C++ (Arduino) is a strongly typed language which means it places strict limits upon how much memory and resources are allocated at compile time itself. MicroPython is also built for microcontrollers, and has strong typing itself. Arduino also allows safe type conversions, as does MicroPython.

*** Type Information:** typedef and structures allow custom datatypes and it can get potentially troublesome to store data about them and their handling, fortunately, they are not used too often in microcontrollers with limited memory and processing power. Same goes for multidimensional arrays.

*** Type Checking:** Python allows for dynamic typing, allowing variables to be declared and type checking happens during runtime. This can be troublesome. However we can get around this problem since we are translating existing Arduino code to MicroPython. Since the sensors and data coming from them is unchanged, it is reasonable to assume the data types used in the Arduino code can be used in the resulting Python code too.

To preserve the type information, we label the nodes of the AST with type information.

In the above example variable i is promoted to real because b and a are real.

Type Rules: For microcontroller applications where date and time are involved (like the DS1307 RTC module which keeps track of time) and we need to measure something every x minutes, hours or days, we need to determine the rules for adding and subtracting units of time.
Scope information: In Arduino {} are used to denote scope of a variable, function or loop. In MicroPython, we use indentation instead of braces. The compiler has to store tokens corresponding to the beginning and end of blocks and then generate appropriate tab spacing to solve this issue.

Target Code generation

Most compilers at the end of the semantic analysis phase convert the annotated AST to a IR (Intermediate representation) language. The IR is then converted to machine code. We use linear substitution to generate MicroPython code from the annotated AST. Our annotated AST contains tokens, their details, type, function etc. There are not many resources on transpiler design, but most sources suggest that a 1:1 substitution from source language elements to target language elements should be enough.

1.3.7 Testing and creating examples for Portenta

After the transpiler design is completed, the next step is to 'debug' the generated code using the Portenta board. The end goal is to produce code that can compile with as few modifications as possible. I will test out the Portenta with the following sensors that I have on hand:

Sensor	Usage
BMP280	To measure temperature and pressure
DHT11	To measure relative humidity and temperature
RTC	To measure time
HCSR04	Ultrasonic Sensor
IR Sensor	To detect objects using IR LED and Photoresistor
PIR Sensor	Passive IR, to detect objects
HC05	Bluetooth
ESP01	Wifi
HC12	Long Range Radio Module
433 Mhz Transciever	Short Range Radio module
MPU6050	Accelerometer and Gyroscope IMU
MPU9250	Accelerometer, Gyroscope,Magnetometer IMU
MQ5 and MQ3	Alcohol, Butane, Smoke etc
Rainfall Sensor	To detect water droplets
Soil Moisture Sensor	Soil Humidity measurement
Flow Rate sensor To measure flow rate of water
KY039	To measure Pulse rate
AD8232	To measure ECG of the heart
Myoware Sensor	To measure contraction of muscles
Neo 6M	GPS Module
2.8 inch TFT LCD	Screen
16x2 LCD	LCD Display module
0.96 inch OLED	Small OLED Screen

Note: Even ESP libraries do not exist for some of these sensors, and the library code may require a lot of rewriting, I will attempt to make as many sensors work with the portenta as possible.

II. Schedule of Deliverables

0. Before Community Bonding

The time period from April 1- May 4 is the pre - community bonding phase.
Objective is to push a good number of Pull Requests to the Arduino GitHub repository.
Getting familiar with Arduino's processes and codebase.
Take a course on Compiler design. I went through 4 textbooks (Des Watson, O'Reilly, the 'Dragon Book' and the 'Tiger Book' to write this proposal. I have taken compiler design as a course this semester at my college, yet to be safe it would be good to take one of the MOOCs from Coursera or EdX, and go through MIT OCW notes.

1. Community Bonding Period

1.1 Week 1:

Contact the Arduino developers, and try to get in touch with the developers who wrote the Arduino Language reference.
Get in touch with my project mentor, introduce myself, workout a good time to communicate (taking into account timezone differences and their schedule).
Ask the mentor for any suggestions/changes to the project. The project proposal reflects my knowledge as of writing, there may be essential amendments neccessary and it would be better to make them early.

1.2 Week 2:

Go through the Arduino code base again. Make notes of which tokens: - datatypes, identifiers, punctuation marks, keywords etc are used most often. Some of them may be functions defined only inside certain libraries, pay special attention to those.
Go through all the example programs given in the IDE, the inbuilt examples are what we'll be using as a template for a basic Arduino program, so it is essential to get familiar with them.
Get in touch with other GSoCer's , both within the Arduino community and outside.

1.3 Week 3:

MicroPython is the target language for compilation and to implement the compiler properly, I have to know the source language and target language equally well. So going through the MicroPython documentation is essential.
At present the ESP32 is among the only popular boards to have widespread Micropython usage. It also runs Arduino style code from the Arduino IDE, so it makes for good comparison between MicroPython code and C/C++ style code. I will go through the code for various sensors (the ones mentioned above in this proposal, since I have them at my disposal to test).
Go through the MicroPython code base on GitHub.

1.4 Week 4:

GNU/GCC is among the most popular compilers for C++. Although the code base has millions of lines and it is impossible to comprehend all of it, I will take a look at it to understand the idiosycracies of compiler design.
Start coding the lexer. Even though this is before the official coding period starts, I'd like to start working on the lexical analyser a bit in advance so that I can iron out any problems in it. The success of the Bison Parser is very much dependent on the lexer working correctly, otherwise it will become extremely difficult to debug in the later stages.
Work on creating a comprehensive suite of test inputs for the lexer, to check all corner cases and verify that the lexer is indeed working correctly.

2. Phase 1

2.1 Week 5:

*** Deliverable:** The lexer by the end of week 5 (first week of the coding period) , should be able to recognise basic tokens such as words, punctuation marks, numbers, comments etc.

This is not very hard to do and since lexical analysers for C code have been implemented before, they can be used as a reference.
Give the mentors an update on my progress so far and take their feedback.

2.2 Week 6:

**Deliverable: **Symbol table.
Deliverable: Generating appropriate error messages for incorrect lex input.
The lexical analyser has a symbol table that stores the keywords and the meaning of different tokens, this symbol table will be used later by the parser and during the creation of the annotated abstract syntax tree. The flex lexical analyser has to report incorrect lexemes.

2.3 Week 7:

Deliverable: Working basic parser in Bison.
The parser we are implementing is a LALR(1) parser. This feature will be implemented during this time period.
This is also the week before the phase 1 evaluations, so getting in contact with the mentors to discuss any possible improvements before the week 1 evaluation proceeds.

2.4 Week 8:

Deliverable: Parser rules for datatypes: array, strings, basic types should be defined.
Deliverable: Defining rules for punctuation and special symbols like escape, semicolon, equal to, plus, minus, multiply, divide etc.

3. Phase 2

3.1 Week 9:

Deliverable: Syntax Tree.
The rest of the week is to test the lexer and parser thoroughly before moving on to the AST and code generation phase.

3.2 Week 10:

Deliverable: Create the Abstract Syntax tree from the syntax tree, by stripping out unneccessary tokens and nodes.
Deliverable: The error() and yyerror() functions should return proper error messages.

3.3 Week 11:

Deliverable: Create annotated AST by using the symbol table we built earlier, the annotated AST should have details about Type information, scope of the variable etc.

3.4 Week 12:

Deliverable: Create the symbol table/list corresponding commands of MicroPython with respect to Arduino. The AST will be traversed to generate the target language.
Deliverable: The basic code generator function must be ready.

4. Phase 3

4.1 Week 13:

Deliverable: Optimisation of the generated code (removing uneccessary code, anomalies).
Since MicroPython seems to include very few features by default, it may be 'mandatory' to import some libraries (much like Wire.h is for many Arduino programs).

4.2 Week 14:

Deliverable: Library example code for 5-10 sensors.
We will use the transpiler we built to create some code for libraries, based on Portenta Hardware and the output, further optimisation will almost certainly be needed.

4.3 Week 15:

Buffer week for unexpected delays.

4.4 Week 16:

Buffer week for unexpected delays, final submissions for GSoC 2020.

III. Development Experience

3.1 Github:

GSoc Specific Pull Requests and code will be on: https://github.com/AshutoshPandey123456/GSocArduino2020

Workout Sleeve: https://github.com/AshutoshPandey123456/WorkoutSleeve

This was a hack which clinched the top spot at the Smart India Hackathon - 2019, hardware edition. SIH is the world's biggest hackathon with over 2,80,000 students participating every year.

The product is a 'sleeve' that can be worn over the arm and it tracks unconventional exercises such as situps, pushups and calculate the accuracy of each exercise against a mean to inform the person if they are performing the exercise right. It uses a Raspberry Pi Zero W, MPU9250 and other sensors combined with Machine learning (HMM Algorithm and Decision trees.)

The result of the hackathon can be verified here: https://www.sih.gov.in/SIH2019HardwareGrandFinaleResult . Serial number 91 is my team.

HealthHack: https://github.com/AshutoshPandey123456/HealthHack

A next gen healthkit that reads ECG, EMG, Temperature, Pressure, Blood Alcohol Conc and takes Thermal Images to diagnose medical conditions anywhere, anytime.

In India, more than 50% of people don't reach a hospital on time when they get a heart attack. More than 60% of our ambulances are ill equipped to diagnose conditions, which means the patients vital parameters are anyone's guess until they reach the hospital. Moreover, with the advent of Motorcycle ambulances to deal with traffic, this has worsened since they carry no diagnostic tools. What is the need of the hour is a cheap, portable, easy to use solution for their diagnostics needs. Patients who are stabilised before being transported have a 8X higher chance of survival.

We built such a system and a companion app to enable quick and easy diagnostics anytime, anywhere.

Machine Learning algorithms to automatically identify atrial fibrillation and other conditions are currently under development.

Hardware Used: Arduino UNO - To interface with the sensors Arduino Nano - To measure the environment variables and the drive the ILI9341 display AD8232 - To measure ECG Advancer Myoware V3 - To measure EMG BMP180 - Measure harometric pressure and temperature Neo 6M - GPS DS18B20 - To measure body temperature Raspberry Pi Model 4B - To interface with both the arduinos and the drive the 5" LCD Display KY039 - To measure pulse

Hardware Assembly video: https://www.youtube.com/watch?v=qbvS8T_uvMI

Extended Reality: https://github.com/AshutoshPandey123456/ExtendedReality

Inspired by Pranav Mistry's sixth sense project. Uses Raspberry Pi, Raspicam, Machine Learning, OCR and CV to make a gesture controlled computer.

Demonstration video: https://www.youtube.com/watch?v=8u3bYXxyju4

Prosthetic Hand: https://github.com/AshutoshPandey123456/ProstheticHand

This project is a prototype for an affordable, modular, myoelectric prosthetic arm. It was one of the entries for Dassault Systemes Aakruti Design contest 2019 and came among the top 3 in the south zone regionals in India among 5000+ teams. It is made out 3D printed ABS plastic and uses TPU material for joints. A detailed description of the project and its features can be found in the DIY report of this repository and the Powerpoint Presentation. All the CAD files are in SolidWorks and can be used to 3D print the prosthesis in any 3D printer.

The electronics component for myoelectric control is still under development, and the repository will be updated when it is done.

LabVIEW Motion Capture: https://github.com/AshutoshPandey123456/LabVIEW-Motion-Capture

Using Arduino, MPU6050, Flex Resistors, LCD and LabVIEW GUI to create a motion capture wearable device. Won Jury's Choice award in Rakathon 2018. Rakuten is one of the biggest Japanese conglomerate companies around.

FacePixer: https://github.com/AshutoshPandey123456/Facepixer

Converts sketches into photorealistic images. Potential uses include generating photographs from sketches for criminals, sketching out a building and generating a photorealistic image from that etc.

Used CUHK Dataset to train Pix2Pix networks for the generation of images. GAN based off https://phillipi.github.io/pix2pix/

###3.2 University Courses:

I am a Computer Science and Engineering undergraduate student currently in my 3rd year. My college (Dayananda Sagar College of Engineering) offers the following courses that I have taken relevant to the proposal. The course codes are given on the left:

17CS3DCMUP : Microprocessors and Microcontrollers, in the 3rd semester.
**17CS4DCAFL: **Automata theory and formal languages, this forms the basis of compiler design. taken in the 4th semester.
17CS5DEIOT: Internet of things elective, currently taking this course.
**17CS5DCSSA: **System software applications, it is about using lex and yacc to write system software. Currently taking this course.

###3.3 Experience of working with Arduino

In addition to the GitHub projects mentioned above , and the university courses, I have worked with the Arduino platform multiple times:

Official Arduino Foundations Certification: i took the official Arduino foundations certification offered by Arduino. I cleared it successfully in my first attempt with a score of 81/100 (with 70 being the qualifying score).

The certificate can be verified here: https://create.arduino.cc/edu/courses/local/certification/cert.php?id=1c8bf393-4042-4a22-ba95-9e31a4ab315a

Element14 NanoRama Giveaway winner: I was one of the 5 people to win an Arduino Nano 33 BLE Sense from Element14 hardware community. I have been involved with hardware communities like Element14, Hackster and Circuitdigest for a while now. The announcement can be viewed here: https://www.element14.com/community/docs/DOC-94623?et=notification.mention#comment-253811
SAE AeroDesign Advanced Class DAS I was a member of Team Arcis, the official Aero Design team of my college for two years. I developed a 'Data Acquisition System' for the Advanced class aircraft in my second year. The system consisted of an Arduino Nano, HC12, MPU6050, MPL3115A2 sensor and 3 IR Sensors that could transmit the data of the fixed wing UAV aircraft to the ground station. The team ranked #2 in the Technical presentation in the world. The results of the team for year 2019 can be viewed here: https://www.saeaerodesign.com/content/2019_advancedClassResults-AW.pdf

IV. Other Experiences

Element14 Summer code club: I was one of the 10 individuals in the world, and the only person from India to get a chance to establish a BBC Micro: Bit Summer Code club. I got a set of 30 Micro: Bit microcontrollers from element14 to teach young students (grades 6th to 8th) to code. I established a code club in Dayananda Sagar International School, Bengaluru.

The announcement of the 10 winning individuals from across the world can be viewed here: https://www.element14.com/community/docs/DOC-92809/l/microbit-summer-code-club-challenge-winners-announced

The projects built by the young students are documented here on my Element14 blog: https://www.element14.com/community/people/ashutosh_pandey/blog

Grand Prize winner, summer code club challenge, Element14: placed 1st among the 10 individuals who were selected for the summer code club. The students I taught were awarded with 10 Raspberry Pi complete kits: https://www.element14.com/community/docs/DOC-93778/l/microbit-summer-code-club-the-winning-teams
Raspberry Pi 4B Roadtest: One of the 20 individuals to win a Raspberry Pi 4B to roadtest. I wrote a detailed review complete with benchmarking which can be found here: https://www.element14.com/community/roadTestReviews/3188/l/RoadTest-the-Raspberry-Pi-4-Model-B-(2GB)
Others: I write about science and technology on my personal blog: https://sapience2017.wordpress.com/ . I recently won the state regionals of the TCS TechBytes quiz. I am the head of the Internet of things Innovation lab of my college. I recieved a scholarship from Udacity to study Edge AI.

V.Why this Project?

I chose Arduino because I am fascinated by electronics and microcontrollers in general. Knowing designing, electronics and coding together allows a person to be able to build whatever they can concievably think of! I am very grateful to the Arduino community for bringing around the maker movement, without it, I couldn't dream of knowing as much electronics as I do now.

Within the Arduino organisation, I chose this project for the following reasons:

Impact: By contributing to a few library codes, I could make an impact. By writing a tool, I can potentially impact many more libraries.
The Portenta Board: The Portenta board with its limitless expansion capabilities, powerful processing and ability to run AI on the edge interests me greatly, I am interested in non-avr architectures (like the Arduinos that are based on the SAMD21 chip) so i find the Portenta board very intriguing. GSoc offers me an opportunity to work with this board so why not take it!
Compiler Design: Compiler design is one of the hardest courses a Computer Science undergraduate can take, and I want to take the challenge head on.
A chance to be a part of an open source commmunity: Most of my projects are based around hackathons. I have not had the opportunity to work in a large scale open source organisation, and this is an opportunity to get introduced to open source along with mentorship. I will be contributing to Arduino for a long time to come.

VI. Do you have any commitments during the GSoC Period?

None whatsoever. My University will be closed for the entire period except for the last week of August, by which time GSoC will be in its conclusion phase.

VII. Personal Details

Name: Ashutosh Pandey

Email: ashutoshpandey123456@gmail.com

LinkedIn: https://www.linkedin.com/in/ashupdsce/

GitHub: https://github.com/AshutoshPandey123456

Mobile Number: +91 8971190672.

Files

proposal.md

Latest commit

History