Mercurial > hg > CbC > CbC_llvm
comparison docs/tutorial/OCamlLangImpl3.rst @ 31:d22a1cf4041c
merge with the LLVM_original
author | Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> |
---|---|
date | Thu, 12 Dec 2013 14:37:49 +0900 |
parents | 95c75e76d11b |
children | afa8332a0e37 |
comparison
equal
deleted
inserted
replaced
2:4bc3e1cd2659 | 31:d22a1cf4041c |
---|---|
1 ======================================== | |
2 Kaleidoscope: Code generation to LLVM IR | |
3 ======================================== | |
4 | |
5 .. contents:: | |
6 :local: | |
7 | |
8 Chapter 3 Introduction | |
9 ====================== | |
10 | |
11 Welcome to Chapter 3 of the "`Implementing a language with | |
12 LLVM <index.html>`_" tutorial. This chapter shows you how to transform | |
13 the `Abstract Syntax Tree <OCamlLangImpl2.html>`_, built in Chapter 2, | |
14 into LLVM IR. This will teach you a little bit about how LLVM does | |
15 things, as well as demonstrate how easy it is to use. It's much more | |
16 work to build a lexer and parser than it is to generate LLVM IR code. :) | |
17 | |
18 **Please note**: the code in this chapter and later require LLVM 2.3 or | |
19 LLVM SVN to work. LLVM 2.2 and before will not work with it. | |
20 | |
21 Code Generation Setup | |
22 ===================== | |
23 | |
24 In order to generate LLVM IR, we want some simple setup to get started. | |
25 First we define virtual code generation (codegen) methods in each AST | |
26 class: | |
27 | |
28 .. code-block:: ocaml | |
29 | |
30 let rec codegen_expr = function | |
31 | Ast.Number n -> ... | |
32 | Ast.Variable name -> ... | |
33 | |
34 The ``Codegen.codegen_expr`` function says to emit IR for that AST node | |
35 along with all the things it depends on, and they all return an LLVM | |
36 Value object. "Value" is the class used to represent a "`Static Single | |
37 Assignment | |
38 (SSA) <http://en.wikipedia.org/wiki/Static_single_assignment_form>`_ | |
39 register" or "SSA value" in LLVM. The most distinct aspect of SSA values | |
40 is that their value is computed as the related instruction executes, and | |
41 it does not get a new value until (and if) the instruction re-executes. | |
42 In other words, there is no way to "change" an SSA value. For more | |
43 information, please read up on `Static Single | |
44 Assignment <http://en.wikipedia.org/wiki/Static_single_assignment_form>`_ | |
45 - the concepts are really quite natural once you grok them. | |
46 | |
47 The second thing we want is an "Error" exception like we used for the | |
48 parser, which will be used to report errors found during code generation | |
49 (for example, use of an undeclared parameter): | |
50 | |
51 .. code-block:: ocaml | |
52 | |
53 exception Error of string | |
54 | |
55 let context = global_context () | |
56 let the_module = create_module context "my cool jit" | |
57 let builder = builder context | |
58 let named_values:(string, llvalue) Hashtbl.t = Hashtbl.create 10 | |
59 let double_type = double_type context | |
60 | |
61 The static variables will be used during code generation. | |
62 ``Codgen.the_module`` is the LLVM construct that contains all of the | |
63 functions and global variables in a chunk of code. In many ways, it is | |
64 the top-level structure that the LLVM IR uses to contain code. | |
65 | |
66 The ``Codegen.builder`` object is a helper object that makes it easy to | |
67 generate LLVM instructions. Instances of the | |
68 ```IRBuilder`` <http://llvm.org/doxygen/IRBuilder_8h-source.html>`_ | |
69 class keep track of the current place to insert instructions and has | |
70 methods to create new instructions. | |
71 | |
72 The ``Codegen.named_values`` map keeps track of which values are defined | |
73 in the current scope and what their LLVM representation is. (In other | |
74 words, it is a symbol table for the code). In this form of Kaleidoscope, | |
75 the only things that can be referenced are function parameters. As such, | |
76 function parameters will be in this map when generating code for their | |
77 function body. | |
78 | |
79 With these basics in place, we can start talking about how to generate | |
80 code for each expression. Note that this assumes that the | |
81 ``Codgen.builder`` has been set up to generate code *into* something. | |
82 For now, we'll assume that this has already been done, and we'll just | |
83 use it to emit code. | |
84 | |
85 Expression Code Generation | |
86 ========================== | |
87 | |
88 Generating LLVM code for expression nodes is very straightforward: less | |
89 than 30 lines of commented code for all four of our expression nodes. | |
90 First we'll do numeric literals: | |
91 | |
92 .. code-block:: ocaml | |
93 | |
94 | Ast.Number n -> const_float double_type n | |
95 | |
96 In the LLVM IR, numeric constants are represented with the | |
97 ``ConstantFP`` class, which holds the numeric value in an ``APFloat`` | |
98 internally (``APFloat`` has the capability of holding floating point | |
99 constants of Arbitrary Precision). This code basically just creates | |
100 and returns a ``ConstantFP``. Note that in the LLVM IR that constants | |
101 are all uniqued together and shared. For this reason, the API uses "the | |
102 foo::get(..)" idiom instead of "new foo(..)" or "foo::Create(..)". | |
103 | |
104 .. code-block:: ocaml | |
105 | |
106 | Ast.Variable name -> | |
107 (try Hashtbl.find named_values name with | |
108 | Not_found -> raise (Error "unknown variable name")) | |
109 | |
110 References to variables are also quite simple using LLVM. In the simple | |
111 version of Kaleidoscope, we assume that the variable has already been | |
112 emitted somewhere and its value is available. In practice, the only | |
113 values that can be in the ``Codegen.named_values`` map are function | |
114 arguments. This code simply checks to see that the specified name is in | |
115 the map (if not, an unknown variable is being referenced) and returns | |
116 the value for it. In future chapters, we'll add support for `loop | |
117 induction variables <LangImpl5.html#for>`_ in the symbol table, and for | |
118 `local variables <LangImpl7.html#localvars>`_. | |
119 | |
120 .. code-block:: ocaml | |
121 | |
122 | Ast.Binary (op, lhs, rhs) -> | |
123 let lhs_val = codegen_expr lhs in | |
124 let rhs_val = codegen_expr rhs in | |
125 begin | |
126 match op with | |
127 | '+' -> build_fadd lhs_val rhs_val "addtmp" builder | |
128 | '-' -> build_fsub lhs_val rhs_val "subtmp" builder | |
129 | '*' -> build_fmul lhs_val rhs_val "multmp" builder | |
130 | '<' -> | |
131 (* Convert bool 0/1 to double 0.0 or 1.0 *) | |
132 let i = build_fcmp Fcmp.Ult lhs_val rhs_val "cmptmp" builder in | |
133 build_uitofp i double_type "booltmp" builder | |
134 | _ -> raise (Error "invalid binary operator") | |
135 end | |
136 | |
137 Binary operators start to get more interesting. The basic idea here is | |
138 that we recursively emit code for the left-hand side of the expression, | |
139 then the right-hand side, then we compute the result of the binary | |
140 expression. In this code, we do a simple switch on the opcode to create | |
141 the right LLVM instruction. | |
142 | |
143 In the example above, the LLVM builder class is starting to show its | |
144 value. IRBuilder knows where to insert the newly created instruction, | |
145 all you have to do is specify what instruction to create (e.g. with | |
146 ``Llvm.create_add``), which operands to use (``lhs`` and ``rhs`` here) | |
147 and optionally provide a name for the generated instruction. | |
148 | |
149 One nice thing about LLVM is that the name is just a hint. For instance, | |
150 if the code above emits multiple "addtmp" variables, LLVM will | |
151 automatically provide each one with an increasing, unique numeric | |
152 suffix. Local value names for instructions are purely optional, but it | |
153 makes it much easier to read the IR dumps. | |
154 | |
155 `LLVM instructions <../LangRef.html#instref>`_ are constrained by strict | |
156 rules: for example, the Left and Right operators of an `add | |
157 instruction <../LangRef.html#i_add>`_ must have the same type, and the | |
158 result type of the add must match the operand types. Because all values | |
159 in Kaleidoscope are doubles, this makes for very simple code for add, | |
160 sub and mul. | |
161 | |
162 On the other hand, LLVM specifies that the `fcmp | |
163 instruction <../LangRef.html#i_fcmp>`_ always returns an 'i1' value (a | |
164 one bit integer). The problem with this is that Kaleidoscope wants the | |
165 value to be a 0.0 or 1.0 value. In order to get these semantics, we | |
166 combine the fcmp instruction with a `uitofp | |
167 instruction <../LangRef.html#i_uitofp>`_. This instruction converts its | |
168 input integer into a floating point value by treating the input as an | |
169 unsigned value. In contrast, if we used the `sitofp | |
170 instruction <../LangRef.html#i_sitofp>`_, the Kaleidoscope '<' operator | |
171 would return 0.0 and -1.0, depending on the input value. | |
172 | |
173 .. code-block:: ocaml | |
174 | |
175 | Ast.Call (callee, args) -> | |
176 (* Look up the name in the module table. *) | |
177 let callee = | |
178 match lookup_function callee the_module with | |
179 | Some callee -> callee | |
180 | None -> raise (Error "unknown function referenced") | |
181 in | |
182 let params = params callee in | |
183 | |
184 (* If argument mismatch error. *) | |
185 if Array.length params == Array.length args then () else | |
186 raise (Error "incorrect # arguments passed"); | |
187 let args = Array.map codegen_expr args in | |
188 build_call callee args "calltmp" builder | |
189 | |
190 Code generation for function calls is quite straightforward with LLVM. | |
191 The code above initially does a function name lookup in the LLVM | |
192 Module's symbol table. Recall that the LLVM Module is the container that | |
193 holds all of the functions we are JIT'ing. By giving each function the | |
194 same name as what the user specifies, we can use the LLVM symbol table | |
195 to resolve function names for us. | |
196 | |
197 Once we have the function to call, we recursively codegen each argument | |
198 that is to be passed in, and create an LLVM `call | |
199 instruction <../LangRef.html#i_call>`_. Note that LLVM uses the native C | |
200 calling conventions by default, allowing these calls to also call into | |
201 standard library functions like "sin" and "cos", with no additional | |
202 effort. | |
203 | |
204 This wraps up our handling of the four basic expressions that we have so | |
205 far in Kaleidoscope. Feel free to go in and add some more. For example, | |
206 by browsing the `LLVM language reference <../LangRef.html>`_ you'll find | |
207 several other interesting instructions that are really easy to plug into | |
208 our basic framework. | |
209 | |
210 Function Code Generation | |
211 ======================== | |
212 | |
213 Code generation for prototypes and functions must handle a number of | |
214 details, which make their code less beautiful than expression code | |
215 generation, but allows us to illustrate some important points. First, | |
216 lets talk about code generation for prototypes: they are used both for | |
217 function bodies and external function declarations. The code starts | |
218 with: | |
219 | |
220 .. code-block:: ocaml | |
221 | |
222 let codegen_proto = function | |
223 | Ast.Prototype (name, args) -> | |
224 (* Make the function type: double(double,double) etc. *) | |
225 let doubles = Array.make (Array.length args) double_type in | |
226 let ft = function_type double_type doubles in | |
227 let f = | |
228 match lookup_function name the_module with | |
229 | |
230 This code packs a lot of power into a few lines. Note first that this | |
231 function returns a "Function\*" instead of a "Value\*" (although at the | |
232 moment they both are modeled by ``llvalue`` in ocaml). Because a | |
233 "prototype" really talks about the external interface for a function | |
234 (not the value computed by an expression), it makes sense for it to | |
235 return the LLVM Function it corresponds to when codegen'd. | |
236 | |
237 The call to ``Llvm.function_type`` creates the ``Llvm.llvalue`` that | |
238 should be used for a given Prototype. Since all function arguments in | |
239 Kaleidoscope are of type double, the first line creates a vector of "N" | |
240 LLVM double types. It then uses the ``Llvm.function_type`` method to | |
241 create a function type that takes "N" doubles as arguments, returns one | |
242 double as a result, and that is not vararg (that uses the function | |
243 ``Llvm.var_arg_function_type``). Note that Types in LLVM are uniqued | |
244 just like ``Constant``'s are, so you don't "new" a type, you "get" it. | |
245 | |
246 The final line above checks if the function has already been defined in | |
247 ``Codegen.the_module``. If not, we will create it. | |
248 | |
249 .. code-block:: ocaml | |
250 | |
251 | None -> declare_function name ft the_module | |
252 | |
253 This indicates the type and name to use, as well as which module to | |
254 insert into. By default we assume a function has | |
255 ``Llvm.Linkage.ExternalLinkage``. "`external | |
256 linkage <LangRef.html#linkage>`_" means that the function may be defined | |
257 outside the current module and/or that it is callable by functions | |
258 outside the module. The "``name``" passed in is the name the user | |
259 specified: this name is registered in "``Codegen.the_module``"s symbol | |
260 table, which is used by the function call code above. | |
261 | |
262 In Kaleidoscope, I choose to allow redefinitions of functions in two | |
263 cases: first, we want to allow 'extern'ing a function more than once, as | |
264 long as the prototypes for the externs match (since all arguments have | |
265 the same type, we just have to check that the number of arguments | |
266 match). Second, we want to allow 'extern'ing a function and then | |
267 defining a body for it. This is useful when defining mutually recursive | |
268 functions. | |
269 | |
270 .. code-block:: ocaml | |
271 | |
272 (* If 'f' conflicted, there was already something named 'name'. If it | |
273 * has a body, don't allow redefinition or reextern. *) | |
274 | Some f -> | |
275 (* If 'f' already has a body, reject this. *) | |
276 if Array.length (basic_blocks f) == 0 then () else | |
277 raise (Error "redefinition of function"); | |
278 | |
279 (* If 'f' took a different number of arguments, reject. *) | |
280 if Array.length (params f) == Array.length args then () else | |
281 raise (Error "redefinition of function with different # args"); | |
282 f | |
283 in | |
284 | |
285 In order to verify the logic above, we first check to see if the | |
286 pre-existing function is "empty". In this case, empty means that it has | |
287 no basic blocks in it, which means it has no body. If it has no body, it | |
288 is a forward declaration. Since we don't allow anything after a full | |
289 definition of the function, the code rejects this case. If the previous | |
290 reference to a function was an 'extern', we simply verify that the | |
291 number of arguments for that definition and this one match up. If not, | |
292 we emit an error. | |
293 | |
294 .. code-block:: ocaml | |
295 | |
296 (* Set names for all arguments. *) | |
297 Array.iteri (fun i a -> | |
298 let n = args.(i) in | |
299 set_value_name n a; | |
300 Hashtbl.add named_values n a; | |
301 ) (params f); | |
302 f | |
303 | |
304 The last bit of code for prototypes loops over all of the arguments in | |
305 the function, setting the name of the LLVM Argument objects to match, | |
306 and registering the arguments in the ``Codegen.named_values`` map for | |
307 future use by the ``Ast.Variable`` variant. Once this is set up, it | |
308 returns the Function object to the caller. Note that we don't check for | |
309 conflicting argument names here (e.g. "extern foo(a b a)"). Doing so | |
310 would be very straight-forward with the mechanics we have already used | |
311 above. | |
312 | |
313 .. code-block:: ocaml | |
314 | |
315 let codegen_func = function | |
316 | Ast.Function (proto, body) -> | |
317 Hashtbl.clear named_values; | |
318 let the_function = codegen_proto proto in | |
319 | |
320 Code generation for function definitions starts out simply enough: we | |
321 just codegen the prototype (Proto) and verify that it is ok. We then | |
322 clear out the ``Codegen.named_values`` map to make sure that there isn't | |
323 anything in it from the last function we compiled. Code generation of | |
324 the prototype ensures that there is an LLVM Function object that is | |
325 ready to go for us. | |
326 | |
327 .. code-block:: ocaml | |
328 | |
329 (* Create a new basic block to start insertion into. *) | |
330 let bb = append_block context "entry" the_function in | |
331 position_at_end bb builder; | |
332 | |
333 try | |
334 let ret_val = codegen_expr body in | |
335 | |
336 Now we get to the point where the ``Codegen.builder`` is set up. The | |
337 first line creates a new `basic | |
338 block <http://en.wikipedia.org/wiki/Basic_block>`_ (named "entry"), | |
339 which is inserted into ``the_function``. The second line then tells the | |
340 builder that new instructions should be inserted into the end of the new | |
341 basic block. Basic blocks in LLVM are an important part of functions | |
342 that define the `Control Flow | |
343 Graph <http://en.wikipedia.org/wiki/Control_flow_graph>`_. Since we | |
344 don't have any control flow, our functions will only contain one block | |
345 at this point. We'll fix this in `Chapter 5 <OCamlLangImpl5.html>`_ :). | |
346 | |
347 .. code-block:: ocaml | |
348 | |
349 let ret_val = codegen_expr body in | |
350 | |
351 (* Finish off the function. *) | |
352 let _ = build_ret ret_val builder in | |
353 | |
354 (* Validate the generated code, checking for consistency. *) | |
355 Llvm_analysis.assert_valid_function the_function; | |
356 | |
357 the_function | |
358 | |
359 Once the insertion point is set up, we call the ``Codegen.codegen_func`` | |
360 method for the root expression of the function. If no error happens, | |
361 this emits code to compute the expression into the entry block and | |
362 returns the value that was computed. Assuming no error, we then create | |
363 an LLVM `ret instruction <../LangRef.html#i_ret>`_, which completes the | |
364 function. Once the function is built, we call | |
365 ``Llvm_analysis.assert_valid_function``, which is provided by LLVM. This | |
366 function does a variety of consistency checks on the generated code, to | |
367 determine if our compiler is doing everything right. Using this is | |
368 important: it can catch a lot of bugs. Once the function is finished and | |
369 validated, we return it. | |
370 | |
371 .. code-block:: ocaml | |
372 | |
373 with e -> | |
374 delete_function the_function; | |
375 raise e | |
376 | |
377 The only piece left here is handling of the error case. For simplicity, | |
378 we handle this by merely deleting the function we produced with the | |
379 ``Llvm.delete_function`` method. This allows the user to redefine a | |
380 function that they incorrectly typed in before: if we didn't delete it, | |
381 it would live in the symbol table, with a body, preventing future | |
382 redefinition. | |
383 | |
384 This code does have a bug, though. Since the ``Codegen.codegen_proto`` | |
385 can return a previously defined forward declaration, our code can | |
386 actually delete a forward declaration. There are a number of ways to fix | |
387 this bug, see what you can come up with! Here is a testcase: | |
388 | |
389 :: | |
390 | |
391 extern foo(a b); # ok, defines foo. | |
392 def foo(a b) c; # error, 'c' is invalid. | |
393 def bar() foo(1, 2); # error, unknown function "foo" | |
394 | |
395 Driver Changes and Closing Thoughts | |
396 =================================== | |
397 | |
398 For now, code generation to LLVM doesn't really get us much, except that | |
399 we can look at the pretty IR calls. The sample code inserts calls to | |
400 Codegen into the "``Toplevel.main_loop``", and then dumps out the LLVM | |
401 IR. This gives a nice way to look at the LLVM IR for simple functions. | |
402 For example: | |
403 | |
404 :: | |
405 | |
406 ready> 4+5; | |
407 Read top-level expression: | |
408 define double @""() { | |
409 entry: | |
410 %addtmp = fadd double 4.000000e+00, 5.000000e+00 | |
411 ret double %addtmp | |
412 } | |
413 | |
414 Note how the parser turns the top-level expression into anonymous | |
415 functions for us. This will be handy when we add `JIT | |
416 support <OCamlLangImpl4.html#jit>`_ in the next chapter. Also note that | |
417 the code is very literally transcribed, no optimizations are being | |
418 performed. We will `add | |
419 optimizations <OCamlLangImpl4.html#trivialconstfold>`_ explicitly in the | |
420 next chapter. | |
421 | |
422 :: | |
423 | |
424 ready> def foo(a b) a*a + 2*a*b + b*b; | |
425 Read function definition: | |
426 define double @foo(double %a, double %b) { | |
427 entry: | |
428 %multmp = fmul double %a, %a | |
429 %multmp1 = fmul double 2.000000e+00, %a | |
430 %multmp2 = fmul double %multmp1, %b | |
431 %addtmp = fadd double %multmp, %multmp2 | |
432 %multmp3 = fmul double %b, %b | |
433 %addtmp4 = fadd double %addtmp, %multmp3 | |
434 ret double %addtmp4 | |
435 } | |
436 | |
437 This shows some simple arithmetic. Notice the striking similarity to the | |
438 LLVM builder calls that we use to create the instructions. | |
439 | |
440 :: | |
441 | |
442 ready> def bar(a) foo(a, 4.0) + bar(31337); | |
443 Read function definition: | |
444 define double @bar(double %a) { | |
445 entry: | |
446 %calltmp = call double @foo(double %a, double 4.000000e+00) | |
447 %calltmp1 = call double @bar(double 3.133700e+04) | |
448 %addtmp = fadd double %calltmp, %calltmp1 | |
449 ret double %addtmp | |
450 } | |
451 | |
452 This shows some function calls. Note that this function will take a long | |
453 time to execute if you call it. In the future we'll add conditional | |
454 control flow to actually make recursion useful :). | |
455 | |
456 :: | |
457 | |
458 ready> extern cos(x); | |
459 Read extern: | |
460 declare double @cos(double) | |
461 | |
462 ready> cos(1.234); | |
463 Read top-level expression: | |
464 define double @""() { | |
465 entry: | |
466 %calltmp = call double @cos(double 1.234000e+00) | |
467 ret double %calltmp | |
468 } | |
469 | |
470 This shows an extern for the libm "cos" function, and a call to it. | |
471 | |
472 :: | |
473 | |
474 ready> ^D | |
475 ; ModuleID = 'my cool jit' | |
476 | |
477 define double @""() { | |
478 entry: | |
479 %addtmp = fadd double 4.000000e+00, 5.000000e+00 | |
480 ret double %addtmp | |
481 } | |
482 | |
483 define double @foo(double %a, double %b) { | |
484 entry: | |
485 %multmp = fmul double %a, %a | |
486 %multmp1 = fmul double 2.000000e+00, %a | |
487 %multmp2 = fmul double %multmp1, %b | |
488 %addtmp = fadd double %multmp, %multmp2 | |
489 %multmp3 = fmul double %b, %b | |
490 %addtmp4 = fadd double %addtmp, %multmp3 | |
491 ret double %addtmp4 | |
492 } | |
493 | |
494 define double @bar(double %a) { | |
495 entry: | |
496 %calltmp = call double @foo(double %a, double 4.000000e+00) | |
497 %calltmp1 = call double @bar(double 3.133700e+04) | |
498 %addtmp = fadd double %calltmp, %calltmp1 | |
499 ret double %addtmp | |
500 } | |
501 | |
502 declare double @cos(double) | |
503 | |
504 define double @""() { | |
505 entry: | |
506 %calltmp = call double @cos(double 1.234000e+00) | |
507 ret double %calltmp | |
508 } | |
509 | |
510 When you quit the current demo, it dumps out the IR for the entire | |
511 module generated. Here you can see the big picture with all the | |
512 functions referencing each other. | |
513 | |
514 This wraps up the third chapter of the Kaleidoscope tutorial. Up next, | |
515 we'll describe how to `add JIT codegen and optimizer | |
516 support <OCamlLangImpl4.html>`_ to this so we can actually start running | |
517 code! | |
518 | |
519 Full Code Listing | |
520 ================= | |
521 | |
522 Here is the complete code listing for our running example, enhanced with | |
523 the LLVM code generator. Because this uses the LLVM libraries, we need | |
524 to link them in. To do this, we use the | |
525 `llvm-config <http://llvm.org/cmds/llvm-config.html>`_ tool to inform | |
526 our makefile/command line about which options to use: | |
527 | |
528 .. code-block:: bash | |
529 | |
530 # Compile | |
531 ocamlbuild toy.byte | |
532 # Run | |
533 ./toy.byte | |
534 | |
535 Here is the code: | |
536 | |
537 \_tags: | |
538 :: | |
539 | |
540 <{lexer,parser}.ml>: use_camlp4, pp(camlp4of) | |
541 <*.{byte,native}>: g++, use_llvm, use_llvm_analysis | |
542 | |
543 myocamlbuild.ml: | |
544 .. code-block:: ocaml | |
545 | |
546 open Ocamlbuild_plugin;; | |
547 | |
548 ocaml_lib ~extern:true "llvm";; | |
549 ocaml_lib ~extern:true "llvm_analysis";; | |
550 | |
551 flag ["link"; "ocaml"; "g++"] (S[A"-cc"; A"g++"]);; | |
552 | |
553 token.ml: | |
554 .. code-block:: ocaml | |
555 | |
556 (*===----------------------------------------------------------------------=== | |
557 * Lexer Tokens | |
558 *===----------------------------------------------------------------------===*) | |
559 | |
560 (* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of | |
561 * these others for known things. *) | |
562 type token = | |
563 (* commands *) | |
564 | Def | Extern | |
565 | |
566 (* primary *) | |
567 | Ident of string | Number of float | |
568 | |
569 (* unknown *) | |
570 | Kwd of char | |
571 | |
572 lexer.ml: | |
573 .. code-block:: ocaml | |
574 | |
575 (*===----------------------------------------------------------------------=== | |
576 * Lexer | |
577 *===----------------------------------------------------------------------===*) | |
578 | |
579 let rec lex = parser | |
580 (* Skip any whitespace. *) | |
581 | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream | |
582 | |
583 (* identifier: [a-zA-Z][a-zA-Z0-9] *) | |
584 | [< ' ('A' .. 'Z' | 'a' .. 'z' as c); stream >] -> | |
585 let buffer = Buffer.create 1 in | |
586 Buffer.add_char buffer c; | |
587 lex_ident buffer stream | |
588 | |
589 (* number: [0-9.]+ *) | |
590 | [< ' ('0' .. '9' as c); stream >] -> | |
591 let buffer = Buffer.create 1 in | |
592 Buffer.add_char buffer c; | |
593 lex_number buffer stream | |
594 | |
595 (* Comment until end of line. *) | |
596 | [< ' ('#'); stream >] -> | |
597 lex_comment stream | |
598 | |
599 (* Otherwise, just return the character as its ascii value. *) | |
600 | [< 'c; stream >] -> | |
601 [< 'Token.Kwd c; lex stream >] | |
602 | |
603 (* end of stream. *) | |
604 | [< >] -> [< >] | |
605 | |
606 and lex_number buffer = parser | |
607 | [< ' ('0' .. '9' | '.' as c); stream >] -> | |
608 Buffer.add_char buffer c; | |
609 lex_number buffer stream | |
610 | [< stream=lex >] -> | |
611 [< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >] | |
612 | |
613 and lex_ident buffer = parser | |
614 | [< ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream >] -> | |
615 Buffer.add_char buffer c; | |
616 lex_ident buffer stream | |
617 | [< stream=lex >] -> | |
618 match Buffer.contents buffer with | |
619 | "def" -> [< 'Token.Def; stream >] | |
620 | "extern" -> [< 'Token.Extern; stream >] | |
621 | id -> [< 'Token.Ident id; stream >] | |
622 | |
623 and lex_comment = parser | |
624 | [< ' ('\n'); stream=lex >] -> stream | |
625 | [< 'c; e=lex_comment >] -> e | |
626 | [< >] -> [< >] | |
627 | |
628 ast.ml: | |
629 .. code-block:: ocaml | |
630 | |
631 (*===----------------------------------------------------------------------=== | |
632 * Abstract Syntax Tree (aka Parse Tree) | |
633 *===----------------------------------------------------------------------===*) | |
634 | |
635 (* expr - Base type for all expression nodes. *) | |
636 type expr = | |
637 (* variant for numeric literals like "1.0". *) | |
638 | Number of float | |
639 | |
640 (* variant for referencing a variable, like "a". *) | |
641 | Variable of string | |
642 | |
643 (* variant for a binary operator. *) | |
644 | Binary of char * expr * expr | |
645 | |
646 (* variant for function calls. *) | |
647 | Call of string * expr array | |
648 | |
649 (* proto - This type represents the "prototype" for a function, which captures | |
650 * its name, and its argument names (thus implicitly the number of arguments the | |
651 * function takes). *) | |
652 type proto = Prototype of string * string array | |
653 | |
654 (* func - This type represents a function definition itself. *) | |
655 type func = Function of proto * expr | |
656 | |
657 parser.ml: | |
658 .. code-block:: ocaml | |
659 | |
660 (*===---------------------------------------------------------------------=== | |
661 * Parser | |
662 *===---------------------------------------------------------------------===*) | |
663 | |
664 (* binop_precedence - This holds the precedence for each binary operator that is | |
665 * defined *) | |
666 let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10 | |
667 | |
668 (* precedence - Get the precedence of the pending binary operator token. *) | |
669 let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1 | |
670 | |
671 (* primary | |
672 * ::= identifier | |
673 * ::= numberexpr | |
674 * ::= parenexpr *) | |
675 let rec parse_primary = parser | |
676 (* numberexpr ::= number *) | |
677 | [< 'Token.Number n >] -> Ast.Number n | |
678 | |
679 (* parenexpr ::= '(' expression ')' *) | |
680 | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e | |
681 | |
682 (* identifierexpr | |
683 * ::= identifier | |
684 * ::= identifier '(' argumentexpr ')' *) | |
685 | [< 'Token.Ident id; stream >] -> | |
686 let rec parse_args accumulator = parser | |
687 | [< e=parse_expr; stream >] -> | |
688 begin parser | |
689 | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e | |
690 | [< >] -> e :: accumulator | |
691 end stream | |
692 | [< >] -> accumulator | |
693 in | |
694 let rec parse_ident id = parser | |
695 (* Call. *) | |
696 | [< 'Token.Kwd '('; | |
697 args=parse_args []; | |
698 'Token.Kwd ')' ?? "expected ')'">] -> | |
699 Ast.Call (id, Array.of_list (List.rev args)) | |
700 | |
701 (* Simple variable ref. *) | |
702 | [< >] -> Ast.Variable id | |
703 in | |
704 parse_ident id stream | |
705 | |
706 | [< >] -> raise (Stream.Error "unknown token when expecting an expression.") | |
707 | |
708 (* binoprhs | |
709 * ::= ('+' primary)* *) | |
710 and parse_bin_rhs expr_prec lhs stream = | |
711 match Stream.peek stream with | |
712 (* If this is a binop, find its precedence. *) | |
713 | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c -> | |
714 let token_prec = precedence c in | |
715 | |
716 (* If this is a binop that binds at least as tightly as the current binop, | |
717 * consume it, otherwise we are done. *) | |
718 if token_prec < expr_prec then lhs else begin | |
719 (* Eat the binop. *) | |
720 Stream.junk stream; | |
721 | |
722 (* Parse the primary expression after the binary operator. *) | |
723 let rhs = parse_primary stream in | |
724 | |
725 (* Okay, we know this is a binop. *) | |
726 let rhs = | |
727 match Stream.peek stream with | |
728 | Some (Token.Kwd c2) -> | |
729 (* If BinOp binds less tightly with rhs than the operator after | |
730 * rhs, let the pending operator take rhs as its lhs. *) | |
731 let next_prec = precedence c2 in | |
732 if token_prec < next_prec | |
733 then parse_bin_rhs (token_prec + 1) rhs stream | |
734 else rhs | |
735 | _ -> rhs | |
736 in | |
737 | |
738 (* Merge lhs/rhs. *) | |
739 let lhs = Ast.Binary (c, lhs, rhs) in | |
740 parse_bin_rhs expr_prec lhs stream | |
741 end | |
742 | _ -> lhs | |
743 | |
744 (* expression | |
745 * ::= primary binoprhs *) | |
746 and parse_expr = parser | |
747 | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream | |
748 | |
749 (* prototype | |
750 * ::= id '(' id* ')' *) | |
751 let parse_prototype = | |
752 let rec parse_args accumulator = parser | |
753 | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e | |
754 | [< >] -> accumulator | |
755 in | |
756 | |
757 parser | |
758 | [< 'Token.Ident id; | |
759 'Token.Kwd '(' ?? "expected '(' in prototype"; | |
760 args=parse_args []; | |
761 'Token.Kwd ')' ?? "expected ')' in prototype" >] -> | |
762 (* success. *) | |
763 Ast.Prototype (id, Array.of_list (List.rev args)) | |
764 | |
765 | [< >] -> | |
766 raise (Stream.Error "expected function name in prototype") | |
767 | |
768 (* definition ::= 'def' prototype expression *) | |
769 let parse_definition = parser | |
770 | [< 'Token.Def; p=parse_prototype; e=parse_expr >] -> | |
771 Ast.Function (p, e) | |
772 | |
773 (* toplevelexpr ::= expression *) | |
774 let parse_toplevel = parser | |
775 | [< e=parse_expr >] -> | |
776 (* Make an anonymous proto. *) | |
777 Ast.Function (Ast.Prototype ("", [||]), e) | |
778 | |
779 (* external ::= 'extern' prototype *) | |
780 let parse_extern = parser | |
781 | [< 'Token.Extern; e=parse_prototype >] -> e | |
782 | |
783 codegen.ml: | |
784 .. code-block:: ocaml | |
785 | |
786 (*===----------------------------------------------------------------------=== | |
787 * Code Generation | |
788 *===----------------------------------------------------------------------===*) | |
789 | |
790 open Llvm | |
791 | |
792 exception Error of string | |
793 | |
794 let context = global_context () | |
795 let the_module = create_module context "my cool jit" | |
796 let builder = builder context | |
797 let named_values:(string, llvalue) Hashtbl.t = Hashtbl.create 10 | |
798 let double_type = double_type context | |
799 | |
800 let rec codegen_expr = function | |
801 | Ast.Number n -> const_float double_type n | |
802 | Ast.Variable name -> | |
803 (try Hashtbl.find named_values name with | |
804 | Not_found -> raise (Error "unknown variable name")) | |
805 | Ast.Binary (op, lhs, rhs) -> | |
806 let lhs_val = codegen_expr lhs in | |
807 let rhs_val = codegen_expr rhs in | |
808 begin | |
809 match op with | |
810 | '+' -> build_add lhs_val rhs_val "addtmp" builder | |
811 | '-' -> build_sub lhs_val rhs_val "subtmp" builder | |
812 | '*' -> build_mul lhs_val rhs_val "multmp" builder | |
813 | '<' -> | |
814 (* Convert bool 0/1 to double 0.0 or 1.0 *) | |
815 let i = build_fcmp Fcmp.Ult lhs_val rhs_val "cmptmp" builder in | |
816 build_uitofp i double_type "booltmp" builder | |
817 | _ -> raise (Error "invalid binary operator") | |
818 end | |
819 | Ast.Call (callee, args) -> | |
820 (* Look up the name in the module table. *) | |
821 let callee = | |
822 match lookup_function callee the_module with | |
823 | Some callee -> callee | |
824 | None -> raise (Error "unknown function referenced") | |
825 in | |
826 let params = params callee in | |
827 | |
828 (* If argument mismatch error. *) | |
829 if Array.length params == Array.length args then () else | |
830 raise (Error "incorrect # arguments passed"); | |
831 let args = Array.map codegen_expr args in | |
832 build_call callee args "calltmp" builder | |
833 | |
834 let codegen_proto = function | |
835 | Ast.Prototype (name, args) -> | |
836 (* Make the function type: double(double,double) etc. *) | |
837 let doubles = Array.make (Array.length args) double_type in | |
838 let ft = function_type double_type doubles in | |
839 let f = | |
840 match lookup_function name the_module with | |
841 | None -> declare_function name ft the_module | |
842 | |
843 (* If 'f' conflicted, there was already something named 'name'. If it | |
844 * has a body, don't allow redefinition or reextern. *) | |
845 | Some f -> | |
846 (* If 'f' already has a body, reject this. *) | |
847 if block_begin f <> At_end f then | |
848 raise (Error "redefinition of function"); | |
849 | |
850 (* If 'f' took a different number of arguments, reject. *) | |
851 if element_type (type_of f) <> ft then | |
852 raise (Error "redefinition of function with different # args"); | |
853 f | |
854 in | |
855 | |
856 (* Set names for all arguments. *) | |
857 Array.iteri (fun i a -> | |
858 let n = args.(i) in | |
859 set_value_name n a; | |
860 Hashtbl.add named_values n a; | |
861 ) (params f); | |
862 f | |
863 | |
864 let codegen_func = function | |
865 | Ast.Function (proto, body) -> | |
866 Hashtbl.clear named_values; | |
867 let the_function = codegen_proto proto in | |
868 | |
869 (* Create a new basic block to start insertion into. *) | |
870 let bb = append_block context "entry" the_function in | |
871 position_at_end bb builder; | |
872 | |
873 try | |
874 let ret_val = codegen_expr body in | |
875 | |
876 (* Finish off the function. *) | |
877 let _ = build_ret ret_val builder in | |
878 | |
879 (* Validate the generated code, checking for consistency. *) | |
880 Llvm_analysis.assert_valid_function the_function; | |
881 | |
882 the_function | |
883 with e -> | |
884 delete_function the_function; | |
885 raise e | |
886 | |
887 toplevel.ml: | |
888 .. code-block:: ocaml | |
889 | |
890 (*===----------------------------------------------------------------------=== | |
891 * Top-Level parsing and JIT Driver | |
892 *===----------------------------------------------------------------------===*) | |
893 | |
894 open Llvm | |
895 | |
896 (* top ::= definition | external | expression | ';' *) | |
897 let rec main_loop stream = | |
898 match Stream.peek stream with | |
899 | None -> () | |
900 | |
901 (* ignore top-level semicolons. *) | |
902 | Some (Token.Kwd ';') -> | |
903 Stream.junk stream; | |
904 main_loop stream | |
905 | |
906 | Some token -> | |
907 begin | |
908 try match token with | |
909 | Token.Def -> | |
910 let e = Parser.parse_definition stream in | |
911 print_endline "parsed a function definition."; | |
912 dump_value (Codegen.codegen_func e); | |
913 | Token.Extern -> | |
914 let e = Parser.parse_extern stream in | |
915 print_endline "parsed an extern."; | |
916 dump_value (Codegen.codegen_proto e); | |
917 | _ -> | |
918 (* Evaluate a top-level expression into an anonymous function. *) | |
919 let e = Parser.parse_toplevel stream in | |
920 print_endline "parsed a top-level expr"; | |
921 dump_value (Codegen.codegen_func e); | |
922 with Stream.Error s | Codegen.Error s -> | |
923 (* Skip token for error recovery. *) | |
924 Stream.junk stream; | |
925 print_endline s; | |
926 end; | |
927 print_string "ready> "; flush stdout; | |
928 main_loop stream | |
929 | |
930 toy.ml: | |
931 .. code-block:: ocaml | |
932 | |
933 (*===----------------------------------------------------------------------=== | |
934 * Main driver code. | |
935 *===----------------------------------------------------------------------===*) | |
936 | |
937 open Llvm | |
938 | |
939 let main () = | |
940 (* Install standard binary operators. | |
941 * 1 is the lowest precedence. *) | |
942 Hashtbl.add Parser.binop_precedence '<' 10; | |
943 Hashtbl.add Parser.binop_precedence '+' 20; | |
944 Hashtbl.add Parser.binop_precedence '-' 20; | |
945 Hashtbl.add Parser.binop_precedence '*' 40; (* highest. *) | |
946 | |
947 (* Prime the first token. *) | |
948 print_string "ready> "; flush stdout; | |
949 let stream = Lexer.lex (Stream.of_channel stdin) in | |
950 | |
951 (* Run the main "interpreter loop" now. *) | |
952 Toplevel.main_loop stream; | |
953 | |
954 (* Print out all the generated code. *) | |
955 dump_module Codegen.the_module | |
956 ;; | |
957 | |
958 main () | |
959 | |
960 `Next: Adding JIT and Optimizer Support <OCamlLangImpl4.html>`_ | |
961 |