Pharo Libclang FFI, part 3, loading an AST

In the last part we learnt how to get the version string of the library. That was good for proving it basically works, and also for developing our first C type, “CXString”. Now we want Pharo to process some C code. “Baby steps with `libclang`: Walking an abstract syntax tree” provided a good introductory tutorial to using libclang but was a bit C++ oriented, which is not so suitable for Pharo’s FFI. A pure C interface is easier, so I adapted that tutorial with help from sabottenda’s libclang-sample ASTVisitor.

Doing it in C

Libclang terminology defines a “translation unit” as the basic unit of compilation, such as a single source file with its header files included. An “index” holds a set of translation units, which may end up linked into an executable or library. A “cursor” is used to traverse the AST of a translation unit. We get a CXCursor from a CXTranslationUnit, which comes from a CXIndex. The following C code provides just enough scaffolding to see how we get one of each, which provides a bridgehead for connecting Pharo to libclang. In the folder where foo.ast was created, do…
$ vi simple.c

#include <clang-c/Index.h>
#include <stdio.h>

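void show_cursor_kind(CXCursor cursor)
{
    enum CXCursorKind cursorKind  = clang_getCursorKind(cursor);
    CXString kindName  = clang_getCursorKindSpelling(cursorKind);
    CXString entityName = clang_getCursorSpelling(cursor);
    printf("%03d  %s(%s)\n",
           cursorKind, clang_getCString(kindName), clang_getCString(entityName));
    // CXStrings own their buffers, so release them once printed
    clang_disposeString(kindName);
    clang_disposeString(entityName);
}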

// todo - accept cursor callback

int main( int argc, const char *const * argv )
{   // Process arguments
    if( argc < 2 )
    {   printf("Usage: %s inputfile {clang-options}\n", argv[0] );
        return -1;
    }
    const char * mainSourceFilename = argv[1];
    const char *const* options = argv + 2;
    int optionCount = argc - 2;

    // Create the Index and Translation Units
    CXIndex index = clang_createIndex( 0, 0 );
    CXTranslationUnit TU = clang_createTranslationUnit( index, mainSourceFilename );
    if( !TU )
    {   printf("Failed to get Translation Unit\n");
        return -1;
    }

    // The root cursor of a TU is the translation unit itself
    CXCursor rootCursor  = clang_getTranslationUnitCursor( TU );
    show_cursor_kind( rootCursor );

    // todo - invoke visitor

    clang_disposeTranslationUnit( TU );
    clang_disposeIndex( index );
    return 0;
}

$ clang-3.5 -I/usr/lib/llvm-3.5/include -lclang -o simple simple.c
$ ./simple foo.ast
300=TranslationUnit (/home/ben/Apps/moose_suite_6_0/libclang-play/foo.c)

Here we see a cursor kind of 300 is the translation unit itself, and its ‘name’ is that of the main source file. If you look back at the dump-ast in part 1, you can see this was the first item.

And now in Pharo…


CXIndex provides a shared context for creating translation units. Let’s review the relevant declarations from clang-c/Index.h

typedef void *CXIndex;

CINDEX_LINKAGE CXIndex clang_createIndex(
                      int excludeDeclarationsFromPCH,
                     int displayDiagnostics );

The CINDEX_LINKAGE can be ignored. In clang-c/Platform.h it shows this is a macro that just chooses between dllexport and dllimport on Microsoft platforms.
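For anyone who hasn’t met this kind of macro before, here is a minimal sketch of the general pattern. The names MYLIB_LINKAGE, MYLIB_BUILDING and mylib_answer are made up for illustration and are not copied from clang-c/Platform.h. The point is only that on non-Microsoft compilers the macro expands to nothing, so the function signature that the FFI has to describe is unchanged.

```c
#include <assert.h>

/* Assumption for this sketch: this file plays the role of the library itself. */
#define MYLIB_BUILDING 1

/* Hypothetical linkage macro following the usual dllexport/dllimport pattern. */
#if defined(_MSC_VER)
#  if defined(MYLIB_BUILDING)
#    define MYLIB_LINKAGE __declspec(dllexport)  /* compiling the DLL */
#  else
#    define MYLIB_LINKAGE __declspec(dllimport)  /* compiling a client of the DLL */
#  endif
#else
#  define MYLIB_LINKAGE                          /* expands to nothing elsewhere */
#endif

/* The macro only decorates the declaration; return and argument types are untouched. */
MYLIB_LINKAGE int mylib_answer(int first, int second);

int mylib_answer(int first, int second)
{
    assert(first >= 0 && second >= 0);  /* keep the toy arithmetic simple */
    return first + second;
}
```

So when transcribing a declaration into an ffiCall: signature, only the return type, function name and arguments matter.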

Now we shouldn’t need to peek at the function’s implementation, but we can for an interesting insight: CIndex.cpp shows the function wraps creation of a C++ object.

extern "C"
{   CXIndex clang_createIndex(
                 int excludeDeclarationsFromPCH,
                 int displayDiagnostics)
    {   ...
        CIndexer *CIdxr = new CIndexer();
        ...
        return CIdxr;
    }
}
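The same “opaque handle” idea can be sketched in plain C, which may make it clearer what the FFI side is actually holding. Everything named Demo… or demo_… below is hypothetical and only stands in for the real CXIndex and CIndexer; the real library hides a C++ object rather than a C struct.

```c
#include <assert.h>
#include <stdlib.h>

/* Public side: callers only ever see an untyped pointer. */
typedef void *DemoIndex;   /* hypothetical stand-in for CXIndex */

/* Private side: the real object hidden behind the handle. */
struct DemoIndexer {
    int excludeDeclarationsFromPCH;
    int displayDiagnostics;
};

/* Allocates the private object and hands it out as an opaque handle. */
DemoIndex demo_createIndex(int excludeDeclarationsFromPCH, int displayDiagnostics)
{
    struct DemoIndexer *indexer = malloc(sizeof *indexer);
    if (indexer == NULL)
        return NULL;
    indexer->excludeDeclarationsFromPCH = excludeDeclarationsFromPCH;
    indexer->displayDiagnostics = displayDiagnostics;
    return indexer;   /* converting to void * hides the concrete type */
}

/* Only the library knows how to turn the handle back into its object. */
int demo_displaysDiagnostics(DemoIndex index)
{
    assert(index != NULL);
    const struct DemoIndexer *indexer = index;
    return indexer->displayDiagnostics;
}

/* Releases the private object; the handle must not be used afterwards. */
void demo_disposeIndex(DemoIndex index)
{
    free(index);
}
```

From the caller’s point of view the handle is just an address to pass back into the library, which is all the Pharo wrapper needs to store.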

The CXIndex type is again similar to an opaque object, but having seen previously that FFIOpaqueObject doesn’t like zero pointer arity in the function declaration, we look for an alternative. Its sibling class FFIExternalObject looks promising, since its class comment provides this example which looks like our case…
self ffiCall: #(MyExternalObject someExternalFunction() )

So we try…

FFIExternalObject subclass: #CXIndex
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'Libclang'

Libclang class >> clang_createIndex__excludeDeclarationsFromPCH: excludeDeclarationsFromPCH
                                      displayDiagnostics: displayDiagnostics
    ^ self ffiCall: #( CXIndex clang_createIndex (
	                         int excludeDeclarationsFromPCH,
	                         int displayDiagnostics) )

CXIndex class >> create
    ^ Libclang clang_createIndex__excludeDeclarationsFromPCH: 0
                                   displayDiagnostics: 0.

Now in the playground the following seems to work…
(index := CXIndex create) inspect.
but proof will come when using it in the next section.


Translation units reside in an index. A single translation unit is associated at creation time directly with a single main file plus any of its indirectly included files. It is defined in clang-c/Index.h as…

typedef struct CXTranslationUnitImpl *CXTranslationUnit;

CINDEX_LINKAGE CXTranslationUnit clang_createTranslationUnit(CXIndex CIdx, const char *ast_filename);

CINDEX_LINKAGE CXString clang_getTranslationUnitSpelling(CXTranslationUnit CTUnit);

So in Pharo let’s try…

FFIExternalObject subclass: #CXTranslationUnit
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'Libclang'

Libclang class >> clang_createTranslationUnit__cxIndex: cxIndex
                                   astFilename: ast_filename
    ^ self ffiCall: #( CXTranslationUnit clang_createTranslationUnit(
                                         CXIndex cxIndex,
                                         String ast_filename) ).

Libclang class >> clang_getTranslationUnitSpelling__translationUnit: translationUnit
     ^ self ffiCall: #( CXString clang_getTranslationUnitSpelling(
                                          CXTranslationUnit translationUnit) ) 

CXIndex >> createTUFromAstFile: astFile
    ^Libclang clang_createTranslationUnit__cxIndex: self
                        astFilename: astFile pathString

CXTranslationUnit >> mainFilename
   ^ (Libclang clang_getTranslationUnitSpelling__translationUnit: self) getString.

I had a minor hiccup where it seems that the filename resolution may differ between Pharo and libclang, so we pass the absolute path string like this…

LibclangTest >> testTranslationUnitMainFilename
    | index astFile txUnit mainFileName |
    index := CXIndex create.
    astFile := '../foo.ast' asFileReference.
    self assert: astFile exists.
    txUnit := index createTUFromAstFile: astFile.
    mainFileName := txUnit mainFilename.
    self assert: (mainFileName includesSubstring: 'foo.c')
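The same precaution can be taken on the C side. As a minimal sketch, assuming a POSIX system, realpath() turns a path that is relative to the current working directory into an absolute one before it is handed to a library. The helper name demo_absolute_path is made up for illustration.

```c
#define _XOPEN_SOURCE 700   /* asks the C library headers to declare realpath() */
#include <assert.h>
#include <stdlib.h>

/* Returns a malloc'ed absolute path for an existing file or directory,
   or NULL if the path cannot be resolved. The caller frees the result. */
char *demo_absolute_path(const char *relativePath)
{
    assert(relativePath != NULL);
    return realpath(relativePath, NULL);
}
```

A relative name like '../foo.ast' only means something relative to some working directory, and two components of one process need not agree on which directory that is, so passing the resolved path removes the ambiguity.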


Traversing the AST involves a cursor, which records the kind of node it points at. The kinds of cursor are specified in clang-c/Index.h as…

enum CXCursorKind {
  CXCursor_UnexposedDecl                 = 1,
  CXCursor_StructDecl                    = 2,
  CXCursor_UnionDecl                     = 3,
  ...
  CXCursor_FunctionDecl                  = 8,
  ...
  CXCursor_TranslationUnit               = 300,
  ...
};

In clang-c/Index.h you can see that CXCursorKind is a massive enumeration, which first needs to be massaged like this…
$ cat CXCursorKind.raw | grep = | sed 's/=/ /' | sed 's/,//' > CursorKind.clean
plus further manual cleanup of some leftover comments and substitution of some right-hand-side identifiers. Then we need to copy that into Pharo like this…

FFIExternalEnumeration subclass: #CXCursorKind
	instanceVariableNames: ''
	classVariableNames: ''
	package: 'Libclang'

CXCursorKind class >> enumDecl
	^#(   "copy/paste to here from file CursorKind.clean"   )

CXCursorKind class >> initialize
	"self initialize"
	self initializeEnumeration

Don’t forget to evaluate that comment to initialize CXCursorKind!


Now we can use that cursor kind definition to define the cursor type. Here is its C definition…

typedef struct {
  enum CXCursorKind kind;
  int xdata;
  const void *data[3];
} CXCursor;

CXCursor clang_getTranslationUnitCursor(CXTranslationUnit);
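Since the FFI struct description has to reproduce this layout, it can help to ask a C compiler where each field lands. Here is a minimal sketch using a hypothetical mirror of the definition above (DemoCursor, with the enum cut down to one member so it compiles without the libclang headers).

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for enum CXCursorKind; only the value used in this post. */
enum DemoCursorKind { DemoCursor_TranslationUnit = 300 };

/* Hypothetical mirror of the CXCursor struct, field for field. */
typedef struct {
    enum DemoCursorKind kind;
    int xdata;
    const void *data[3];
} DemoCursor;

/* Byte offsets and total size, as the compiler for this platform lays them out. */
size_t demo_kind_offset(void)  { return offsetof(DemoCursor, kind); }
size_t demo_xdata_offset(void) { return offsetof(DemoCursor, xdata); }
size_t demo_data_offset(void)  { return offsetof(DemoCursor, data); }
size_t demo_cursor_size(void)  { return sizeof(DemoCursor); }

/* Sanity check: the first member always sits at the start of the struct. */
void demo_check_layout(void)
{
    assert(demo_kind_offset() == 0);
    assert(demo_cursor_size() >= demo_data_offset() + 3 * sizeof(void *));
}
```

On a typical 64-bit Linux build I would expect offsets of 0, 4 and 8 and a total size of 32 bytes, but those figures depend on the platform ABI, so treat them as something to check rather than rely on.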

We see that CXCursor is a struct similar to CXString.
However a new thing is this “array of three pointers to void”.
In Pharo we can create an instance of that array type and store it in a
class variable (VoidPointer3) to reference from #fieldsDesc.

FFIExternalStructure subclass: #CXCursor
  instanceVariableNames: ''
  classVariableNames: 'VoidPointer3'
  package: 'Libclang'

CXCursor class >> initialize
    VoidPointer3 := FFITypeArray ofType: 'void*' size: 3.

CXCursor class >> fieldsDesc
    "self initialize; rebuildFieldAccessors"
    ^ #(	CXCursorKind kind;
                int xdata;
                VoidPointer3 data; )

and evaluate that comment to initialise and rebuild the CXCursor fields. Now define the callout and a test that confirms that the root cursor of a translation unit does itself point at a translation unit. I expect many tests will need to start with a root cursor, so we separate that out. We need the tests to make use of the CXCursorKind enumeration, so we need to add it as a pool dictionary…

TestCase subclass: #LibclangTest
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: 'CXCursorKind'
    package: 'Libclang'

Libclang class >> clang_getTranslationUnitCursor__translationUnit: translationUnit
    ^ self ffiCall:
    #( CXCursor clang_getClangVersion ( CXTranslationUnit translationUnit ) ) 

CXTranslationUnit >> rootCursor
    ^  Libclang clang_getTranslationUnitCursor__translationUnit: self

LibclangTest >> getRootCursor
	| index astFile txUnit |
	index := CXIndex create.
	astFile := '../foo.ast' asFileReference.
	self assert: astFile exists.
	txUnit := index createTUFromAstFile: astFile.
	^ txUnit rootCursor.

LibclangTest >> testTranslationUnitRootCursor
        | rootCursor |
        rootCursor := self getRootCursor.
        self assert: rootCursor kind equals: CXCursor_TranslationUnit.

Try it out. Whoops! Something is wrong here… “Error: Invalid value for CXCursorKind enumeration” when #kind was sent. Unless of course you’re ahead of the game and already corrected the problem. But first a confession… As I originally developed this tutorial I didn’t create tests as I went, and I got burnt! That error message made me believe the system had some problem with functions returning enumerations, and as a workaround I changed CXCursor class >> fieldsDesc to use int kind, which (after rebuilding the field accessors) seemed to fix that problem…

CXCursor class >> fieldsDesc
    "self initialize; rebuildFieldAccessors"
    ^ #(	int kind;
                int xdata;
                VoidPointer3 data; )

Except the assert still failed with “TestFailure: Got -1292534328 instead of a CXCursorKind(#CXCursor_TranslationUnit).”  That is a really weird integer. I spent an hour questioning if the index and translation unit were being created correctly, until I wrote #testTranslationUnitMainFilename to prove they were fine.  So that left #rootCursor where I discovered I was performing the wrong callout.  Here is the correct version to compare to above…

Libclang class >> clang_getTranslationUnitCursor__translationUnit: translationUnit
    ^ self ffiCall:
    #( CXCursor clang_getTranslationUnitCursor ( CXTranslationUnit translationUnit ) ) 

The assert was still failing but in an understandable way… “TestFailure: Got 300 instead of a CXCursorKind(#CXCursor_TranslationUnit)” where 300 was the output from the ./simple C program at the top of the page. I changed the assert like this…

self assert: rootCursor kind equals: CXCursor_TranslationUnit value.

and it worked! But I was a little disappointed that #value needed to be sent to the enumeration.  It would be nicer to be able to just use the enumeration directly.  Indeed it turned out this wasn’t quite right yet, but I didn’t know until the end of the next section.

Friendly display of cursors

It would be nice if our cursors could be displayed similar to how show_cursor_kind() did. So we need to implement the three methods it used.

enum CXCursorKind clang_getCursorKind(CXCursor);

can be defined like this…

Libclang class >> clang_getCursorKind__cxCursor: cxCursor
    ^ self ffiCall: #( CXCursorKind clang_getCursorKind(CXCursor cxCursor) )

Now I was going to have CXCursor>>kind call the library function, but that already existed as a generated method, and if I customised it what would happen when the accessors were regenerated? But maybe peeking at the struct internals like this…

CXCursor >> kind    ^ handle signedLongAt: 1

…isn’t good when a library API is supplied to get the value. Maybe it really would be better for CXCursor>>kind to call the Libclang method. Some kind soul might advise whether there is some pragma to prevent field accessor generation overwriting custom accessors. Anyway, for now we’ll have the test call the Libclang method directly, and just for the hell of it we’ll see what happens using the struct internals.

LibclangTest >> testCursorKind
    | rootCursor cursorKind |
    rootCursor := self getRootCursor.
    cursorKind := Libclang clang_getCursorKind__cxCursor: rootCursor.
    self assert: cursorKind equals: CXCursor_TranslationUnit.
    self assert: cursorKind equals: rootCursor kind.

Run the test. Hmmmm, that’s interesting! The first assert works but the second fails.
So cursorKind ==> a CXCursorKind(#CXCursor_TranslationUnit) is correct and rootCursor kind ==> 300 is wrong. Actually, why is that returning an Integer anyhow? Whoops! I never undid my earlier workaround of changing the type of the kind field from CXCursorKind to int. Let’s do that now and rebuild…

CXCursor class >> fieldsDesc
    "self initialize; rebuildFieldAccessors"
    ^ #(	CXCursorKind kind;
                int xdata;
                VoidPointer3 data; )

Try the test again. Hmmm… now #testTranslationUnitRootCursor fails, but actually it’s that #value being sent to the enumeration that had disappointed me. Get rid of that and #testCursorKind is successful.

So it turns out that the FFI syntax I didn’t like was forced by my error. Now that I’ve fixed it, the syntax feels very natural. The FFI developers did a good job with it. So anyway, continuing on…

CXString clang_getCursorKindSpelling(enum CXCursorKind Kind);

can be defined as…

Libclang class >> clang_getCursorKindSpelling__cursorKind: kind
    ^ self ffiCall: #( CXString clang_getCursorKindSpelling(CXCursorKind kind) )

CXCursorKind class >> ffiLibrary    ^ Libclang

CXCursorKind >> spelling
    ^ self ffiLibrary clang_getCursorKindSpelling__cursorKind: self.

LibclangTest >> testgetCursorKindSpelling
    | rootCursor kind spelling |
    rootCursor := self getRootCursor.
    kind := rootCursor kind.
    spelling := kind spelling.
    self assert: spelling getString equals: 'TranslationUnit'.

The strange dance with CXCursorKind is because the instance side of SharedPool subclasses has a problem resolving global identifiers.  I don’t know why, but this workaround works.  Now last one…

CXString clang_getCursorSpelling(CXCursor);

can be defined as…

Libclang class >> clang_getCursorSpelling__cxCursor: cxCursor
    ^ self ffiCall: #( CXString clang_getCursorSpelling(CXCursor cxCursor) )

CXCursor >> spelling
    ^ Libclang clang_getCursorSpelling__cxCursor: self

LibclangTest >> testgetCursorSpelling
    | rootCursor kind spelling |
    rootCursor := self getRootCursor.
    spelling := rootCursor spelling.
    self assert: (spelling getString includesSubstring: 'foo.c')

Persistence & Memory Management

The C code ends with the disposal of the index and translation unit.  A search of the libclang exports indicates there is no API function to dispose of cursors.  Presumably the *data pointers of the CXCursor struct point to memory allocated as part of the translation unit. So we’ll define these two callouts.

Libclang class >> clang_disposeIndex__cxIndex: index
    ^ self ffiCall: #( void clang_disposeIndex(CXIndex index) )

Libclang class >> clang_disposeTranslationUnit__cxTranslationUnit: txunit
    ^ self ffiCall: #(  void clang_disposeTranslationUnit(CXTranslationUnit txunit) )

Invalidating inter-session data and memory disposal is obviously going to be a common requirement for many classes of external object. So we’ll try using a Trait. Probably later I’ll learn there is a better way, but for now this is the simplest thing that will work, and using a Trait should make it easier to adapt in the future. We will need to define the two private methods for each class that uses the Trait.

Trait named: #TLibclangResourceManager
    uses: {}
    category: 'Libclang'

TLibclangResourceManager class >> startUp: resuming
    resuming ifTrue: [
        self allInstances do: [
            :cxs | cxs privateInvalidateSessionData ] ].

TLibclangResourceManager >> privateInvalidateSessionData
    self getHandle atAllPut: 0.

TLibclangResourceManager class >> initialize
    "self initialize."
    SessionManager default registerUserClassNamed: self name

TLibclangResourceManager >> dispose
    self isNull
        ifTrue: [ Error signal: 'Guard prevented double-dispose' ]
        ifFalse: [
             self privateDispose.
             self privateInvalidateSessionData ]

TLibclangResourceManager >> autoRelease
    self class finalizationRegistry add: self

TLibclangResourceManager >> finalize
    self dispose.
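The guard in #dispose is the same idea as nulling a pointer after freeing it in C. Here is a minimal sketch of that pattern, with hypothetical names (DemoResource, demo_acquire, demo_invalidate, demo_dispose) and malloc/free standing in for a library’s create and dispose functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrapper holding one externally allocated resource. */
typedef struct {
    void *handle;   /* NULL means already disposed, or stale from an earlier session */
} DemoResource;

/* Stands in for a library "create" call. */
DemoResource demo_acquire(void)
{
    DemoResource resource = { malloc(16) };
    return resource;
}

/* Like privateInvalidateSessionData: forget the pointer without freeing it. */
void demo_invalidate(DemoResource *resource)
{
    assert(resource != NULL);
    resource->handle = NULL;
}

/* Like dispose: returns 0 on success, -1 if the guard stops a double dispose. */
int demo_dispose(DemoResource *resource)
{
    assert(resource != NULL);
    if (resource->handle == NULL)
        return -1;
    free(resource->handle);      /* stands in for privateDispose */
    demo_invalidate(resource);   /* so a second call hits the guard above */
    return 0;
}
```

Invalidating without freeing is what makes sense after an image restart, since the old addresses belong to a process that no longer exists.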

And now to use it…

FFIExternalObject subclass: #CXIndex
    uses: TLibclangResourceManager
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'LibclangPractice'

CXIndex>> privateDispose
    Libclang clang_disposeIndex__cxIndex: self

FFIExternalObject subclass: #CXTranslationUnit
    uses: TLibclangResourceManager
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'LibclangPractice'

CXTranslationUnit >> privateDispose
    Libclang clang_disposeTranslationUnit__cxTranslationUnit: self

FFIExternalStructure subclass: #CXCursor
    uses: TLibclangResourceManager
    instanceVariableNames: ''
    classVariableNames: 'VoidPointer3'
    package: 'LibclangPractice'

CXCursor >> privateDispose
    "nothing to do"

Parting thoughts

So we’ve defined the three main components that facilitate loading a unit of translation (i.e. a C file) into the index and obtaining a cursor on the top of the AST.

The next part will define the callback from libclang into Pharo to allow us to walk the AST.
