Pharo Libclang FFI, part 2, simple callout string return

This is my first exposure to using Pharo’s FFI, so before diving in to process some AST, let’s try something simpler to gain familiarity with the library. Something really simple… no parameters and just returning a string. The function clang_getClangVersion() seems to fit the bill. First let’s see how it works in pure C.

Doing it in C

$ vi version.c

#include <clang-c/Index.h>
#include <stdio.h>

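void show_clang_version(void) {
        CXString version = clang_getClangVersion();
        printf("%s\n", clang_getCString(version));
        /* release the string once we are finished with it */
        clang_disposeString(version);
}

int main( int argc, const char *const * argv ) {
        show_clang_version();
        return 0;
}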

To compile:

  • the -lclang flag is needed to link against the libclang library;
  • the -I flag is needed to locate Index.h, whose path was determined using…

$ find /usr -name Index.h

Thus (for me) the code was compiled and run like this…
$ clang-3.5 -I/usr/lib/llvm-3.5/include -lclang -o version version.c
$ ./version

Debian clang version 3.5.0-10 (tags/RELEASE_350/final) (based on LLVM 3.5.0)

And now in Pharo

So now let’s reproduce that in Pharo. First we need to define the C library holding the functions we want to use. I don’t know whether Pharo automatically resolves library versions like -lclang does from the command line, so it seemed safer to resolve it manually like this…
$ /sbin/ldconfig -p | grep libclang
(libc6) => /usr/lib/i386-linux-gnu/

which in Pharo can be defined like this…

FFILibrary subclass: #Libclang
        instanceVariableNames: ''
        classVariableNames: ''
        package: 'Libclang'

Libclang >> unixModuleName

Libclang class >> ffiLibraryName
    ^ self

The ffiLibraryName method lets us use ffiCall: on its own, rather than having to repeat Libclang in an ffiCall:module: for every FFI call.


The function clang_getClangVersion() returns a CXString. This type needs to be defined for Pharo, so we examine its C definition in Index.h (a file we located earlier.)

/* The CXString type is used to return strings from the interface
 * when the ownership of that string might differ from one call to
 * the next. Use clang_getCString() to retrieve the string data and,
 * once finished with the string data, call clang_disposeString()
 * to free the string.
 */
typedef struct {
  const void *data;
  unsigned private_flags;
} CXString;

At first I considered CXString an opaque object, since we never access its internals, have no information on the structure of  data, and only interact with it via the libclang API clang_getCString() to get a printable string. So I guessed it might be defined like…

FFIOpaqueObject subclass: #CXString
	instanceVariableNames: ''
	classVariableNames: ''
	package: 'Libclang'

After copying the C typedef struct declaration verbatim, the FFI parser complained, since it currently doesn’t handle the keywords const and unsigned. Actually const is irrelevant to the runtime FFI interface, being used only for compile-time static validation, so we can ignore it. unsigned is meant to modify other types and using it on its own is considered poor practice, but on its own it is equivalent to unsigned int, which FFI spells uint. So the final FFI field description and function definition became…

CXString class >> fieldsDesc
    "CXString rebuildFieldAccessors"
    ^ #(
	void *data;
	uint private_flags;
    )

Libclang class >> clang_getClangVersion
    ^ self ffiCall: #( CXString clang_getClangVersion () )

CXString class >> getClangVersion
    ^ Libclang clang_getClangVersion

The comment was evaluated to initialize the field accessors, then Libclang clang_getClangVersion was evaluated, but it produced an FFIDereferencedOpaqueObjectError. The class comment of this error led to FFIOpaqueObject, whose comment says it assumes “we always access through a reference … that cannot be used dereferenced,” and we don’t have CXString * clang_getClangVersion(). Indeed, inspecting “CXString new” shows it consists of only four bytes, the size of a pointer (on my 32-bit system) and not big enough for the struct. Indeed inspecting “CXCursor new” shows 8 bytes. So the type’s class definition was changed to…

FFIExternalStructure subclass: #CXString
	instanceVariableNames: ''
	classVariableNames: ''
	package: 'Libclang'

and then after rebuilding the field accessors the FFI callout was successful…

CXString rebuildFieldAccessors.
CXString getClangVersion.
==>"CXString (
data: 	(void*)@ 16r090C9F30
private_flags: 	1
)"
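The size mismatch described above can also be checked from the C side. The following is only a sketch: MockCXString is a hypothetical stand-in that copies the two fields of the typedef so the code builds without libclang, and the helper names are mine. It compares the size of the struct with the size of a bare pointer, and checks at compile time that a bare unsigned is the same type as unsigned int. The exact sizes depend on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for CXString with the same two fields as the
 * typedef in Index.h, so this builds without libclang. */
typedef struct {
    const void *data;
    unsigned private_flags;
} MockCXString;

/* Compile-time check: a bare `unsigned` names the type `unsigned int`. */
_Static_assert(_Generic((unsigned)0, unsigned int: 1, default: 0),
               "unsigned should be the same type as unsigned int");

/* Size of the struct that the function returns by value. */
size_t struct_size(void) { return sizeof(MockCXString); }

/* Size of a pointer, which is all that a reference-style handle holds. */
size_t pointer_size(void) { return sizeof(void *); }

/* Print both sizes and check that the struct cannot fit in one pointer. */
void demo_sizes(void) {
    printf("pointer: %zu bytes, struct: %zu bytes\n",
           pointer_size(), struct_size());
    assert(struct_size() > pointer_size());
}
```

Calling demo_sizes() from a main() prints the two sizes for the machine at hand, which is the gap a by-value struct return has to cover.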

But we want to display the same version string that the C program printed earlier. For that we need to use…

CINDEX_LINKAGE const char *clang_getCString(CXString string)

and define…

Libclang class >> clang_getCString__cxString: aCXString
    ^ self ffiCall: #( String clang_getCString ( CXString aCXString ) )

CXString >> getString
    ^ Libclang clang_getCString__cxString: self.

CXString >> printOn: aStream
    self getString printOn: aStream.
    aStream cr.
    super printOn: aStream.

And now we can do…

Libclang  clang_getClangVersion  inspect.
"'Debian clang version 3.5.0-10 (tags/RELEASE_350/final) '
CXString (  data: (void*)@ 16rB26781D8
            private_flags: 1 )"

So let’s make that into a test…

TestCase subclass: #LibclangTest
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'Libclang'

LibclangTest class >> ffiLibraryName
    ^ Libclang. 

LibclangTest >> test_clang_getClangVersion
	| version |
	version := Libclang clang_getClangVersion.
	self assert: (version getString includesSubstring: 'clang')

Inter-session persistency

Persistency between sessions is one of the hallmarks of Pharo. However, the C library doesn’t behave like this. Calls like clang_getClangVersion() allocate memory on the C heap, which is thrown away when the VM stops running. That memory is not saved with the image at the end of a session. In a new session of that image, all the data pointers held in CXString instances become invalid. You can prove this yourself. In the Playground try…

version := CXString getClangVersion.
version getString.
"Save and quit image here.  Execute next line upon resuming."
version getString.

Bang! The VM crashed. Before the save, the CXString assigned to version had a valid data field. After resuming, that field was invalid, so the callout to clang_getCString() crashed on an invalid pointer. These are the vagaries of memory management that Smalltalk normally shields us from, but which we are now forced to deal with. One approach is… when resuming a frozen session we could force data to null, which getString can guard against. We could do it like this…

CXString >> invalidateSessionData
    handle atAllPut: 0. 

CXString class >> startUp: resuming
    resuming
        ifTrue: [ self allInstances do: [ :cxs | cxs invalidateSessionData ] ].

CXString class >> initialize
    "CXString initialize."
    SessionManager default registerUserClassNamed: self name

After evaluating the comment to initialize CXString with the SessionManager, the previous experiment was re-tried.  It survived!  Instead of a crash, an UndefinedObject(nil) is returned – which Pharo can deal with, but maybe an additional improvement could be…

CXString >> getString
    ^ self data isNull
         ifTrue: ['external memory invalidated by session restart']
         ifFalse:[Libclang clang_getCString__cxString: self].
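For comparison, here is the same guard written as a plain C sketch. Everything in it is an assumption made for illustration: MockCXString stands in for CXString so the code builds without libclang, and invalidate_session_data and get_string are hypothetical names mirroring the Pharo methods above. Where the real code would call clang_getCString(), the mock simply treats data as the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for CXString so the sketch builds without libclang. */
typedef struct {
    const void *data;
    unsigned private_flags;
} MockCXString;

/* Zero every byte of the struct, like `handle atAllPut: 0`.
 * On common platforms this leaves data as a null pointer. */
void invalidate_session_data(MockCXString *s) {
    memset(s, 0, sizeof *s);
}

/* Return the text, or a fixed message once data has been nulled. */
const char *get_string(const MockCXString *s) {
    if (s->data == NULL)
        return "external memory invalidated by session restart";
    /* The real code would call clang_getCString() here. */
    return (const char *)s->data;
}

/* Read, invalidate, then read again without touching a stale pointer. */
void demo_guarded_read(void) {
    MockCXString version = { "clang version (mock)", 1 };
    assert(strstr(get_string(&version), "clang") != NULL);
    invalidate_session_data(&version);
    assert(strstr(get_string(&version), "invalid") != NULL);
}
```

The point of the design is that the stale pointer is never handed to the library: the null check happens before any callout.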

plus some tests…

LibclangTest >> testCXStringReset
    | version |
    version := CXString getClangVersion.
    version invalidateSessionData.
    self assert: (version printString includesSubstring: 'invalid' ).

LibclangTest >> testGetLangVersion
    | version |
    version := CXString getClangVersion.
    self assert: (version getString includesSubstring: 'clang')

Intra-session memory management

As the use of clang_disposeString() in our C function show_clang_version() shows, the memory that clang_getClangVersion() dynamically allocates on the heap needs to be freed by calling clang_disposeString(). We want to guard against calling that twice, since freeing memory twice usually crashes a program, in this case the VM.

We can leverage what we did for inter-session persistency…

Libclang class >> clang_disposeString__cxString: aCXString
    ^self ffiCall: #( void clang_disposeString ( CXString aCXString ) )

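CXString >> dispose
    "Free the C-heap string once only; data is nulled after the first dispose"
    self data isNull
        ifTrue: [ Error signal: 'Guard prevented double-dispose' ]
        ifFalse: [
            Libclang clang_disposeString__cxString: self.
            self invalidateSessionData ]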

LibclangTest >> testCXStringDisposeTwice
        | version errored|
        version := CXString getClangVersion.
        version dispose.
        errored := false.
        [ version dispose ] on: Error do: [ :ex| errored := true ].
        self assert: errored.
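The double-dispose guard can be pictured in plain C as well. This is a sketch under stated assumptions, not libclang code: MockCXString, mock_get_version and guarded_dispose are hypothetical names, and malloc/free stand in for whatever clang_getClangVersion() and clang_disposeString() do internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for CXString that owns a heap allocation. */
typedef struct {
    void *data;
    unsigned private_flags;
} MockCXString;

/* Allocate a heap copy of text, standing in for clang_getClangVersion(). */
MockCXString mock_get_version(const char *text) {
    MockCXString s = { NULL, 1 };
    size_t n = strlen(text) + 1;
    s.data = malloc(n);
    if (s.data != NULL)
        memcpy(s.data, text, n);
    return s;
}

/* Free the memory once. Returns 0 on success, -1 if already disposed. */
int guarded_dispose(MockCXString *s) {
    if (s->data == NULL)
        return -1;              /* guard: nothing left to free */
    free(s->data);
    s->data = NULL;             /* mark as disposed */
    s->private_flags = 0;
    return 0;
}

/* The second dispose is refused instead of freeing twice. */
void demo_dispose_twice(void) {
    MockCXString v = mock_get_version("clang version (mock)");
    assert(v.data != NULL);
    assert(guarded_dispose(&v) == 0);
    assert(guarded_dispose(&v) == -1);
}
```

The C version reports the refusal through a return code where the Pharo version signals an Error; either way the second free never reaches the allocator.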

For the curious, you can use `top` from the command line to watch the size of the image grow while running the following…

Transcript open; clear.
[ oc := OrderedCollection new.
  1 to: 100 do: [ :n |
      Transcript crShow: n printString.
      1000 timesRepeat: [ oc add: CXString getClangVersion ].
      oc do: [ :s | s dispose ].
      oc removeAll.
      Smalltalk garbageCollect ]
] forkAt: 35.

You can compare that to running with the dispose message commented out. Unfortunately it seems that while the disposed memory is reused inside the image, it is not entirely released back to the system. The new Spur memory manager of the opensmalltalk-vm that Pharo runs on was released with a basic garbage collection scheme only. It is scheduled for improvement, so watch this space.

Autorelease infrastructure

Now it’s nice that we can dispose of the memory on the C heap, but actually… we really don’t want to manage this manually. We want that memory on the C heap to be released automagically when the object is finalized during normal garbage collection. To register an FFI object for finalization you send it #autoRelease. The curious can peek at its senders and implementers in the image. FFI has two main built-in auto-release mechanisms. Let’s take a look at them behind the scenes…

  • An ExternalAddress’s #autoRelease registers itself directly with the system’s finalization registry. Note that Object is the only implementer of #finalizationRegistry in the image.
Object >> finalizationRegistry
    ^WeakRegistry default

ExternalAddress >> autoRelease
    ^ self class finalizationRegistry add: self.  

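After your object holding an ExternalAddress is GC’d, the ExternalAddress is GC’d in turn and is sent #finalize

ExternalAddress >> finalize
     self isNull ifTrue: [^self].
     self free.

ExternalAddress >> free
     "Implemented as a VM primitive; the next line runs only if the primitive fails"
     ^self primitiveFailed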
  • An FFIExternalReference’s #autoRelease registers itself for finalization with a resource manager, which in turn keeps its own WeakRegistry for finalization.
FFIExternalResourceManager class >> initialize
    registry := WeakRegistry new

FFIExternalReference >> autoRelease
    FFIExternalResourceManager addResource: self
    "Note, subclasses should implement #resourceData 
    and on the class-side #finalizeResourceData:"

FFIExternalResourceManager class >> addResource: anObject
    self uniqueInstance addResource: anObject

FFIExternalResourceManager >> addResource: anObject
    ^ self addResource: anObject data: anObject resourceData

FFIExternalReference  >> resourceData
     ^ self getHandle

FFIExternalResourceManager >> addResource: anObject data: resourceData
    registry
        add: anObject
        executor: (FFIExternalResourceExecutor new
            resourceClass: anObject class
            data: resourceData)

FFIExternalResourceExecutor >> initialize
    super initialize.
    session := Smalltalk session  

FFIExternalResourceExecutor >> resourceClass: aResourceClass data: aData
    resourceClass := aResourceClass.
    data := aData

After your object holding an FFIExternalReference is GC’d, the FFIExternalReference is GC’d and its executor is sent #finalize, which in turn sends #finalizeResourceData: to the class of the FFIExternalReference.

FFIExternalResourceExecutor >> finalize
    session = Smalltalk session ifFalse: [^ self ].
    resourceClass finalizeResourceData: data.

FFIExternalReference class >> finalizeResourceData: handle
    handle isNull ifTrue: [ ^ self ].
    handle free.
    handle beNull.

One thing to notice is that FFIExternalResourceExecutor does session management, to avoid finalizing or freeing C memory that was previously thrown away by a session change. Notice also that the executor doesn’t store the object itself but only its resource data and its class. The idea is to keep minimal information (in general, just the handle) to avoid circular references to the object, to help ensure the object holding the handle will be collected. As a consequence, in #finalizeResourceData: you cannot access any properties of the object, since it is already garbage collected.
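To make the executor idea concrete, here is a loose C analogy. It is not how Pharo implements it; Executor, make_executor, executor_finalize and the integer session counter are all hypothetical names for this sketch. The executor carries only the raw handle, a release function and the session it was created in, never the owning object, and it refuses to release anything that belongs to an earlier session.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical executor: only the handle, how to release it, and the
 * session it was created in. It holds no reference to the owner. */
typedef struct {
    void *data;
    void (*release)(void *);
    int session;
} Executor;

static int current_session = 1;   /* stand-in for `Smalltalk session` */
static int release_count = 0;     /* counts real releases, for the demo */

static void counting_free(void *p) { free(p); release_count++; }

/* Capture the handle together with the session that is current now. */
Executor make_executor(void *data, void (*release)(void *)) {
    Executor e = { data, release, current_session };
    return e;
}

/* Release only within the creating session, and only once. */
void executor_finalize(Executor *e) {
    if (e->session != current_session) return;  /* stale handle: skip */
    if (e->data == NULL) return;                /* already released */
    e->release(e->data);
    e->data = NULL;
}

void demo_executor(void) {
    current_session = 1;
    release_count = 0;
    Executor a = make_executor(malloc(16), counting_free);
    Executor b = make_executor(malloc(16), counting_free);
    executor_finalize(&a);          /* same session: released */
    assert(release_count == 1);
    current_session = 2;            /* simulate an image restart */
    executor_finalize(&b);          /* skipped: belongs to session 1 */
    assert(release_count == 1);
    free(b.data);  /* mock only: here the memory is in fact still valid */
}
```

Because the executor never points back at the owner, nothing in it can keep the owner alive, which is the same reason #finalizeResourceData: receives just the handle.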

CXString autorelease

Since CXString already has some session management implemented for inter-session persistence, we can leverage that, and for now I’ll just use the simpler form of auto release. Maybe I’ll revise it later.

CXString >> autoRelease
    self class finalizationRegistry add: self

CXString >> finalize
    self dispose.

Parting thoughts

You may have noticed I’ve defined all the raw FFI callouts in the class Libclang and wrapped these with guard logic in the Clang type classes. When I was first getting into this I had the raw #ffiCall:s scattered through the C type classes. But as I discovered these needed to be wrapped in some guard logic, methods like #private_clang_dispose became necessary and it felt a bit messy. Conversely, it just felt neater for all the raw #ffiCall:s to be on one class with the same name as the library, wrapped by calls from the C-type objects. For one thing, it should be easier to search for all senders if each raw #ffiCall: is done only once.

The next part shows how to load the C code AST, with the part after that showing how to walk the AST.
