libstemmer_c

This document pertains to the C version of the libstemmer distribution, available for download from:

https://snowballstem.org/download.html

What is Stemming?

Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.

This stem form is often a word itself, but this is not always the case as this is not a requirement for text search systems, which are the intended field of use. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem), and over-stemming is more problematic than under-stemming so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer.

Compiling the library

A simple makefile is provided for Unix style systems. On such systems, it should be possible simply to run "make", and the file "libstemmer.o" and the example program "stemwords" will be generated.

If this doesn't work on your system, you need to write your own build system (or call the compiler directly). The files to compile are all contained in the "libstemmer", "runtime" and "src_c" directories, and the public header file is contained in the "include" directory.

The library comes in two flavours; UTF-8 only, and UTF-8 plus other character sets. To use the utf-8 only flavour, compile "libstemmer_utf8.c" instead of "libstemmer.c".

For convenience "mkinc.mak" is a makefile fragment listing the source files and header files used to compile the standard version of the library. "mkinc_utf8.mak" is a comparable makefile fragment listing just the source files for the UTF-8 only version of the library.

Using the library

The stemming algorithms generally expect the input text to use composed accents (Unicode NFC or NFKC) and to have been folded to lower case already.

The library provides a simple C API. Essentially, a new stemmer can be obtained by using "sb_stemmer_new". "sb_stemmer_stem" is then used to stem a word, "sb_stemmer_length" returns the stemmed length of the last word processed, and "sb_stemmer_delete" is used to delete a stemmer.

Generally you should create a stemmer object and reuse it rather than creating a fresh object for each word stemmed (since there's some cost to creating and destroying the object).

The stemmer code is re-entrant, but not thread-safe if the same stemmer object is used concurrently in different threads.

If you want to perform stemming concurrently in different threads, we suggest creating a new stemmer object for each thread. The alternative is to share stemmer objects between threads and protect access using a mutex or similar but that's liable to slow your program down as threads can end up waiting for the lock.

libstemmer does not currently incorporate any mechanism for caching the results of stemming operations. Such caching can greatly increase the performance of a stemmer under certain situations, so suitable patches will be considered for inclusion.

The standard libstemmer sources contain an algorithm for each of the supported languages. The algorithm may be selected using the english name of the language, or using the 2 or 3 letter ISO 639 language codes. In addition, the traditional "Porter" stemming algorithm for english is included for backwards compatibility purposes, but we recommend use of the "English" stemmer in preference for new projects.

(Some minor algorithms which are included only as curiosities in the snowball website, such as the Lovins stemmer and the Kraaij Pohlmann stemmer, are not included in the standard libstemmer sources. These are not really supported by the snowball project, but it would be possible to compile a modified libstemmer library containing these if desired.)

The stemwords example

The stemwords example program allows you to run any of the stemmers compiled into the libstemmer library on a sample vocabulary. For details on how to use it, run it with the "-h" command line option.

Using the library in a larger system

If you are incorporating the library into the build system of a larger program, I recommend copying the unpacked tarball without modification into a subdirectory of the sources of your program. Future versions of the library are intended to keep the same structure, so this will keep the work required to move to a new version of the library to a minimum.

As an additional convenience, the list of source and header files used in the library is detailed in mkinc.mak - a file which is in a suitable format for inclusion by a Makefile. By including this file in your build system, you can link the snowball system into your program with a few extra rules.

Using the library in a system using GNU autotools

The libstemmer_c library can be integrated into a larger system which uses the GNU autotool framework (and in particular, automake and autoconf) as follows:

  1. Unpack libstemmer_c-*.tar.gz in the top level project directory and rename the resulting directory to remove the version number so that there is a libstemmer_c subdirectory of the top level directory of the project.
  2. Add a file "Makefile.am" to the unpacked libstemmer_c folder, containing:

noinst_LTLIBRARIES = libstemmer.la include $(srcdir)/mkinc.mak noinst_HEADERS = $(snowball_headers) libstemmer_la_SOURCES = $(snowball_sources)

(You may also need to add other lines to this, for example, if you are using compiler options which are not compatible with compiling the libstemmer library.)

  1. Add libstemmer_c to the AC_CONFIG_FILES declaration in the project's configure.ac file.
  2. Add to the top level makefile the following lines (or modify existing assignments to these variables appropriately):

AUTOMAKE_OPTIONS = subdir-objects AM_CPPFLAGS = -I$(top_srcdir)/libstemmer_c/include SUBDIRS=libstemmer_c <name>_LIBADD = libstemmer_c/libstemmer.la

(Where <name> is the name of the library or executable which links against libstemmer.)