What is MEDLINE?

MEDLINE is a collection of more than 18 million citations to the biomedical literature, maintained by the United States National Library of Medicine (NLM). New citations are released at a rate of about a million per year, which works out to about 5000 per working day.

MEDLINE encodes richly structured data about publications including authors, affiliations, titles, abstracts, grants, medical subject headings (MeSH), etc. LingPipe provides tools to parse the data from its native XML format into a structured Java object.

2010 MEDLINE Version

This tutorial is for the current (2010) MEDLINE/PubMed production year. Every year, NLM releases a new version of MEDLINE, with a revised DTD with a timestamp in its name.

What's in this Tutorial?

Parser, Handler and Word Count Demo

This tutorial shows how to use the MEDLINE parser and how to write a handler to process the citation objects produced by the parser. The running example is a simple word counter that produces histograms of word counts in MEDLINE citations.

Where's the Downloading and Indexing Tutorial?

As of LingPipe 3.7, we removed the Lucene indexing and FTP download portion of this tutorial, as it was overly complex for explaining the basic MEDLINE handler.

If you're interested in seeing how to programmatically download MEDLINE from NLM, verify checksums, and keep an up-to-the-minute index of MEDLINE, check out our new sandbox project:

Running the Demo

To run the demo, change directories to demos/tutorial/medline and then run:

cd $LINGPIPE/demos/tutorial/medline
ant word-count

This prints the PubMed ID of each citation as it is processed, followed by the word counts:

c:\carp\devguard\lingpipe\trunk\demos\tutorial\medline>ant word-count
Buildfile: build.xml

compile:

jar:

word-count:
     [java] processing pmid=10540283
     [java] processing pmid=10502787
     [java] processing pmid=10737756
...
     [java] processing pmid=18964660
     [java] processing pmid=19771122
     [java]      1067 ,
     [java]       963 .
     [java]       929 of
     [java]       888 the
     [java]       852 -
     [java]       604 and
     [java]       487 in
     [java]       421 (
     [java]       419 )
     [java]       348 to
     [java]       335 a
     [java]       195 for
     [java]       188 with
     [java]       160 The
     [java]       155 that
     [java]       141 :
     [java]       138 was
     [java]       133 were
     [java]       131 is
     [java]       129 by
     [java]       113 as
     [java]       109 /
     [java]        96 from
     [java]        87 %
     [java]        86 on
     [java]        85 or
     [java]        83 1
     [java]        83 Humans
     [java]        78 therapy
     [java]        76 metabolism
     [java]        75 2
     [java]        74 are
     [java]        73 patients
     [java]        72 group
     [java]        70 +
     [java]        69 genetics
     [java]        68 an
     [java]        68 be
...
     [java]        10 kg
     [java]        10 like
     [java]        10 lines
     [java]        10 membrane
     [java]        10 most
     [java]        10 normal
     [java]        10 out
     [java]        10 outcomes
     [java]        10 parameters
     [java]        10 rat
     [java]        10 review
     [java]        10 severe
     [java]        10 some
     [java]        10 surface
     [java]        10 surgery

BUILD SUCCESSFUL
Total time: 2 seconds

The output first lists the identifiers of the MEDLINE citations as they are processed, then the tokens in decreasing order of the number of times they occurred in the documents.

Note On LingPipe Tokenizers

The standard LingPipe tokenizer is case sensitive, includes punctuation and carries out a fine-grained tokenization by splitting hyphenated words, contractions, etc. Other tokenizers normalize case, remove stopwords, reduce words to stems, filter out punctuation, etc.
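To make the effect of such normalization concrete, here is a minimal sketch in plain Java. It does not use the LingPipe tokenizer classes; the class name, the method name, and the tiny stopword list are all assumptions made up for illustration. It lowercases tokens and drops stopwords and punctuation-only tokens, which is why "The" and "the" would be merged into one count under a case-normalizing tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a chain of normalizing tokenizers (not LingPipe API).
final class NormalizeTokensSketch {

    // Lowercases each token, then drops stopwords and tokens with no letter or digit.
    static List<String> normalize(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ENGLISH);
            boolean hasLetterOrDigit = lower.chars().anyMatch(Character::isLetterOrDigit);
            if (hasLetterOrDigit && !stopwords.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative stopword list only; real lists are much longer.
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "of", "and"));
        List<String> tokens = Arrays.asList("The", ",", "patients", "of", "the", "group", ".");
        System.out.println(normalize(tokens, stopwords));
    }
}
```

Stemming is left out of the sketch because it needs a real stemming algorithm rather than a few lines of string handling.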

MEDLINE XML Sample Files

NLM distributes one small sample file in plain XML, which we have included in the LingPipe distribution:

Each MEDLINE sample XML file contains a set of citations under a single MedlineCitationSet element, with each individual citation being the content of a MedlineCitation element. The gzipped files unpack into a single XML file adhering to the same DTD as the small sample.
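The nesting described above can be explored without LingPipe. The following sketch uses only the standard Java SAX API to collect the first PMID under each MedlineCitation element. The element names MedlineCitationSet and MedlineCitation come from the text above; the PMID element name and the tiny inline document are assumptions for illustration, and a real file carries far more structure per citation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class CitationSetSketch {

    // Returns the text of the first PMID element found inside each MedlineCitation.
    static List<String> pmids(String xml) {
        final List<String> ids = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inCitation = false;
            private boolean seenPmid = false;
            private StringBuilder buf = null; // non-null only while inside a PMID of interest

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("MedlineCitation")) {
                    inCitation = true;
                    seenPmid = false;
                } else if (inCitation && !seenPmid && qName.equals("PMID")) {
                    buf = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (buf != null) buf.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("PMID") && buf != null) {
                    ids.add(buf.toString().trim());
                    buf = null;
                    seenPmid = true;
                } else if (qName.equals("MedlineCitation")) {
                    inCitation = false;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<MedlineCitationSet>"
            + "<MedlineCitation><PMID>10540283</PMID></MedlineCitation>"
            + "<MedlineCitation><PMID>10502787</PMID></MedlineCitation>"
            + "</MedlineCitationSet>";
        System.out.println(pmids(xml));
    }
}
```

This is only a way to inspect the file layout; the MedlineParser described in the code walkthrough does the full job of building citation objects.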

The demo may be run directly from the gzipped MEDLINE files. You can save the gzipped files wherever you want and include a path to them as an argument to the Ant targets index-sample or index-baseline.

Code Walkthrough

The source code for the word count demo is in the single file:

The basic process is as follows; we explain the details in the rest of the code walkthrough. The com.aliasi.medline package contains the classes used to parse the MEDLINE distribution files. Parsing is done by a MedlineParser. The parser is configured for a handler in the form of a MedlineHandler. Given data to parse, the parser generates a MedlineCitation object for each citation entry in the distribution file and passes it to the handler's handle(MedlineCitation) method.

A MedlineHandler can perform arbitrary operations on the citation, accumulating results in a database, calculating occurrence statistics for particular MeSH terms, etc.

Parsing

The parser is constructed with a single boolean argument, which indicates whether or not to save the raw XML:

public static void main(String[] args) throws IOException, SAXException {
    boolean saveXML = false;
    MedlineParser parser = new MedlineParser(saveXML);
    WordCountHandler handler = new WordCountHandler();
    parser.setHandler(handler);
    ...

In this case, we're not saving the XML; in the indexer in the sandbox project we save the raw XML to make it easy to reconstruct citation objects. We then construct the handler, which will actually process the citations; the WordCountHandler is a static class defined in the same source file. Finally, we configure the parser to use the handler we just constructed.

Next, we will walk through the arguments, which are file names, and parse them using our parser:

    ...
    for (String arg : args) {
        if (arg.endsWith(".xml")) {
            InputSource inputSource = new InputSource(arg);
            parser.parse(inputSource);
        } else if (arg.endsWith(".gz")) {
            ...
        } else {
            throw new IllegalArgumentException("arguments must end with .xml or .gz");
        }
    }
    handler.report();
}

We simply loop over the arguments to the main function. If an argument names a plain XML file (indicated by the suffix .xml), we do one thing; if it names a gzipped XML file (indicated by .gz), we do another. For plain files, we've shown the action, which is to create a new input source from the argument and then parse the input source with the parser. Input sources, defined in the org.xml.sax package, are the generic wrapper for files, input streams, readers, and URLs.

Once we're done processing the files, we call the handler's report() method to print out results.

For compressed files, check out the source code in src/WordCountMedline.java to see how to handle them directly with Java; it's just another set of calls to create an input source.
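As a rough idea of what those calls can look like, here is a minimal sketch using only the standard library. It is not copied from src/WordCountMedline.java, and the class and method names are hypothetical. The key step is wrapping the compressed stream in a java.util.zip.GZIPInputStream before handing it to the InputSource constructor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import org.xml.sax.InputSource;

final class GzipInputSourceSketch {

    // Wraps a gzipped byte stream so that an XML parser sees the decompressed bytes.
    // For a file on disk, pass in new FileInputStream(arg).
    static InputSource gzippedSource(InputStream compressed) {
        try {
            return new InputSource(new GZIPInputStream(compressed));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper only: gzips a string in memory so the sketch needs no files.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Demo helper only: reads everything the input source's byte stream provides.
    static String readAll(InputSource source) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try (InputStream in = source.getByteStream()) {
            int n;
            while ((n = in.read(buf)) != -1) {
                bytes.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] gz = gzip("<MedlineCitationSet/>");
        InputSource source = gzippedSource(new ByteArrayInputStream(gz));
        System.out.println(readAll(source));
    }
}
```

In the demo, the resulting input source would be passed to parser.parse(inputSource) exactly as in the plain XML case.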

Handling

So far, we've only shown how to do the parsing. The actual work is all done by the MedlineHandler implementation, which receives calls to its handle(MedlineCitation) method.

static class WordCountHandler implements MedlineHandler {
    ObjectToCounterMap<String> mCounter = new ObjectToCounterMap<String>();
    ...

We define the handler to implement com.aliasi.medline.MedlineHandler, which is the type of handler required by the MEDLINE parser. The class has a single member variable, mCounter, which is an object to counter map (instance of com.aliasi.util.ObjectToCounterMap). The type of object being counted is defined as String through the generic argument.

The handler receives MEDLINE citations as callbacks from the parser to its handle(MedlineCitation) method. This is defined as follows:

public void handle(MedlineCitation citation) {
    String id = citation.pmid();
    System.out.println("processing pmid=" + id);

    Article article = citation.article();
    String titleText = article.articleTitleText();
    addText(titleText);

    Abstract abstrct = article.abstrct();
    if (abstrct != null) {
        String abstractText = abstrct.textWithoutTruncationMarker();
        addText(abstractText);
    }

    MeshHeading[] headings = citation.meshHeadings();
    for (MeshHeading heading : headings) {
        for (Topic topic : heading.topics()) {
            String topicText = topic.topic();
            addText(topicText);
        }
    }
}

The MedlineCitation object embodies an object model in Java for a MEDLINE citation. The handler method first extracts the PubMed identifier using the method MedlineCitation.pmid() and prints it; this is where the identifiers in the output come from.

The next block of code extracts the com.aliasi.medline.Article object from the citation, then the text of the article title from the article object, and passes it to the addText(String) method in the handler (which we show below).

The third block extracts the abstract (the variable name abstrct is not a typo; abstract is a reserved word in Java). It then checks whether the abstract is null (not every citation has an abstract) and, if it is not, pulls out the text and calls the addText(String) method.

The final block runs through the Medical Subject Headings (MeSH). NLM annotates each citation using the controlled vocabulary of MeSH. The MeSH headings are supplied as an array, and that array always exists. We iterate over the headings in the array and, for each heading, pull out its topics. We then pull out the text of each topic and add it using the same method.

Counting Words

The counting is all done through the addText(String) method in the handler:

public void addText(String text) {
    char[] cs = text.toCharArray();
    TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
    Tokenizer tokenizer = factory.tokenizer(cs,0,cs.length);
    for (String token : tokenizer) {
        mCounter.increment(token);
    }
}

This method simply converts the text to a character array, then supplies the character array to a tokenizer (in this case, an Indo-European tokenizer defined in com.aliasi.tokenizer). It then iterates over the tokens, adding each one to the counter through the counter's increment(String) method; the argument type is String because of the counter's generic type argument.

The counter stores counts for objects that have been incremented, and allows them to be traversed by count. The code that does that and prints them out is the report() method in the handler:

public void report() {
    List<String> keysByCount = mCounter.keysOrderedByCountList();
    for (String key : keysByCount) {
        int count = mCounter.getCount(key);
        if (count < 10) break;
        System.out.printf("%9d %s\n",count,key);
    }
}

This method calls the keysOrderedByCountList() method on the object to counter map, then iterates over the keys. It gets the count for a key using the getCount(String) method. Because the keys are ordered by decreasing count, it stops at the first key whose count is less than 10; otherwise it prints the count and key.
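For readers who want to experiment with the counting logic without LingPipe on the classpath, here is a minimal sketch of the same idea in plain Java. A HashMap stands in for ObjectToCounterMap and whitespace splitting stands in for the Indo-European tokenizer, so the counts will differ from the demo's (punctuation stays attached to words, for example). The class name, the configurable threshold, and the alphabetical tie-breaking are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordCountSketch {

    // Stand-in for ObjectToCounterMap<String>: token -> number of occurrences.
    private final Map<String, Integer> counts = new HashMap<>();

    // Splits on whitespace (a crude substitute for a real tokenizer) and counts tokens.
    void addText(String text) {
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    // Keys in decreasing order of count; ties are broken alphabetically.
    List<String> keysByCount() {
        List<String> keys = new ArrayList<>(counts.keySet());
        keys.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return keys;
    }

    // Formats one line per key, stopping at the first key below the threshold.
    List<String> report(int minCount) {
        List<String> lines = new ArrayList<>();
        for (String key : keysByCount()) {
            int count = counts.get(key);
            if (count < minCount) break; // valid because keys are sorted by count
            lines.add(String.format("%9d %s", count, key));
        }
        return lines;
    }

    public static void main(String[] args) {
        WordCountSketch counter = new WordCountSketch();
        counter.addText("the cat and the dog");
        counter.addText("the dog");
        for (String line : counter.report(2)) {
            System.out.println(line);
        }
    }
}
```

Returning the report lines rather than printing them directly makes the logic easy to check in a unit test.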

MEDLINE Distribution

MEDLINE is distributed in two parts: the baseline distribution and a set of updates files. A new updates file is released once a day (on most weekdays). Updates files contain new citation entries as well as revisions of existing entries. Updates files may also contain instructions to delete existing entries. In order to maintain a single, coherent Lucene index over MEDLINE, each update must be processed in the order in which it is released.
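To see why release order matters, consider the following toy model in plain Java. Everything in it is hypothetical: a map from PubMed ID to citation text stands in for the index, and the UpdateFile class is an invented simplification of an updates file, not the real file format or any LingPipe class. Applying a revision followed by a deletion gives a different result from applying the same two updates in the opposite order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UpdateOrderSketch {

    // Hypothetical model of one updates file: new or revised entries plus deletions.
    static final class UpdateFile {
        final Map<String, String> entries = new LinkedHashMap<>(); // pmid -> citation text
        final List<String> deletedPmids = new ArrayList<>();
    }

    // Applies the updates in the order given, starting from a copy of the baseline.
    static Map<String, String> apply(Map<String, String> baseline, List<UpdateFile> updates) {
        Map<String, String> index = new LinkedHashMap<>(baseline);
        for (UpdateFile update : updates) {
            index.putAll(update.entries);          // add new entries, overwrite revised ones
            for (String pmid : update.deletedPmids) {
                index.remove(pmid);                // then honor delete instructions
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> baseline = new LinkedHashMap<>();
        baseline.put("10540283", "original entry");

        UpdateFile revise = new UpdateFile();
        revise.entries.put("10540283", "revised entry");

        UpdateFile delete = new UpdateFile();
        delete.deletedPmids.add("10540283");

        // Release order: the citation ends up deleted.
        System.out.println(apply(baseline, Arrays.asList(revise, delete)));
        // Wrong order: the deleted citation reappears in its revised form.
        System.out.println(apply(baseline, Arrays.asList(delete, revise)));
    }
}
```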

Licensing MEDLINE

MEDLINE data is licensed by the United States National Library of Medicine (NLM), but most of the publishers retain copyright over their contributions. MEDLINE data is free for research to anyone (with registration), but commercial use is restricted to U.S.-based organizations (see the section How to Lease in the above document).

See the following link for more information:

NLM is very responsive and will help you out if you have problems.

An overview of the distribution is available at:

Before downloading, you need to register your IP address with NLM. So you'll need a fixed IP address to keep up with MEDLINE.

References