By Mitzi Morris

What's Database Text Mining?

This tutorial shows how to use a relational database management system (RDBMS) to store documents and LingPipe analyses. It uses MEDLINE data as the example data, and MySQL as the example RDBMS. As in the MEDLINE Parsing and Indexing Demo, the LingPipe MedlineParser is used to parse the data from an XML file. Scripts are provided to create the database and database tables. This tutorial is aimed at the novice database programmer, and therefore the database design and the way that the program interacts with the database have been kept as simple as possible.

For expository purposes, we break this task into three parts:

Part 1: Loading MEDLINE data into a database
Part 2: Using LingPipe to annotate data in the database
Part 3: Querying the database

Completing part 1 results in a simple database with a single table containing the titles and abstracts of MEDLINE citations. In part 2 we add two more tables to the database, one for sentence information and one for mentions of named entities found in each sentence. Then we present a program which retrieves text data from the MEDLINE citation table, runs it through the LingPipe sentence and named-entity analyzers, and stores the results in the sentence and mention tables. Once the database has been created and populated, we present a few example queries which can be run against it. Finally, we discuss ways in which this task can be optimized.

MySQL

MySQL runs on most operating systems, including Linux, Unix, Windows, and MacOS, and is available under both a commercial and GPL license. This demo uses MySQL version 5.0, which is available from:

You will only need the "essentials" version for Windows x86 or AMD64. For this demo, what the installer calls a "typical" install with the default (non-custom) configuration is sufficient.

JDBC

The Java Database Connectivity (JDBC) API allows Java programs to access data stored in relational databases which support the standard SQL query language. For information on the JDBC API, see Sun's JDBC Database Access tutorial.

The official JDBC driver for MySQL, Connector/J, is available from the MySQL web site.

We ran this demo with mysql-connector-java-3.1.13-bin.jar. The jar file containing the MySQL Connector/J driver must be on the runtime classpath. The Ant build.xml file is configured to find the 3.1 Connector/J jar if you put it in the lingpipe/demos/lib directory.

The connector is licensed under the GNU General Public License.

Part 1: Loading MEDLINE data into a database

Creating the database

We define a very simple database which contains one table of MEDLINE citation data. The name of the database is medline and the name of the table is citation. Each citation record contains the MEDLINE citation PubMed identifier, title, and abstract (if any).

Table: citation
Columns:
    citation_id   database id, automatically generated via the MySQL AUTO_INCREMENT attribute
    pubmed_id     PubMed id
    title         article title; character encoding is UTF-8
    abstract      abstract, if any; character encoding is UTF-8

The script sql/medline_citations.sql contains the SQL commands to create the database, and can be executed via the MySQL command-line client:

$ mysql -u root -p  -v < sql/medline_citations.sql

The system prompts for the password interactively, then executes the commands in medline_citations.sql, as seen in the following transcript:

Enter password: ****
--------------
CREATE DATABASE IF NOT EXISTS medline
--------------

--------------
CREATE TABLE IF NOT EXISTS citation (
        citation_id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
        pubmed_id       VARCHAR(10) UNIQUE NOT NULL,
        title   VARCHAR(500) CHARSET utf8 NOT NULL,
        abstract VARCHAR(10000) CHARSET utf8 NULL
)
--------------

Loading MEDLINE citations

Once the database and table have been created, we are ready to run the program LoadMedlineDb.java, which loads MEDLINE data into the database. It parses MEDLINE citation instances out of an XML file.

LoadMedlineDb requires an input file which contains the database server name, port, username, and password. This information is needed to connect to the database using JDBC. This demo uses the file db.properties, a Java properties file containing installation-specific database settings stored as key-value pairs. This file is the first command-line argument to the program. Remaining arguments are treated as names of XML files containing MEDLINE data.
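To make the connection details concrete, here is a minimal sketch of reading such a properties file with java.util.Properties and assembling a MySQL JDBC URL from it. The key names used here (host, port, username, password) are illustrative only; consult the db.properties file distributed with the demo for the actual keys.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class ShowDbProperties {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream(args[0]);  // e.g. db.properties
        props.load(in);
        in.close();

        // illustrative key names; the demo's db.properties may use different ones
        String host = props.getProperty("host", "localhost");
        String port = props.getProperty("port", "3306");
        String username = props.getProperty("username");
        String password = props.getProperty("password");  // passed to getConnection in the demo

        // Connector/J URL format: jdbc:mysql://<host>:<port>/<database>
        String dbUrl = "jdbc:mysql://" + host + ":" + port + "/medline";
        System.out.println("url=" + dbUrl + " user=" + username);
    }
}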

As in the MEDLINE parsing demo, we define a static class, MedlineDbLoader, which implements com.aliasi.medline.MedlineHandler. It maintains a connection to the database via the instance variable:

Connection mCon;

The method openDb is used to set this variable to an actual connection to the database, as follows:

Class.forName("com.mysql.jdbc.Driver").newInstance();
mCon = DriverManager.getConnection(dbUrl,username,password);

The first line loads the MySQL JDBC driver class, which registers itself with the java.sql.DriverManager. (See the Sun DriverManager Overview for more information on loading and registering drivers.) The second line uses the information from the db.properties file to open a connection to the database.

The handle method takes a MedlineCitation object and stores the PubMed id, article title, and abstract (if any) as a record in the citation table. The SQL statement which inserts a record into the citation table looks like this:

INSERT INTO citation (pubmed_id, title, abstract)
  VALUES ('123456789','some title text','some abstract text')

In order to efficiently execute such a query, the JDBC API provides the PreparedStatement object, which precompiles the database query so that it can be run multiple times. Precompilation is performed on a schematic query, using the question mark (?) to indicate parameters. In this case, each time we execute the query we will be passing in values for pubmed_id, title, and abstract, therefore we write the query as follows:

static final String INSERT_CITATION_SQL =
    "INSERT INTO citation (pubmed_id, title, abstract) VALUES (?,?,?)";

It is not necessary to specify a value for the citation_id column, since it will be automatically populated by MySQL.
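The demo itself never needs the generated citation_id at insert time, but JDBC can return AUTO_INCREMENT values if required. The following fragment is only a sketch, not part of the demo code: it uses the two-argument prepareStatement overload and getGeneratedKeys, both standard JDBC (requires java.sql.Statement and java.sql.ResultSet on the import list).

// Sketch only: retrieving the AUTO_INCREMENT value assigned by MySQL.
PreparedStatement pstmt
    = mCon.prepareStatement(INSERT_CITATION_SQL,
                            Statement.RETURN_GENERATED_KEYS);
// ... set the three parameters and call executeUpdate() as shown below ...
ResultSet keys = pstmt.getGeneratedKeys();
if (keys.next()) {
    int citationId = keys.getInt(1);   // value MySQL assigned to citation_id
    // citationId could be logged or passed along here
}
keys.close();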

The Connection prepareStatement method takes the query string and returns a PreparedStatement object:

pstmt = mCon.prepareStatement(INSERT_CITATION_SQL);

Each question mark (?) in the query string designates a parameter which must be set using the appropriate setter method. Failing to set a parameter, or setting a parameter to the wrong type, will cause a SQLException to be thrown. Here is the code which sets these parameters and executes the query:

pstmt.setString(1,citation.pmid());
pstmt.setString(2,citation.article().articleTitleText());
Abstract abstr = citation.article().abstrct();
if (abstr != null)
    pstmt.setString(3,abstr.text());
else
    pstmt.setNull(3,Types.VARCHAR);
pstmt.executeUpdate();

The above code is embedded in a try/catch/finally construction in order to catch exceptions thrown by the database. In the finally block we call the close method on the prepared statement and swallow any SQL exception thrown:

try { pstmt.close(); } catch (SQLException se) {}

When all of the citations have been processed, the main method calls the MedlineDbLoader closeDb method, which closes the connection to the database.
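The closeDb method itself is only a few lines. The following is a sketch of its assumed shape, not necessarily the demo's exact code:

// Sketch: close the JDBC connection, ignoring failures on close,
// just as we do for the prepared statements above.
public void closeDb() {
    if (mCon != null) {
        try {
            mCon.close();
        } catch (SQLException se) {
            // ignore
        }
    }
}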

To run part 1 of this demo, invoke the Ant target loadmedline (after creating the database as described above).

The results for the loadmedline target are:

$ ant loadmedline
Buildfile: build.xml

compile:

loadmedline:
     [java] Indexing file=../../data/medsamp2006.xml
     [java] Handling PMID=10540283
     [java] Handling PMID=10854512
     [java] Handling PMID=10972993
     [java] Handling PMID=11056631
     [java] Handling PMID=11034741
     [java] Handling PMID=11406024
...
     [java] Handling PMID=15611661
     [java] Handling PMID=15611667
     [java] Handling PMID=15968009

BUILD SUCCESSFUL
Total time: 3 seconds

Clearing the Database

Once the Ant target loadmedline has run successfully, attempts to run it again will result in a series of errors, because every record that the program is trying to insert into the database already exists:

$ ant loadmedline
Buildfile: build.xml

compile:

loadmedline:
     [java] Indexing file=../../data/medsamp2006.xml
     [java] Handling PMID=10540283
     [java] EXCEPTION HANDLING CITATION=10540283 java.sql.SQLException: Duplicate entry '10540283' for key 2
     [java] Handling PMID=10854512
     [java] EXCEPTION HANDLING CITATION=10854512 java.sql.SQLException: Duplicate entry '10854512' for key 2
...
     [java] Handling PMID=15968009
     [java] EXCEPTION HANDLING CITATION=15968009 java.sql.SQLException: Duplicate entry '15968009' for key 2

BUILD SUCCESSFUL
Total time: 3 seconds

If you wish to run this part of the demo again, then the existing set of citations must be deleted from the database. We have provided a script sql/medline_clear_citations.sql which deletes all records from the citation table, and can be executed via the MySQL command-line client:

$ mysql -u root -p  -v < sql/medline_clear_citations.sql

The system prompts for the password interactively, then executes the commands in medline_clear_citations.sql, as seen in the following transcript:

C:\mycvs\lingpipe\demos\tutorial\db>mysql -u root -p  -v < sql/medline_clear_citations.sql
Enter password: *****
--------------
SELECT COUNT(*) FROM citation
--------------

COUNT(*)
88
--------------
DELETE FROM citation
--------------

--------------
SELECT COUNT(*) FROM citation
--------------

COUNT(*)
0

Now the Ant target loadmedline can be run again.

Part 2: Using LingPipe to annotate data in the database

Part 2 of this demo builds on the database created and populated in part 1. Now that we have a database of text data, we show how to use the LingPipe API to annotate this data, and store the results back into the database.

Annotation consists of identifying sentences and mentions of named entities in each sentence. To do this we add two tables:

Table: sentence
Columns:
    sentence_id   database id, automatically generated via the MySQL AUTO_INCREMENT attribute
    citation_id   database id of the citation containing this sentence
    offset        sentence start point (offset from start of text)
    length        sentence length
    type          either "title" or "abstract"

Table: mention
Columns:
    mention_id    database id, automatically generated via the MySQL AUTO_INCREMENT attribute
    sentence_id   database id of the sentence containing this mention
    offset        mention start point (offset from start of sentence)
    length        mention length
    type          type assigned to this mention by the language model
    text          the mention text itself

Because the title and abstract fields in the citation record already contain the citation text, the sentence record only stores information pointing back into the citation entry. Since we want to be able to query the database for particular mentions, we store the mention text in its own column. Given that we store the start index of the mention and the mention text itself, storing the length of the mention is not strictly necessary. This is a design decision, as is the choice of storing the start and length of the sentence and mention, as opposed to the start and end indices. Database design is a dark art, and as such, is outside the scope of this tutorial.

The script medline_annotations.sql contains the SQL commands to add these two tables to the medline database, and can be executed via the command-line client:

$ mysql -u root -p  -v < sql/medline_annotations.sql

This command prompts for the password and then produces the following output:

Enter password: *****
--------------
CREATE TABLE IF NOT EXISTS sentence (
        sentence_id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
        citation_id INT UNSIGNED NOT NULL REFERENCES citation,
        offset INT UNSIGNED NOT NULL,
        length INT UNSIGNED NOT NULL,
        type VARCHAR(10) NOT NULL
)
--------------

--------------
CREATE TABLE IF NOT EXISTS mention (
        mention_id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
        sentence_id INT UNSIGNED NOT NULL REFERENCES sentence,
        offset INT UNSIGNED NOT NULL,
        length INT UNSIGNED NOT NULL,
        type VARCHAR(40) NOT NULL,
        text VARCHAR(500) CHARSET utf8 NOT NULL
)
--------------

Once we have added these tables to the database, we are ready to run the program AnnotateMedlineDb.java.

AnnotateMedlineDb.java connects to the database in the same way that LoadMedlineDb.java does, so we reuse the openDb and closeDb methods, along with the strategy of storing database information in a properties file which is passed in as a command-line argument to the program.

AnnotateMedlineDb.java creates an instance of com.aliasi.sentences.SentenceChunker to identify sentences in the citation text fields. It reconstitutes a serialized instance of com.aliasi.chunk.Chunker to extract named entities from the sentences. These are instantiated in the AnnotateMedlineDb constructor:

tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
sentenceModel  = new IndoEuropeanSentenceModel();
sentenceChunker
    = new SentenceChunker(tokenizerFactory,
                          sentenceModel);
genomicsModelfile
  = new File("../../models/ne-en-bio-genia.TokenShapeChunker");
neChunker
  = (Chunker) AbstractExternalizable
              .readObject(genomicsModelfile);

The SentenceChunker constructor requires a tokenizer factory and a sentence model; in this example we use the standard Indo-European tokenizer factory and sentence model that are part of the LingPipe distribution. The Chunker is deserialized from the specified file with the helper method readObject in util.AbstractExternalizable; the model ne-en-bio-genia.TokenShapeChunker is distributed with the LingPipe demos.

To annotate a citation, we first retrieve the title and abstract columns from the database. As in part 1 of this demo, we use a prepared statement object.

static final String GET_CITATION_TEXT_SQL
    = "SELECT title, abstract FROM citation WHERE citation_id = ?";
    ...

protected void annotateCitation(int citationId) {
    ...
    pstmt = mCon.prepareStatement(GET_CITATION_TEXT_SQL);
    pstmt.setInt(1,citationId);
    rs = pstmt.executeQuery();
    if (rs.next()) {
        title = rs.getString("title");
        abstr = rs.getString("abstract");
    }
    rs.close();
    pstmt.close();

Then we pass the title and abstract text into annotateSentences. The com.aliasi.sentences.SentenceChunker.chunk method is used to break the text into a set of sentences. As each sentence is stored, the database returns a unique identifier for it. We then annotate the text of the sentence with the following method:

protected void annotateMentions(int sentenceId,
                                String text)
    throws SQLException {

    Chunking chunking
        = neChunker.chunk(text.toCharArray(),
                          0, text.length());
    Set mentions = chunking.chunkSet();
    for (Iterator it = mentions.iterator();
         it.hasNext(); ) {

        Chunk mention = (Chunk)it.next();
        int start = mention.start();
        int end = mention.end();
        storeMention(sentenceId,start,mention.type(),
                     text.substring(start,end));
    }
}

The chunk method of Chunker is used to find all named entities in the sentence. It returns a com.aliasi.chunk.Chunking, which is a set of chunks over a shared underlying character sequence (in this example, the text). The com.aliasi.chunk.Chunking.chunkSet method returns the set of com.aliasi.chunk.Chunk objects. We store the mention start, type, and the text spanned by this chunk in the database.
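Neither annotateSentences nor storeMention is reproduced in the write-up above. The sketch below shows one way each could be written; the SQL strings, the method signatures, and the use of getGeneratedKeys() to obtain the generated sentence_id are assumptions based on the table definitions above, and the actual demo code may differ.

// Sketch only -- assumed shapes for annotateSentences and storeMention.
// Requires java.sql.Statement, java.sql.PreparedStatement, java.sql.ResultSet,
// java.util.Iterator, and the LingPipe chunk classes already imported above.

static final String INSERT_SENTENCE_SQL =
    "INSERT INTO sentence (citation_id, offset, length, type) VALUES (?,?,?,?)";

static final String INSERT_MENTION_SQL =
    "INSERT INTO mention (sentence_id, offset, length, type, text) VALUES (?,?,?,?,?)";

// Chunk the text into sentences, store each sentence, and annotate its mentions.
// The type argument is either "title" or "abstract".
protected void annotateSentences(int citationId, String text, String type)
    throws SQLException {

    Chunking chunking
        = sentenceChunker.chunk(text.toCharArray(), 0, text.length());
    for (Iterator it = chunking.chunkSet().iterator(); it.hasNext(); ) {
        Chunk sentence = (Chunk) it.next();
        int start = sentence.start();
        int end = sentence.end();

        PreparedStatement pstmt
            = mCon.prepareStatement(INSERT_SENTENCE_SQL,
                                    Statement.RETURN_GENERATED_KEYS);
        pstmt.setInt(1, citationId);
        pstmt.setInt(2, start);
        pstmt.setInt(3, end - start);
        pstmt.setString(4, type);
        pstmt.executeUpdate();

        ResultSet keys = pstmt.getGeneratedKeys();   // sentence_id assigned by MySQL
        if (keys.next()) {
            int sentenceId = keys.getInt(1);
            annotateMentions(sentenceId, text.substring(start, end));
        }
        keys.close();
        pstmt.close();
    }
}

// Store one mention; the length column is derived from the mention text.
protected void storeMention(int sentenceId, int offset,
                            String type, String text)
    throws SQLException {

    PreparedStatement pstmt = mCon.prepareStatement(INSERT_MENTION_SQL);
    pstmt.setInt(1, sentenceId);
    pstmt.setInt(2, offset);
    pstmt.setInt(3, text.length());
    pstmt.setString(4, type);
    pstmt.setString(5, text);
    pstmt.executeUpdate();
    pstmt.close();
}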

To run part 2 of this demo, invoke the Ant target annotatemedline (after adding the annotation tables as described above).

Running the annotatemedline target produces the following output.

$ ant annotatemedline

annotatemedline:
    [java] Annotating citation_id=89
    [java] Annotating citation_id=90
    [java] Annotating citation_id=91
...
    [java] Annotating citation_id=174
    [java] Annotating citation_id=175
    [java] Annotating citation_id=176

Note that the identifiers may vary depending on the state of the database.

Clearing the Database

Executing the Ant target annotatemedline a second time will result in duplicate annotations for each citation (since, for demo purposes, we have created an extremely simple database). If you wish to run this part of the demo again, the existing annotations should be deleted from the database first. Run the script medline_clear_annotations.sql to delete the existing records from the sentence and mention tables. Then the Ant target annotatemedline can be run again.

Part 3: Querying the Database

With the medline database in place, we can begin running queries. In the following examples we show the transcript of a session using the MySQL command-line client, which is started without passing in a file argument:

$ mysql -u root -p -v

To check that the data loading process was successful, we run simple queries to get row counts on the tables, and do spot checks on the data. First we check the number of records in the citation, sentence, and mention tables, using the query: "select count(*) from table."

$ mysql -u root -p -v
Enter password: *****

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 18 to server version: 5.0.22-community-nt

mysql> use medline
Database changed
mysql> select count(*) from citation;
--------------
select count(*) from citation
--------------

+----------+
| count(*) |
+----------+
|       88 |
+----------+
1 row in set (0.02 sec)

mysql> select count(*) from sentence;
--------------
select count(*) from sentence
--------------

+----------+
| count(*) |
+----------+
|      555 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from mention;
--------------
select count(*) from mention
--------------

+----------+
| count(*) |
+----------+
|     1570 |
+----------+
1 row in set (0.00 sec)

To check the data we select a few rows from each table, via the query:

"select * from table limit N."

To see the data in each column on a separate line, we terminate the query with "\G" instead of ";".

mysql> select * from mention limit 5 \G
--------------
select * from mention limit 5
--------------

*************************** 1. row ***************************
 mention_id: 1571
sentence_id: 556
     offset: 0
     length: 24
       type: other_name
       text: Transcription regulation
*************************** 2. row ***************************
 mention_id: 1572
sentence_id: 556
     offset: 32
     length: 16
       type: DNA_domain_or_region
       text: nir gene cluster
*************************** 3. row ***************************
 mention_id: 1573
sentence_id: 556
     offset: 58
     length: 17
       type: protein_family_or_group
       text: nitrite reductase
*************************** 4. row ***************************
 mention_id: 1574
sentence_id: 556
     offset: 79
     length: 24
       type: mono_cell
       text: Paracoccus denitrificans
*************************** 5. row ***************************
 mention_id: 1575
sentence_id: 556
     offset: 113
     length: 3
       type: other_name
       text: NNR
5 rows in set (0.00 sec)

Citations are listed in the same way.

mysql> select * from citation limit 1 \G
--------------
select * from citation limit 1
--------------

*************************** 1. row ***************************
citation_id: 89
  pubmed_id: 10540283

  title: Transcription regulation of the nir gene cluster encoding
  nitrite reductase of Paracoccus denitrificans involves NNR and NirI,
  a novel type of membrane protein.

  abstract: The nirIX gene cluster of Paracoccus denitrificans is
  located between the nir and nor gene clusters encoding nitrite and
  nitric oxide reductases respectively. The NirI sequence corresponds
  to that of a membrane-bound protein with six transmembrane helices,
  a large periplasmic domain and cysteine-rich cytoplasmic domains
  that resemble the binding sites of [4Fe-4S] clusters in many
  ferredoxin-like proteins. NirX is soluble and apparently located in
  the periplasm, as judged by the predicted signal sequence. NirI and
  NirX are homologues of NosR and NosX, proteins involved in
  regulation of the expression of the nos gene cluster encoding
  nitrous oxide reductase in Pseudomonas stutzeri and Sinorhizobium
  meliloti. Analysis of a NirI-deficient mutant strain revealed that
  NirI is involved in transcription activation of the nir gene cluster
  in response to oxygen limitation and the presence of N-oxides. The
  NirX-deficient mutant transiently accumulated nitrite in the growth
  medium, but it had a final growth yield similar to that of the wild
  type. Transcription of the nirIX gene cluster itself was controlled
  by NNR, a member of the family of FNR-like transcriptional
  activators. An NNR binding sequence is located in the middle of the
  intergenic region between the nirI and nirS genes with its centre
  located at position -41.5 relative to the transcription start sites
  of both genes. Attempts to complement the NirI mutation via cloning
  of the nirIX gene cluster on a broad-host-range vector were
  unsuccessful, the ability to express nitrite reductase being
  restored only when the nirIX gene cluster was reintegrated into the
  chromosome of the NirI-deficient mutant via homologous recombination
  in such a way that the wild-type nirI gene was present directly
  upstream of the nir operon.

1 row in set (0.00 sec)

The named-entity chunker assigned types to mentions using a model trained on the GENIA corpus. GENIA labels dozens of categories in the genomics and proteomics domain. For more information on this corpus, see the GENIA Project Home Page. To see the distribution of mention types in the medline database, we run the following query:

mysql> select distinct type, count(*)
    -> from mention group by type;
--------------
select distinct type, count(*)
from mention group by type
--------------

+--------------------------+----------+
| type                     | count(*) |
+--------------------------+----------+
| amino_acid_monomer       |        9 |
| atom                     |       10 |
| body_part                |       26 |
| carbohydrate             |        3 |
| cell_component           |       15 |
| cell_line                |       31 |
| cell_type                |       42 |
| DNA_domain_or_region     |       72 |
| DNA_family_or_group      |       16 |
| DNA_molecule             |        5 |
| DNA_N                    |        1 |
| DNA_substructure         |        2 |
| inorganic                |       15 |
| lipid                    |       14 |
| mono_cell                |       14 |
| multi_cell               |      148 |
| nucleotide               |        8 |
| other_artificial_source  |        4 |
| other_name               |      773 |
| other_organic_compound   |      123 |
| peptide                  |        8 |
| polynucleotide           |        6 |
| protein_domain_or_region |       13 |
| protein_family_or_group  |       42 |
| protein_molecule         |      112 |
| protein_N                |        8 |
| protein_substructure     |        5 |
| RNA_domain_or_region     |        3 |
| RNA_family_or_group      |        4 |
| RNA_molecule             |        4 |
| RNA_N                    |        1 |
| RNA_substructure         |        1 |
| tissue                   |       27 |
| virus                    |        5 |
+--------------------------+----------+
34 rows in set (0.00 sec)

To find the 10 most frequently mentioned proteins (regardless of subtype) we use the query:

mysql> select text, count(*) from mention
    -> where type like 'protein%'
    -> group by 1 order by 2
    -> desc limit 10;
--------------
select text, count(*) from mention
where type like 'protein%'
group by 1 order by 2
desc limit 10
--------------

+-------------------+----------+
| text              | count(*) |
+-------------------+----------+
| 201Tl             |        5 |
| PASG              |        4 |
| prostate          |        3 |
| Raf-1             |        3 |
| MAPK              |        3 |
| PKC-alpha         |        3 |
| PKC-delta         |        3 |
| nitrite reductase |        2 |
| NirI              |        2 |
| surgical          |        2 |
+-------------------+----------+
10 rows in set (0.00 sec)

To find sentences which mention a particular named entity, we use the query:

mysql> select sentence_id from mention
    -> where text = 'MAPK';
--------------
select sentence_id from mention
where text = 'MAPK'
--------------

+-------------+
| sentence_id |
+-------------+
|         942 |
|         947 |
|         951 |
+-------------+
3 rows in set (0.01 sec)
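The same lookup can, of course, be issued from Java through JDBC rather than from the command-line client. The following is a minimal sketch, reusing the open connection mCon from earlier in the tutorial:

// Sketch only: running the mention lookup from Java via JDBC.
PreparedStatement pstmt
    = mCon.prepareStatement("SELECT sentence_id FROM mention WHERE text = ?");
pstmt.setString(1, "MAPK");
ResultSet rs = pstmt.executeQuery();
while (rs.next()) {
    System.out.println("sentence_id=" + rs.getInt("sentence_id"));
}
rs.close();
pstmt.close();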

Finally, we present a slightly more complex query, which finds all pairs of protein names that co-occur in a sentence. To do this we join the mention table to itself, and restrict our selection to proteins which co-occur in the same sentence. Because this query returns a table with two name columns, each pair of names would occur twice in the result, once as the pair {A,B} and once as the pair {B,A}. To remove this duplication, we require the names to occur in alphabetical order: this keeps {A,B} and discards {B,A}. The full query is:

mysql> select m1.text, m2.text
    -> from mention m1, mention m2
    -> where m1.sentence_id = m2.sentence_id
    -> and m1.type like 'protein%'
    -> and m2.type like 'protein%'
    -> and m1.text < m2.text
    -> limit 10;
--------------
select m1.text, m2.text
from mention m1, mention m2
where m1.sentence_id = m2.sentence_id
and m1.type like 'protein%'
and m2.type like 'protein%'
and m1.text < m2.text
limit 10
--------------

+-----------------------------------+------------------------+
| text                              | text                   |
+-----------------------------------+------------------------+
| NirI                              | nitrite reductase      |
| membrane protein                  | nitrite reductase      |
| membrane protein                  | NirI                   |
| membrane-bound protein            | NirI sequence          |
| cysteine-rich cytoplasmic domains | NirI sequence          |
| ferredoxin-like proteins          | NirI sequence          |
| cysteine-rich cytoplasmic domains | membrane-bound protein |
| ferredoxin-like proteins          | membrane-bound protein |
| NirI sequence                     | transmembrane helices  |
| membrane-bound protein            | transmembrane helices  |
+-----------------------------------+------------------------+
10 rows in set (0.00 sec)

When you are done, the client may be exited as follows:

mysql> quit
Bye

Discussion

Because we chose to present the JDBC API first and the LingPipe annotation API second, we have two separate programs, where a single program would be more efficient. It is left as an exercise to the reader to write a program which uses LingPipe to parse and annotate MEDLINE data in one pass.

The medline database developed in this example, and the code that accesses it, have been kept as simple as possible, without any attempt to optimize database performance. The data loading process is extremely inefficient: as each new piece of information is identified, it is inserted into the database with its own statement. For very large data sets, moving to a bulk-loading strategy would be advised. The database tables also haven't been tuned with respect to the expected set of queries that will be run against them. For example, if the mention table contains 5 million rows, a query like:

select * from mention where text = 'foobar'

could be quite slow, and if so, adding an index on the text column would be in order.
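As one illustration of the bulk-loading point above, JDBC supports batching, which queues many inserts and sends them to the server together instead of making one round trip per record. The sketch below is not part of the demo code: the citations array is hypothetical, and the fragment reuses the INSERT_CITATION_SQL string and MEDLINE accessors shown in part 1.

// Sketch only: batching citation inserts with JDBC.
PreparedStatement pstmt = mCon.prepareStatement(INSERT_CITATION_SQL);
mCon.setAutoCommit(false);                        // commit once at the end
for (int i = 0; i < citations.length; ++i) {      // citations[] is hypothetical
    pstmt.setString(1, citations[i].pmid());
    pstmt.setString(2, citations[i].article().articleTitleText());
    Abstract abstr = citations[i].article().abstrct();
    if (abstr != null)
        pstmt.setString(3, abstr.text());
    else
        pstmt.setNull(3, Types.VARCHAR);
    pstmt.addBatch();                             // queue the insert
}
pstmt.executeBatch();                             // send all queued inserts
mCon.commit();
mCon.setAutoCommit(true);
pstmt.close();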

Database optimization depends on the performance goals and/or constraints of the application(s) that the database supports. For a good introduction to this subject, we recommend the book SQL Performance Tuning.

References