MedPost/SKR Part of Speech Tagger



Java implementation of the MedPost/SKR Part of Speech Tagger for BioMedical Text.

The MedPost Tagger was originally developed by Larry Smith, Tom Rindflesch, and W. John Wilbur of the National Center for Biotechnology Information (NCBI) [Smith, Wilbur] and the Lister Hill National Center for Biomedical Communications (LHNCBC) [Rindflesch]. MedPost is currently written in a combination of C++ and Perl. It is described in the paper MedPost: A Part of Speech Tagger for BioMedical Text. Smith et al. Bioinformatics 2004;0:2271-0.

The MedPost/SKR Tagger is a Java-based implementation of the MedPost Tagger specifically formulated for the Semantic Knowledge Representation (SKR) work. MedPost/SKR has modified functionality and only produces SPECIALIST lexicon tags. The base algorithms are consistent between MedPost and MedPost/SKR.

MedPost is a stochastic part of speech tagger that employs a hidden Markov model (HMM) to combine contextual information with lexical information and so improve on baseline tagging accuracy. MedPost breaks the original text into sentences, tokenizes each sentence, and then tags the tokens. A static table of bigrams derived during the initial training phase is used to estimate the transition probabilities. The output probabilities of the HMM are determined for words in the lexicon by assuming equal probability for each of a word's possible tags. Output probabilities for unknown words are based on word orthography (e.g., upper or lowercase, numerics, etc.) and on word endings up to 4 letters long. The Viterbi algorithm is used to find the most likely tag sequence in the HMM matching the tokens.
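The decoding step described above can be illustrated with a small Viterbi decoder over a bigram HMM. This is a minimal sketch, not the MedPost/SKR source: the class name `ViterbiSketch`, the four-tag set, and every probability are made-up values chosen for illustration. The real tagger takes its transition probabilities from the trained bigram table and its output probabilities from the lexicon.

```java
import java.util.Arrays;

/** Toy Viterbi decoder for a bigram HMM. Names and numbers are illustrative assumptions. */
final class ViterbiSketch {

    // Hypothetical four-tag set, used only to keep the example small.
    static final String[] TAGS = {"det", "aux", "noun", "verb"};

    // START[t] = P(first tag is t); made-up values.
    static final double[] START = {0.5, 0.1, 0.3, 0.1};

    // TRANS[p][t] = P(tag t | previous tag p); stands in for the trained bigram table.
    static final double[][] TRANS = {
        {0.05, 0.10, 0.80, 0.05}, // after det
        {0.40, 0.05, 0.25, 0.30}, // after aux
        {0.10, 0.40, 0.20, 0.30}, // after noun
        {0.50, 0.05, 0.40, 0.05}, // after verb
    };

    /**
     * Returns the most likely tag sequence for the given output probabilities,
     * where emit[i][t] is the probability associated with token i under tag t.
     */
    static String[] decode(double[][] emit) {
        int n = emit.length;
        int k = TAGS.length;
        if (n == 0) {
            return new String[0];
        }
        double[][] score = new double[n][k]; // best log-probability ending in tag t at token i
        int[][] back = new int[n][k];        // previous tag on that best path

        for (int t = 0; t < k; t++) {
            score[0][t] = Math.log(START[t]) + Math.log(emit[0][t]);
        }
        for (int i = 1; i < n; i++) {
            for (int t = 0; t < k; t++) {
                double best = Double.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int p = 0; p < k; p++) {
                    double s = score[i - 1][p] + Math.log(TRANS[p][t]);
                    if (s > best) {
                        best = s;
                        bestPrev = p;
                    }
                }
                score[i][t] = best + Math.log(emit[i][t]);
                back[i][t] = bestPrev;
            }
        }

        // Pick the best final tag, then follow the back-pointers to the start.
        int last = 0;
        for (int t = 1; t < k; t++) {
            if (score[n - 1][t] > score[n - 1][last]) {
                last = t;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int i = n - 1; i > 0; i--) {
            path[i - 1] = back[i][path[i]];
        }
        String[] tags = new String[n];
        for (int i = 0; i < n; i++) {
            tags[i] = TAGS[path[i]];
        }
        return tags;
    }

    public static void main(String[] args) {
        // Toy output probabilities for "This is a test": the possible tags of each
        // word share equal probability, so the ambiguous "test" is half noun, half verb.
        double[][] emit = {
            {1.0, 0.0, 0.0, 0.0}, // This
            {0.0, 1.0, 0.0, 0.0}, // is
            {1.0, 0.0, 0.0, 0.0}, // a
            {0.0, 0.0, 0.5, 0.5}, // test
        };
        System.out.println(Arrays.toString(decode(emit))); // [det, aux, det, noun]
    }
}
```

With these toy numbers the lexicon alone cannot decide between noun and verb for "test"; the high det-to-noun transition probability settles it, which is the contextual information the HMM contributes.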

MedPost was trained specifically for tagging biological text by using MEDLINE abstracts as the training corpus.

Input is any free-format text. There are three modes of operation:

  1. Standalone - accepts an input file and an output file. In standalone mode, the tagger breaks the text into processing units at blank lines, so multiple "units" can be included in the input file.

  2. Called method - accepts a single unit of text to be tagged.

  3. Threaded Called method - same as above, except that this method is thread-safe via a synchronized method.
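Two of the behaviours listed above, splitting standalone input into units at blank lines and making a called method thread-safe with a synchronized method, can be sketched as follows. All names here (`TaggerModesSketch`, `UnitTagger`, `SynchronizedTagger`, `splitUnits`) are hypothetical and are not the MedPost/SKR API; the sketch only illustrates the pattern.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of the operating modes; every name here is hypothetical. */
final class TaggerModesSketch {

    /** Hypothetical stand-in for a "called method" that tags one unit of text. */
    interface UnitTagger {
        String tag(String unit);
    }

    /** Wraps any tagger so that only one thread at a time can run it. */
    static final class SynchronizedTagger implements UnitTagger {
        private final UnitTagger delegate;

        SynchronizedTagger(UnitTagger delegate) {
            this.delegate = delegate;
        }

        @Override
        public synchronized String tag(String unit) {
            return delegate.tag(unit);
        }
    }

    /** Splits free text into processing units at blank (empty or whitespace-only) lines. */
    static List<String> splitUnits(String text) {
        List<String> units = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\\R", -1)) {
            if (line.trim().isEmpty()) {
                flush(current, units);
            } else {
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        flush(current, units);
        return units;
    }

    private static void flush(StringBuilder current, List<String> units) {
        if (current.length() > 0) {
            units.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Placeholder tagger that only reports the unit length; a real tagger goes here.
        UnitTagger tagger = new SynchronizedTagger(unit -> "unit of " + unit.length() + " chars");
        for (String unit : splitUnits("First unit.\nStill first.\n\nSecond unit.\n")) {
            System.out.println(tagger.tag(unit));
        }
    }
}
```

Synchronizing on the wrapper serializes calls, which is simple and safe but means concurrent callers wait for one another.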

The output from all operation modes is a Prolog-formatted string in which each token and its associated tag are paired in a list (see example below).

Example:

Input:
     This is a test.

Result:
     [
       ['This', 'det'],
       ['is', 'aux'],
       ['a', 'det'],
       ['test', 'noun'],
       ['.', 'pd']
     ].
     ^THE_END^
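A caller that wants token/tag pairs rather than the raw string could parse the result along these lines. This is a sketch under stated assumptions: the class name `PrologOutputSketch` is hypothetical, and the pattern assumes that tokens contain no embedded single quotes, since the quoting rules for such tokens are not described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for the Prolog-formatted result; not part of the tagger itself. */
final class PrologOutputSketch {

    // Matches one ['token', 'tag'] pair. Assumes the token has no embedded single quote.
    private static final Pattern PAIR = Pattern.compile("\\['(.*?)',\\s*'(\\w+)'\\]");

    /** Returns {token, tag} pairs in the order they appear in the output. */
    static List<String[]> parse(String output) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(output);
        while (m.find()) {
            pairs.add(new String[] {m.group(1), m.group(2)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        String output = "[\n"
                + "  ['This', 'det'],\n"
                + "  ['is', 'aux'],\n"
                + "  ['a', 'det'],\n"
                + "  ['test', 'noun'],\n"
                + "  ['.', 'pd']\n"
                + "].\n"
                + "^THE_END^\n";
        for (String[] pair : parse(output)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```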


Last Modified: March 30, 2007