Initial commit
199
venv/lib/python3.7/site-packages/nltk/chunk/__init__.py
Normal file
@@ -0,0 +1,199 @@
|
||||
# Natural Language Toolkit: Chunkers
|
||||
#
|
||||
# Copyright (C) 2001-2019 NLTK Project
|
||||
# Author: Steven Bird <stevenbird1@gmail.com>
|
||||
# Edward Loper <edloper@gmail.com>
|
||||
# URL: <http://nltk.org/>
|
||||
# For license information, see LICENSE.TXT
|
||||
#
|
||||
|
||||
"""
|
||||
Classes and interfaces for identifying non-overlapping linguistic
|
||||
groups (such as base noun phrases) in unrestricted text. This task is
|
||||
called "chunk parsing" or "chunking", and the identified groups are
|
||||
called "chunks". The chunked text is represented using a shallow
|
||||
tree called a "chunk structure." A chunk structure is a tree
|
||||
containing tokens and chunks, where each chunk is a subtree containing
|
||||
only tokens. For example, the chunk structure for base noun phrase
|
||||
chunks in the sentence "I saw the big dog on the hill" is::
|
||||
|
||||
(SENTENCE:
|
||||
(NP: <I>)
|
||||
<saw>
|
||||
(NP: <the> <big> <dog>)
|
||||
<on>
|
||||
(NP: <the> <hill>))
|
||||
|
||||
To convert a chunk structure back to a list of tokens, simply use the
|
||||
chunk structure's ``leaves()`` method.
|
||||
|
||||
This module defines ``ChunkParserI``, a standard interface for
|
||||
chunking texts; and ``RegexpChunkParser``, a regular-expression based
|
||||
implementation of that interface. It also defines ``ChunkScore``, a
|
||||
utility class for scoring chunk parsers.
|
||||
|
||||
RegexpChunkParser
|
||||
=================
|
||||
|
||||
``RegexpChunkParser`` is an implementation of the chunk parser interface
|
||||
that uses regular expressions over tags to chunk a text. Its
|
||||
``parse()`` method first constructs a ``ChunkString``, which encodes a
|
||||
particular chunking of the input text. Initially, nothing is
|
||||
chunked. ``RegexpChunkParser`` then applies a sequence of
|
||||
``RegexpChunkRule`` rules to the ``ChunkString``, each of which modifies
|
||||
the chunking that it encodes. Finally, the ``ChunkString`` is
|
||||
transformed back into a chunk structure, which is returned.
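
As a rough illustration of the steps above, the sketch below encodes a
sentence as a string of ``<TAG>`` items and marks chunks by wrapping
matches in braces. The encoding and the helper name ``apply_chunk_rule``
are assumptions made for this example; they are a simplified stand-in,
not the actual ``ChunkString`` implementation.

```python
import re

# Assumed encoding for this sketch: one <TAG> per token, nothing chunked yet.
tags = "<DT><JJ><NN><VBD><DT><NN>"


def apply_chunk_rule(encoded, tag_regex):
    """Wrap every match of tag_regex in braces to mark it as a chunk."""
    return re.sub(tag_regex, lambda m: "{" + m.group(0) + "}", encoded)


# Optional determiner, any number of adjectives, one or more nouns.
np_regex = r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+"
chunked = apply_chunk_rule(tags, np_regex)
print(chunked)  # {<DT><JJ><NN>}<VBD>{<DT><NN>}
```

Each further rule would take the braced string as input and return a
modified one, which is why the rules can be applied in sequence.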
|
||||
|
||||
``RegexpChunkParser`` can only be used to chunk a single kind of phrase.
|
||||
For example, you can use a ``RegexpChunkParser`` to chunk the noun
|
||||
phrases in a text, or the verb phrases in a text; but you cannot
|
||||
use it to simultaneously chunk both noun phrases and verb phrases in
|
||||
the same text. (This is a limitation of ``RegexpChunkParser``, not of
|
||||
chunk parsers in general.)
|
||||
|
||||
RegexpChunkRules
|
||||
----------------
|
||||
|
||||
A ``RegexpChunkRule`` is a transformational rule that updates the
|
||||
chunking of a text by modifying its ``ChunkString``. Each
|
||||
``RegexpChunkRule`` defines the ``apply()`` method, which modifies
|
||||
the chunking encoded by a ``ChunkString``. The
|
||||
``RegexpChunkRule`` class itself can be used to implement any
|
||||
transformational rule based on regular expressions. There are
|
||||
also a number of subclasses, which can be used to implement
|
||||
simpler types of rules:
|
||||
|
||||
- ``ChunkRule`` chunks anything that matches a given regular
|
||||
expression.
|
||||
- ``ChinkRule`` chinks anything that matches a given regular
|
||||
expression.
|
||||
- ``UnChunkRule`` will un-chunk any chunk that matches a given
|
||||
regular expression.
|
||||
- ``MergeRule`` can be used to merge two contiguous chunks.
|
||||
- ``SplitRule`` can be used to split a single chunk into two
|
||||
smaller chunks.
|
||||
- ``ExpandLeftRule`` will expand a chunk to incorporate new
|
||||
unchunked material on the left.
|
||||
- ``ExpandRightRule`` will expand a chunk to incorporate new
|
||||
unchunked material on the right.
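
To make the difference between chunking and chinking concrete, here is a
small sketch of a chink operation on a brace-delimited tag string. The
string encoding and the helper ``sketch_chink`` are hypothetical and
written only for this example; the real rule classes live in
``nltk.chunk.regexp`` and may behave differently in edge cases.

```python
import re


def sketch_chink(encoded, tag_regex):
    """Remove every match of tag_regex from the chunk that contains it."""

    def split_chunk(chunk_match):
        inner = chunk_match.group(1)
        # Close the chunk before the match and reopen it afterwards.
        split = re.sub(tag_regex, lambda m: "}" + m.group(0) + "{", inner)
        # Drop empty chunks left behind at either edge.
        return ("{" + split + "}").replace("{}", "")

    # Only text inside {...} is rewritten; unchunked tags are left alone.
    return re.sub(r"\{([^{}]*)\}", split_chunk, encoded)


everything = "{<DT><JJ><NN><VBD><DT><NN>}"
print(sketch_chink(everything, r"<VBD>"))  # {<DT><JJ><NN>}<VBD>{<DT><NN>}
```

Starting from one chunk that covers the whole sentence and chinking the
verb yields the same two noun phrase chunks that a chunk rule would
build directly.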
|
||||
|
||||
Tag Patterns
|
||||
~~~~~~~~~~~~
|
||||
|
||||
A ``RegexpChunkRule`` uses a modified version of regular
|
||||
expression patterns, called "tag patterns". Tag patterns are
|
||||
used to match sequences of tags. Examples of tag patterns are::
|
||||
|
||||
r'(<DT>|<JJ>|<NN>)+'
|
||||
r'<NN>+'
|
||||
r'<NN.*>'
|
||||
|
||||
The differences between regular expression patterns and tag
|
||||
patterns are:
|
||||
|
||||
- In tag patterns, ``'<'`` and ``'>'`` act as parentheses; so
|
||||
``'<NN>+'`` matches one or more repetitions of ``'<NN>'``, not
|
||||
``'<NN'`` followed by one or more repetitions of ``'>'``.
|
||||
- Whitespace in tag patterns is ignored. So
|
||||
``'<DT> | <NN>'`` is equivalent to ``'<DT>|<NN>'``.
|
||||
- In tag patterns, ``'.'`` is equivalent to ``'[^{}<>]'``; so
|
||||
``'<NN.*>'`` matches any single tag starting with ``'NN'``.
|
||||
|
||||
The function ``tag_pattern2re_pattern`` can be used to transform
|
||||
a tag pattern to an equivalent regular expression pattern.
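
The three differences listed above can be sketched as three string
rewrites. The function below, ``sketch_tag_pattern_to_regex``, is a
hypothetical simplification written for this example; it skips the
validation that the real ``tag_pattern2re_pattern`` performs and should
not be read as its implementation.

```python
import re


def sketch_tag_pattern_to_regex(tag_pattern):
    """Translate a tag pattern into an ordinary regular expression."""
    # Whitespace in tag patterns is ignored.
    pattern = re.sub(r"\s", "", tag_pattern)
    # '<' and '>' act as parentheses around a whole tag.
    pattern = pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # '.' may not cross a tag or chunk boundary.
    pattern = pattern.replace(".", "[^{}<>]")
    return pattern


noun_run = sketch_tag_pattern_to_regex("<NN>+")
any_noun = sketch_tag_pattern_to_regex("<NN.*>")
print(bool(re.fullmatch(noun_run, "<NN><NN><NN>")))  # True
print(bool(re.fullmatch(any_noun, "<NNS>")))         # True
print(bool(re.fullmatch(any_noun, "<NNS><VB>")))     # False
```

The last line shows why ``'.'`` is restricted: without the restriction,
``<NN.*>`` could swallow several tags at once.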
|
||||
|
||||
Efficiency
|
||||
----------
|
||||
|
||||
Preliminary tests indicate that ``RegexpChunkParser`` can chunk at a
|
||||
rate of about 300 tokens/second, with a moderately complex rule set.
|
||||
|
||||
There may be problems if ``RegexpChunkParser`` is used with more than
|
||||
5,000 tokens at a time. In particular, evaluation of some regular
|
||||
expressions may cause the Python regular expression engine to
|
||||
exceed its maximum recursion depth. We have attempted to minimize
|
||||
these problems, but it is impossible to avoid them completely. We
|
||||
therefore recommend that you apply the chunk parser to a single
|
||||
sentence at a time.
|
||||
|
||||
Emacs Tip
|
||||
---------
|
||||
|
||||
If you evaluate the following elisp expression in emacs, it will
|
||||
colorize a ``ChunkString`` when you use an interactive python shell
|
||||
with emacs or xemacs ("C-c !")::
|
||||
|
||||
(let ()
|
||||
(defconst comint-mode-font-lock-keywords
|
||||
'(("<[^>]+>" 0 'font-lock-reference-face)
|
||||
("[{}]" 0 'font-lock-function-name-face)))
|
||||
(add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))
|
||||
|
||||
You can evaluate this code by copying it to a temporary buffer,
|
||||
placing the cursor after the last close parenthesis, and typing
|
||||
"``C-x C-e``". You should evaluate it before running the interactive
|
||||
session. The change will last until you close emacs.
|
||||
|
||||
Unresolved Issues
|
||||
-----------------
|
||||
|
||||
If we use the ``re`` module for regular expressions, Python's
|
||||
regular expression engine generates "maximum recursion depth
|
||||
exceeded" errors when processing very large texts, even for
|
||||
regular expressions that should not require any recursion. We
|
||||
therefore use the ``pre`` module instead. But note that ``pre``
|
||||
does not include Unicode support, so this module will not work
|
||||
with unicode strings. Note also that ``pre`` regular expressions
|
||||
are not quite as advanced as ``re`` ones (e.g., no leftward
|
||||
zero-length assertions).
|
||||
|
||||
:type CHUNK_TAG_PATTERN: regexp
|
||||
:var CHUNK_TAG_PATTERN: A regular expression to test whether a tag
|
||||
pattern is valid.
|
||||
"""
|
||||
|
||||
from nltk.data import load
|
||||
|
||||
from nltk.chunk.api import ChunkParserI
|
||||
from nltk.chunk.util import (
    ChunkScore,
    accuracy,
    tagstr2tree,
    conllstr2tree,
    conlltags2tree,
    tree2conlltags,
    tree2conllstr,
    ieerstr2tree,
)
|
||||
from nltk.chunk.regexp import RegexpChunkParser, RegexpParser
|
||||
|
||||
# Pickled named entity chunkers (binary and multiclass)
|
||||
_BINARY_NE_CHUNKER = 'chunkers/maxent_ne_chunker/english_ace_binary.pickle'
|
||||
_MULTICLASS_NE_CHUNKER = 'chunkers/maxent_ne_chunker/english_ace_multiclass.pickle'
|
||||
|
||||
|
||||
def ne_chunk(tagged_tokens, binary=False):
    """
    Use NLTK's currently recommended named entity chunker to
    chunk the given list of tagged tokens.
    """
    if binary:
        chunker_pickle = _BINARY_NE_CHUNKER
    else:
        chunker_pickle = _MULTICLASS_NE_CHUNKER
    chunker = load(chunker_pickle)
    return chunker.parse(tagged_tokens)


def ne_chunk_sents(tagged_sentences, binary=False):
    """
    Use NLTK's currently recommended named entity chunker to chunk the
    given list of tagged sentences, each consisting of a list of tagged tokens.
    """
    if binary:
        chunker_pickle = _BINARY_NE_CHUNKER
    else:
        chunker_pickle = _MULTICLASS_NE_CHUNKER
    chunker = load(chunker_pickle)
    return chunker.parse_sents(tagged_sentences)
|
||||
Binary file not shown.
52
venv/lib/python3.7/site-packages/nltk/chunk/api.py
Normal file
@@ -0,0 +1,52 @@
|
||||
# Natural Language Toolkit: Chunk parsing API
|
||||
#
|
||||
# Copyright (C) 2001-2019 NLTK Project
|
||||
# Author: Edward Loper <edloper@gmail.com>
|
||||
# Steven Bird <stevenbird1@gmail.com> (minor additions)
|
||||
# URL: <http://nltk.org/>
|
||||
# For license information, see LICENSE.TXT
|
||||
|
||||
##//////////////////////////////////////////////////////
|
||||
## Chunk Parser Interface
|
||||
##//////////////////////////////////////////////////////
|
||||
|
||||
from nltk.parse import ParserI
|
||||
|
||||
from nltk.chunk.util import ChunkScore
|
||||
|
||||
|
||||
class ChunkParserI(ParserI):
    """
    A processing interface for identifying non-overlapping groups in
    unrestricted text. Typically, chunk parsers are used to find base
    syntactic constituents, such as base noun phrases. Unlike
    ``ParserI``, ``ChunkParserI`` guarantees that the ``parse()`` method
    will always generate a parse.
    """

    def parse(self, tokens):
        """
        Return the best chunk structure for the given tokens
        as a tree.

        :param tokens: The list of (word, tag) tokens to be chunked.
        :type tokens: list(tuple)
        :rtype: Tree
        """
        raise NotImplementedError()

    def evaluate(self, gold):
        """
        Score the accuracy of the chunker against the gold standard.
        Remove the chunking from the gold standard text, rechunk it using
        the chunker, and return a ``ChunkScore`` object
        reflecting the performance of this chunk parser.

        :type gold: list(Tree)
        :param gold: The list of chunked sentences to score the chunker on.
        :rtype: ChunkScore
        """
        chunkscore = ChunkScore()
        for correct in gold:
            chunkscore.score(correct, self.parse(correct.leaves()))
        return chunkscore
|
||||
354
venv/lib/python3.7/site-packages/nltk/chunk/named_entity.py
Normal file
@@ -0,0 +1,354 @@
|
||||
# Natural Language Toolkit: Named entity chunker
|
||||
#
|
||||
# Copyright (C) 2001-2019 NLTK Project
|
||||
# Author: Edward Loper <edloper@gmail.com>
|
||||
# URL: <http://nltk.org/>
|
||||
# For license information, see LICENSE.TXT
|
||||
|
||||
"""
|
||||
Named entity chunker
|
||||
"""
|
||||
from __future__ import print_function
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import os, re, pickle
|
||||
from xml.etree import ElementTree as ET
|
||||
|
||||
from nltk.tag import ClassifierBasedTagger, pos_tag
|
||||
|
||||
try:
|
||||
from nltk.classify import MaxentClassifier
|
||||
except ImportError:
|
||||
pass
|
||||
|
||||
from nltk.tree import Tree
|
||||
from nltk.tokenize import word_tokenize
|
||||
from nltk.data import find
|
||||
|
||||
from nltk.chunk.api import ChunkParserI
|
||||
from nltk.chunk.util import ChunkScore
|
||||
|
||||
|
||||
class NEChunkParserTagger(ClassifierBasedTagger):
|
||||
"""
|
||||
The IOB tagger used by the chunk parser.
|
||||
"""
|
||||
|
||||
def __init__(self, train):
|
||||
ClassifierBasedTagger.__init__(
|
||||
self, train=train, classifier_builder=self._classifier_builder
|
||||
)
|
||||
|
||||
def _classifier_builder(self, train):
|
||||
return MaxentClassifier.train(
|
||||
train, algorithm='megam', gaussian_prior_sigma=1, trace=2
|
||||
)
|
||||
|
||||
def _english_wordlist(self):
|
||||
try:
|
||||
wl = self._en_wordlist
|
||||
except AttributeError:
|
||||
from nltk.corpus import words
|
||||
|
||||
self._en_wordlist = set(words.words('en-basic'))
|
||||
wl = self._en_wordlist
|
||||
return wl
|
||||
|
||||
def _feature_detector(self, tokens, index, history):
|
||||
word = tokens[index][0]
|
||||
pos = simplify_pos(tokens[index][1])
|
||||
if index == 0:
|
||||
prevword = prevprevword = None
|
||||
prevpos = prevprevpos = None
|
||||
prevshape = prevtag = prevprevtag = None
|
||||
elif index == 1:
|
||||
prevword = tokens[index - 1][0].lower()
|
||||
prevprevword = None
|
||||
prevpos = simplify_pos(tokens[index - 1][1])
|
||||
prevprevpos = None
|
||||
prevtag = history[index - 1][0]
|
||||
prevshape = prevprevtag = None
|
||||
else:
|
||||
prevword = tokens[index - 1][0].lower()
|
||||
prevprevword = tokens[index - 2][0].lower()
|
||||
prevpos = simplify_pos(tokens[index - 1][1])
|
||||
prevprevpos = simplify_pos(tokens[index - 2][1])
|
||||
prevtag = history[index - 1]
|
||||
prevprevtag = history[index - 2]
|
||||
prevshape = shape(prevword)
|
||||
if index == len(tokens) - 1:
|
||||
nextword = nextnextword = None
|
||||
nextpos = nextnextpos = None
|
||||
elif index == len(tokens) - 2:
|
||||
nextword = tokens[index + 1][0].lower()
|
||||
nextpos = tokens[index + 1][1].lower()
|
||||
nextnextword = None
|
||||
nextnextpos = None
|
||||
else:
|
||||
nextword = tokens[index + 1][0].lower()
|
||||
nextpos = tokens[index + 1][1].lower()
|
||||
nextnextword = tokens[index + 2][0].lower()
|
||||
nextnextpos = tokens[index + 2][1].lower()
|
||||
|
||||
# 89.6
|
||||
features = {
|
||||
'bias': True,
|
||||
'shape': shape(word),
|
||||
'wordlen': len(word),
|
||||
'prefix3': word[:3].lower(),
|
||||
'suffix3': word[-3:].lower(),
|
||||
'pos': pos,
|
||||
'word': word,
|
||||
'en-wordlist': (word in self._english_wordlist()),
|
||||
'prevtag': prevtag,
|
||||
'prevpos': prevpos,
|
||||
'nextpos': nextpos,
|
||||
'prevword': prevword,
|
||||
'nextword': nextword,
|
||||
'word+nextpos': '{0}+{1}'.format(word.lower(), nextpos),
|
||||
'pos+prevtag': '{0}+{1}'.format(pos, prevtag),
|
||||
'shape+prevtag': '{0}+{1}'.format(prevshape, prevtag),
|
||||
}
|
||||
|
||||
return features
|
||||
|
||||
|
||||
class NEChunkParser(ChunkParserI):
|
||||
"""
|
||||
Expected input: list of pos-tagged words
|
||||
"""
|
||||
|
||||
def __init__(self, train):
|
||||
self._train(train)
|
||||
|
||||
def parse(self, tokens):
|
||||
"""
|
||||
Each token should be a pos-tagged word
|
||||
"""
|
||||
tagged = self._tagger.tag(tokens)
|
||||
tree = self._tagged_to_parse(tagged)
|
||||
return tree
|
||||
|
||||
def _train(self, corpus):
|
||||
# Convert to tagged sequence
|
||||
corpus = [self._parse_to_tagged(s) for s in corpus]
|
||||
|
||||
self._tagger = NEChunkParserTagger(train=corpus)
|
||||
|
||||
def _tagged_to_parse(self, tagged_tokens):
|
||||
"""
|
||||
Convert a list of tagged tokens to a chunk-parse tree.
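
The grouping logic can be illustrated without ``Tree`` objects. The
sketch below is a simplified stand-in that uses a hypothetical ``Chunk``
dataclass and plain lists; the names ``Chunk`` and ``iob_to_chunks`` are
assumptions for this example and are not part of NLTK.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a labelled subtree."""
    label: str
    tokens: list = field(default_factory=list)


def iob_to_chunks(tagged_tokens):
    """Group (token, IOB tag) pairs into loose tokens and Chunk objects."""
    sentence = []
    for token, tag in tagged_tokens:
        if tag == "O":
            sentence.append(token)
        elif tag.startswith("B-"):
            sentence.append(Chunk(tag[2:], [token]))
        elif tag.startswith("I-"):
            last = sentence[-1] if sentence else None
            if isinstance(last, Chunk) and last.label == tag[2:]:
                # Continue the chunk that is already open.
                last.tokens.append(token)
            else:
                # An I- tag with no matching open chunk starts a new one.
                sentence.append(Chunk(tag[2:], [token]))
    return sentence


tagged = [("Alice", "B-PERSON"), ("Smith", "I-PERSON"),
          ("visited", "O"), ("Paris", "I-GPE")]
result = iob_to_chunks(tagged)
```

Here ``result`` holds a ``PERSON`` chunk with two tokens, the loose
token ``"visited"``, and a ``GPE`` chunk opened by the stray ``I-GPE``
tag.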
|
||||
"""
|
||||
sent = Tree('S', [])
|
||||
|
||||
for (tok, tag) in tagged_tokens:
|
||||
if tag == 'O':
|
||||
sent.append(tok)
|
||||
elif tag.startswith('B-'):
|
||||
sent.append(Tree(tag[2:], [tok]))
|
||||
elif tag.startswith('I-'):
|
||||
if sent and isinstance(sent[-1], Tree) and sent[-1].label() == tag[2:]:
|
||||
sent[-1].append(tok)
|
||||
else:
|
||||
sent.append(Tree(tag[2:], [tok]))
|
||||
return sent
|
||||
|
||||
@staticmethod
|
||||
def _parse_to_tagged(sent):
|
||||
"""
|
||||
Convert a chunk-parse tree to a list of tagged tokens.
|
||||
"""
|
||||
toks = []
|
||||
for child in sent:
|
||||
if isinstance(child, Tree):
|
||||
if len(child) == 0:
|
||||
print("Warning -- empty chunk in sentence")
|
||||
continue
|
||||
toks.append((child[0], 'B-{0}'.format(child.label())))
|
||||
for tok in child[1:]:
|
||||
toks.append((tok, 'I-{0}'.format(child.label())))
|
||||
else:
|
||||
toks.append((child, 'O'))
|
||||
return toks
|
||||
|
||||
|
||||
def shape(word):
    if re.match(r'[0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+$', word, re.UNICODE):
        return 'number'
    elif re.match(r'\W+$', word, re.UNICODE):
        return 'punct'
    elif re.match(r'\w+$', word, re.UNICODE):
        if word.istitle():
            return 'upcase'
        elif word.islower():
            return 'downcase'
        else:
            return 'mixedcase'
    else:
        return 'other'
|
||||
|
||||
|
||||
def simplify_pos(s):
|
||||
if s.startswith('V'):
|
||||
return "V"
|
||||
else:
|
||||
return s.split('-')[0]
|
||||
|
||||
|
||||
def postag_tree(tree):
|
||||
# Part-of-speech tagging.
|
||||
words = tree.leaves()
|
||||
tag_iter = (pos for (word, pos) in pos_tag(words))
|
||||
newtree = Tree('S', [])
|
||||
for child in tree:
|
||||
if isinstance(child, Tree):
|
||||
newtree.append(Tree(child.label(), []))
|
||||
for subchild in child:
|
||||
newtree[-1].append((subchild, next(tag_iter)))
|
||||
else:
|
||||
newtree.append((child, next(tag_iter)))
|
||||
return newtree
|
||||
|
||||
|
||||
def load_ace_data(roots, fmt='binary', skip_bnews=True):
|
||||
for root in roots:
|
||||
for root, dirs, files in os.walk(root):
|
||||
if root.endswith('bnews') and skip_bnews:
|
||||
continue
|
||||
for f in files:
|
||||
if f.endswith('.sgm'):
|
||||
for sent in load_ace_file(os.path.join(root, f), fmt):
|
||||
yield sent
|
||||
|
||||
|
||||
def load_ace_file(textfile, fmt):
|
||||
print(' - {0}'.format(os.path.split(textfile)[1]))
|
||||
annfile = textfile + '.tmx.rdc.xml'
|
||||
|
||||
# Read the xml file, and get a list of entities
|
||||
entities = []
|
||||
with open(annfile, 'r') as infile:
|
||||
xml = ET.parse(infile).getroot()
|
||||
for entity in xml.findall('document/entity'):
|
||||
typ = entity.find('entity_type').text
|
||||
for mention in entity.findall('entity_mention'):
|
||||
if mention.get('TYPE') != 'NAME':
|
||||
continue # only NEs
|
||||
s = int(mention.find('head/charseq/start').text)
|
||||
e = int(mention.find('head/charseq/end').text) + 1
|
||||
entities.append((s, e, typ))
|
||||
|
||||
# Read the text file, and mark the entities.
|
||||
with open(textfile, 'r') as infile:
|
||||
text = infile.read()
|
||||
|
||||
# Strip XML tags, since they don't count towards the indices
|
||||
text = re.sub('<(?!/?TEXT)[^>]+>', '', text)
|
||||
|
||||
# Blank out anything before/after <TEXT>
|
||||
def subfunc(m):
|
||||
return ' ' * (m.end() - m.start() - 6)
|
||||
|
||||
text = re.sub('[\s\S]*<TEXT>', subfunc, text)
|
||||
text = re.sub('</TEXT>[\s\S]*', '', text)
|
||||
|
||||
# Simplify quotes
|
||||
text = re.sub("``", ' "', text)
|
||||
text = re.sub("''", '" ', text)
|
||||
|
||||
entity_types = set(typ for (s, e, typ) in entities)
|
||||
|
||||
# Binary distinction (NE or not NE)
|
||||
if fmt == 'binary':
|
||||
i = 0
|
||||
toks = Tree('S', [])
|
||||
for (s, e, typ) in sorted(entities):
|
||||
if s < i:
|
||||
s = i # Overlapping! Deal with this better?
|
||||
if e <= s:
|
||||
continue
|
||||
toks.extend(word_tokenize(text[i:s]))
|
||||
toks.append(Tree('NE', text[s:e].split()))
|
||||
i = e
|
||||
toks.extend(word_tokenize(text[i:]))
|
||||
yield toks
|
||||
|
||||
# Multiclass distinction (NE type)
|
||||
elif fmt == 'multiclass':
|
||||
i = 0
|
||||
toks = Tree('S', [])
|
||||
for (s, e, typ) in sorted(entities):
|
||||
if s < i:
|
||||
s = i # Overlapping! Deal with this better?
|
||||
if e <= s:
|
||||
continue
|
||||
toks.extend(word_tokenize(text[i:s]))
|
||||
toks.append(Tree(typ, text[s:e].split()))
|
||||
i = e
|
||||
toks.extend(word_tokenize(text[i:]))
|
||||
yield toks
|
||||
|
||||
else:
|
||||
raise ValueError('bad fmt value')
|
||||
|
||||
|
||||
# This probably belongs in a more general-purpose location (as does
|
||||
# the parse_to_tagged function).
|
||||
def cmp_chunks(correct, guessed):
    correct = NEChunkParser._parse_to_tagged(correct)
    guessed = NEChunkParser._parse_to_tagged(guessed)
    ellipsis = False
    for (w, ct), (w, gt) in zip(correct, guessed):
        if ct == gt == 'O':
            if not ellipsis:
                # Use automatic field numbering throughout: mixing '{}'
                # with an explicit '{2}' raises ValueError in str.format.
                print(" {:15} {:15} {}".format(ct, gt, w))
                print(' {:15} {:15} {}'.format('...', '...', '...'))
                ellipsis = True
        else:
            ellipsis = False
            print(" {:15} {:15} {}".format(ct, gt, w))
|
||||
|
||||
|
||||
def build_model(fmt='binary'):
|
||||
print('Loading training data...')
|
||||
train_paths = [
|
||||
find('corpora/ace_data/ace.dev'),
|
||||
find('corpora/ace_data/ace.heldout'),
|
||||
find('corpora/ace_data/bbn.dev'),
|
||||
find('corpora/ace_data/muc.dev'),
|
||||
]
|
||||
train_trees = load_ace_data(train_paths, fmt)
|
||||
train_data = [postag_tree(t) for t in train_trees]
|
||||
print('Training...')
|
||||
cp = NEChunkParser(train_data)
|
||||
del train_data
|
||||
|
||||
print('Loading eval data...')
|
||||
eval_paths = [find('corpora/ace_data/ace.eval')]
|
||||
eval_trees = load_ace_data(eval_paths, fmt)
|
||||
eval_data = [postag_tree(t) for t in eval_trees]
|
||||
|
||||
print('Evaluating...')
|
||||
chunkscore = ChunkScore()
|
||||
for i, correct in enumerate(eval_data):
|
||||
guess = cp.parse(correct.leaves())
|
||||
chunkscore.score(correct, guess)
|
||||
if i < 3:
|
||||
cmp_chunks(correct, guess)
|
||||
print(chunkscore)
|
||||
|
||||
outfilename = '/tmp/ne_chunker_{0}.pickle'.format(fmt)
|
||||
print('Saving chunker to {0}...'.format(outfilename))
|
||||
|
||||
with open(outfilename, 'wb') as outfile:
|
||||
pickle.dump(cp, outfile, -1)
|
||||
|
||||
return cp
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
# Make sure that the pickled object has the right class name:
|
||||
from nltk.chunk.named_entity import build_model
|
||||
|
||||
build_model('binary')
|
||||
build_model('multiclass')
|
||||
1488
venv/lib/python3.7/site-packages/nltk/chunk/regexp.py
Normal file
File diff suppressed because it is too large
646
venv/lib/python3.7/site-packages/nltk/chunk/util.py
Normal file
@@ -0,0 +1,646 @@
|
||||
# Natural Language Toolkit: Chunk format conversions
|
||||
#
|
||||
# Copyright (C) 2001-2019 NLTK Project
|
||||
# Author: Edward Loper <edloper@gmail.com>
|
||||
# Steven Bird <stevenbird1@gmail.com> (minor additions)
|
||||
# URL: <http://nltk.org/>
|
||||
# For license information, see LICENSE.TXT
|
||||
from __future__ import print_function, unicode_literals, division
|
||||
|
||||
import re
|
||||
|
||||
from nltk.tree import Tree
|
||||
from nltk.tag.mapping import map_tag
|
||||
from nltk.tag.util import str2tuple
|
||||
from nltk.compat import python_2_unicode_compatible
|
||||
|
||||
##//////////////////////////////////////////////////////
|
||||
## EVALUATION
|
||||
##//////////////////////////////////////////////////////
|
||||
|
||||
from nltk.metrics import accuracy as _accuracy
|
||||
|
||||
|
||||
def accuracy(chunker, gold):
|
||||
"""
|
||||
Score the accuracy of the chunker against the gold standard.
|
||||
Strip the chunk information from the gold standard and rechunk it using
|
||||
the chunker, then compute the accuracy score.
|
||||
|
||||
:type chunker: ChunkParserI
|
||||
:param chunker: The chunker being evaluated.
|
||||
:type gold: tree
|
||||
:param gold: The chunk structures to score the chunker on.
|
||||
:rtype: float
|
||||
"""
|
||||
|
||||
gold_tags = []
|
||||
test_tags = []
|
||||
for gold_tree in gold:
|
||||
test_tree = chunker.parse(gold_tree.flatten())
|
||||
gold_tags += tree2conlltags(gold_tree)
|
||||
test_tags += tree2conlltags(test_tree)
|
||||
|
||||
# print 'GOLD:', gold_tags[:50]
|
||||
# print 'TEST:', test_tags[:50]
|
||||
return _accuracy(gold_tags, test_tags)
|
||||
|
||||
|
||||
# Patched for increased performance by Yoav Goldberg <yoavg@cs.bgu.ac.il>, 2006-01-13
|
||||
# -- statistics are evaluated only on demand, instead of at every sentence evaluation
|
||||
#
|
||||
# SB: use nltk.metrics for precision/recall scoring?
|
||||
#
|
||||
class ChunkScore(object):
|
||||
"""
|
||||
A utility class for scoring chunk parsers. ``ChunkScore`` can
|
||||
evaluate a chunk parser's output, based on a number of statistics
|
||||
(precision, recall, f-measure, missed chunks, incorrect chunks).
|
||||
It can also combine the scores from the parsing of multiple texts;
|
||||
this makes it significantly easier to evaluate a chunk parser that
|
||||
operates one sentence at a time.
|
||||
|
||||
Texts are evaluated with the ``score`` method. The results of
|
||||
evaluation can be accessed via a number of accessor methods, such
|
||||
as ``precision`` and ``f_measure``. A typical use of the
|
||||
``ChunkScore`` class is::
|
||||
|
||||
>>> chunkscore = ChunkScore() # doctest: +SKIP
|
||||
>>> for correct in correct_sentences: # doctest: +SKIP
|
||||
... guess = chunkparser.parse(correct.leaves()) # doctest: +SKIP
|
||||
... chunkscore.score(correct, guess) # doctest: +SKIP
|
||||
>>> print('F Measure:', chunkscore.f_measure()) # doctest: +SKIP
|
||||
F Measure: 0.823
|
||||
|
||||
:ivar kwargs: Keyword arguments:
|
||||
|
||||
- max_tp_examples: The maximum number of actual examples of true
|
||||
positives to record. This affects the ``correct`` member
|
||||
function: ``correct`` will not return more than this number
|
||||
of true positive examples. This does *not* affect any of
|
||||
the numerical metrics (precision, recall, or f-measure)
|
||||
|
||||
- max_fp_examples: The maximum number of actual examples of false
|
||||
positives to record. This affects the ``incorrect`` member
|
||||
function and the ``guessed`` member function: ``incorrect``
|
||||
will not return more than this number of examples, and
|
||||
``guessed`` will not return more than this number of true
|
||||
positive examples. This does *not* affect any of the
|
||||
numerical metrics (precision, recall, or f-measure)
|
||||
|
||||
- max_fn_examples: The maximum number of actual examples of false
|
||||
negatives to record. This affects the ``missed`` member
|
||||
function and the ``correct`` member function: ``missed``
|
||||
will not return more than this number of examples, and
|
||||
``correct`` will not return more than this number of true
|
||||
negative examples. This does *not* affect any of the
|
||||
numerical metrics (precision, recall, or f-measure)
|
||||
|
||||
- chunk_label: A regular expression indicating which chunks
|
||||
should be compared. Defaults to ``'.*'`` (i.e., all chunks).
|
||||
|
||||
:type _tp: list(Token)
|
||||
:ivar _tp: List of true positives
|
||||
:type _fp: list(Token)
|
||||
:ivar _fp: List of false positives
|
||||
:type _fn: list(Token)
|
||||
:ivar _fn: List of false negatives
|
||||
|
||||
:type _tp_num: int
|
||||
:ivar _tp_num: Number of true positives
|
||||
:type _fp_num: int
|
||||
:ivar _fp_num: Number of false positives
|
||||
:type _fn_num: int
|
||||
:ivar _fn_num: Number of false negatives.
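
The sets documented above follow from plain set operations on the
correct and guessed chunks. The sketch below uses made-up data in which
each chunk is keyed by ``(sentence index, position of first word)``; the
string chunk descriptions and the helper ``chunk_counts`` are
assumptions for this example, standing in for frozen subtrees.

```python
def chunk_counts(correct, guessed):
    """Split two chunk sets into true positives, false positives, false negatives."""
    true_pos = guessed & correct
    false_pos = guessed - correct
    false_neg = correct - guessed
    return true_pos, false_pos, false_neg


# "I saw the big dog on the hill": chunks start at words 0, 2 and 6.
correct = {((0, 0), "NP: I"), ((0, 2), "NP: the big dog"), ((0, 6), "NP: the hill")}
guessed = {((0, 0), "NP: I"), ((0, 2), "NP: the big"), ((0, 6), "NP: the hill")}

tp, fp, fn = chunk_counts(correct, guessed)
precision = len(tp) / (len(tp) + len(fp))
recall = len(tp) / (len(tp) + len(fn))
```

A chunk counts as correct only if both its position and its content
match, so the truncated guess ``the big`` is a false positive and the
gold chunk ``the big dog`` is a false negative.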
|
||||
"""
|
||||
|
||||
def __init__(self, **kwargs):
|
||||
self._correct = set()
|
||||
self._guessed = set()
|
||||
self._tp = set()
|
||||
self._fp = set()
|
||||
self._fn = set()
|
||||
self._max_tp = kwargs.get('max_tp_examples', 100)
|
||||
self._max_fp = kwargs.get('max_fp_examples', 100)
|
||||
self._max_fn = kwargs.get('max_fn_examples', 100)
|
||||
self._chunk_label = kwargs.get('chunk_label', '.*')
|
||||
self._tp_num = 0
|
||||
self._fp_num = 0
|
||||
self._fn_num = 0
|
||||
self._count = 0
|
||||
self._tags_correct = 0.0
|
||||
self._tags_total = 0.0
|
||||
|
||||
self._measuresNeedUpdate = False
|
||||
|
||||
def _updateMeasures(self):
|
||||
if self._measuresNeedUpdate:
|
||||
self._tp = self._guessed & self._correct
|
||||
self._fn = self._correct - self._guessed
|
||||
self._fp = self._guessed - self._correct
|
||||
self._tp_num = len(self._tp)
|
||||
self._fp_num = len(self._fp)
|
||||
self._fn_num = len(self._fn)
|
||||
self._measuresNeedUpdate = False
|
||||
|
||||
def score(self, correct, guessed):
|
||||
"""
|
||||
Given a correctly chunked sentence, score another chunked
|
||||
version of the same sentence.
|
||||
|
||||
:type correct: chunk structure
|
||||
:param correct: The known-correct ("gold standard") chunked
|
||||
sentence.
|
||||
:type guessed: chunk structure
|
||||
:param guessed: The chunked sentence to be scored.
|
||||
"""
|
||||
self._correct |= _chunksets(correct, self._count, self._chunk_label)
|
||||
self._guessed |= _chunksets(guessed, self._count, self._chunk_label)
|
||||
self._count += 1
|
||||
self._measuresNeedUpdate = True
|
||||
# Keep track of per-tag accuracy (if possible)
|
||||
try:
|
||||
correct_tags = tree2conlltags(correct)
|
||||
guessed_tags = tree2conlltags(guessed)
|
||||
except ValueError:
|
||||
# This exception case is for nested chunk structures,
|
||||
# where tree2conlltags will fail with a ValueError: "Tree
|
||||
# is too deeply nested to be printed in CoNLL format."
|
||||
correct_tags = guessed_tags = ()
|
||||
self._tags_total += len(correct_tags)
|
||||
self._tags_correct += sum(
|
||||
1 for (t, g) in zip(guessed_tags, correct_tags) if t == g
|
||||
)
|
||||
|
||||
def accuracy(self):
|
||||
"""
|
||||
Return the overall tag-based accuracy for all texts that have
|
||||
been scored by this ``ChunkScore``, using the IOB (conll2000)
|
||||
tag encoding.
|
||||
|
||||
:rtype: float
|
||||
"""
|
||||
if self._tags_total == 0:
|
||||
return 1
|
||||
return self._tags_correct / self._tags_total
|
||||
|
||||
def precision(self):
|
||||
"""
|
||||
Return the overall precision for all texts that have been
|
||||
scored by this ``ChunkScore``.
|
||||
|
||||
:rtype: float
|
||||
"""
|
||||
self._updateMeasures()
|
||||
div = self._tp_num + self._fp_num
|
||||
if div == 0:
|
||||
return 0
|
||||
else:
|
||||
return self._tp_num / div
|
||||
|
||||
def recall(self):
|
||||
"""
|
||||
Return the overall recall for all texts that have been
|
||||
scored by this ``ChunkScore``.
|
||||
|
||||
:rtype: float
|
||||
"""
|
||||
self._updateMeasures()
|
||||
div = self._tp_num + self._fn_num
|
||||
if div == 0:
|
||||
return 0
|
||||
else:
|
||||
return self._tp_num / div
|
||||
|
||||
def f_measure(self, alpha=0.5):
|
||||
"""
|
||||
Return the overall F measure for all texts that have been
|
||||
scored by this ``ChunkScore``.
|
||||
|
||||
:param alpha: the relative weighting of precision and recall.
|
||||
Larger alpha biases the score towards the precision value,
|
||||
while smaller alpha biases the score towards the recall
|
||||
value. ``alpha`` should have a value in the range [0,1].
|
||||
:type alpha: float
|
||||
:rtype: float
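
As a worked illustration of the weighting, the standalone function
below applies the same formula to plain numbers. The name
``weighted_f_measure`` is hypothetical and used only in this sketch.

```python
def weighted_f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)


balanced = weighted_f_measure(0.5, 1.0)                  # 2pr / (p + r)
precision_only = weighted_f_measure(0.5, 1.0, alpha=1.0)
recall_only = weighted_f_measure(0.5, 1.0, alpha=0.0)
```

With ``alpha=0.5`` the result is the usual harmonic mean (about 0.667
here); ``alpha=1`` returns the precision and ``alpha=0`` returns the
recall.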
|
||||
"""
|
||||
self._updateMeasures()
|
||||
p = self.precision()
|
||||
r = self.recall()
|
||||
if p == 0 or r == 0: # what if alpha is 0 or 1?
|
||||
return 0
|
||||
return 1 / (alpha / p + (1 - alpha) / r)
|
||||
|
||||
def missed(self):
|
||||
"""
|
||||
Return the chunks which were included in the
|
||||
correct chunk structures, but not in the guessed chunk
|
||||
structures, listed in input order.
|
||||
|
||||
:rtype: list of chunks
|
||||
"""
|
||||
self._updateMeasures()
|
||||
chunks = list(self._fn)
|
||||
return [c[1] for c in chunks] # discard position information
|
||||
|
||||
def incorrect(self):
|
||||
"""
|
||||
Return the chunks which were included in the guessed chunk structures,
|
||||
but not in the correct chunk structures, listed in input order.
|
||||
|
||||
:rtype: list of chunks
|
||||
"""
|
||||
self._updateMeasures()
|
||||
chunks = list(self._fp)
|
||||
return [c[1] for c in chunks] # discard position information
|
||||
|
||||
def correct(self):
|
||||
"""
|
||||
Return the chunks which were included in the correct
|
||||
chunk structures, listed in input order.
|
||||
|
||||
:rtype: list of chunks
|
||||
"""
|
||||
chunks = list(self._correct)
|
||||
return [c[1] for c in chunks] # discard position information
|
||||
|
||||
def guessed(self):
|
||||
"""
|
||||
Return the chunks which were included in the guessed
|
||||
chunk structures, listed in input order.
|
||||
|
||||
:rtype: list of chunks
|
||||
"""
|
||||
chunks = list(self._guessed)
|
||||
return [c[1] for c in chunks] # discard position information
|
||||
|
||||
def __len__(self):
|
||||
self._updateMeasures()
|
||||
return self._tp_num + self._fn_num
|
||||
|
||||
def __repr__(self):
|
||||
"""
|
||||
Return a concise representation of this ``ChunkScore``.
|
||||
|
||||
:rtype: str
|
||||
"""
|
||||
return '<ChunkScoring of ' + repr(len(self)) + ' chunks>'
|
||||
|
||||
def __str__(self):
|
||||
"""
|
||||
Return a verbose representation of this ``ChunkScore``.
|
||||
This representation includes the precision, recall, and
|
||||
f-measure scores. For other information about the score,
|
||||
use the accessor methods (e.g., ``missed()`` and ``incorrect()``).
|
||||
|
||||
:rtype: str
|
||||
"""
|
||||
return (
|
||||
"ChunkParse score:\n"
|
||||
+ (" IOB Accuracy: {:5.1f}%\n".format(self.accuracy() * 100))
|
||||
+ (" Precision: {:5.1f}%\n".format(self.precision() * 100))
|
||||
+ (" Recall: {:5.1f}%\n".format(self.recall() * 100))
|
||||
+ (" F-Measure: {:5.1f}%".format(self.f_measure() * 100))
|
||||
)
|
||||
|
||||
|
||||
# Extract the chunks from t and give each one a unique id, (count, pos),
|
||||
# where pos is the absolute position of the first word of the chunk.
|
||||
def _chunksets(t, count, chunk_label):
|
||||
pos = 0
|
||||
chunks = []
|
||||
for child in t:
|
||||
if isinstance(child, Tree):
|
||||
if re.match(chunk_label, child.label()):
|
||||
chunks.append(((count, pos), child.freeze()))
|
||||
pos += len(child.leaves())
|
||||
else:
|
||||
pos += 1
|
||||
return set(chunks)
|
||||
|
||||
|
||||
def tagstr2tree(
|
||||
s, chunk_label="NP", root_label="S", sep='/', source_tagset=None, target_tagset=None
|
||||
):
|
||||
"""
|
||||
Divide a string of bracketed tagged text into
|
||||
chunks and unchunked tokens, and produce a Tree.
|
||||
Chunks are marked by square brackets (``[...]``). Words are
|
||||
delimited by whitespace, and each word should have the form
|
||||
``text/tag``. Words that do not contain a slash are
|
||||
assigned a ``tag`` of None.
|
||||
|
||||
:param s: The string to be converted
|
||||
:type s: str
|
||||
:param chunk_label: The label to use for chunk nodes
|
||||
:type chunk_label: str
|
||||
:param root_label: The label to use for the root of the tree
|
||||
:type root_label: str
|
||||
:rtype: Tree
|
||||
"""
|
||||
|
||||
WORD_OR_BRACKET = re.compile(r'\[|\]|[^\[\]\s]+')
|
||||
|
||||
stack = [Tree(root_label, [])]
|
||||
for match in WORD_OR_BRACKET.finditer(s):
|
||||
text = match.group()
|
||||
if text[0] == '[':
|
||||
if len(stack) != 1:
|
||||
raise ValueError('Unexpected [ at char {:d}'.format(match.start()))
|
||||
chunk = Tree(chunk_label, [])
|
||||
stack[-1].append(chunk)
|
||||
stack.append(chunk)
|
||||
elif text[0] == ']':
|
||||
if len(stack) != 2:
|
||||
raise ValueError('Unexpected ] at char {:d}'.format(match.start()))
|
||||
stack.pop()
|
||||
else:
|
||||
if sep is None:
|
||||
stack[-1].append(text)
|
||||
else:
|
||||
word, tag = str2tuple(text, sep)
|
||||
if source_tagset and target_tagset:
|
||||
tag = map_tag(source_tagset, target_tagset, tag)
|
||||
stack[-1].append((word, tag))
|
||||
|
||||
if len(stack) != 1:
|
||||
raise ValueError('Expected ] at char {:d}'.format(len(s)))
|
||||
return stack[0]
|
||||
|
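The tokenizing step of ``tagstr2tree`` can be tried on its own. The sketch below reuses the same ``WORD_OR_BRACKET`` pattern and shows the token stream the loop walks over. The call to ``rpartition`` is only a stand-in for the word/tag split (the function above delegates that to ``str2tuple``), and the sample sentence is made up for illustration.

```python
import re

# Same pattern as in tagstr2tree: an opening bracket, a closing bracket,
# or a run of characters that contains neither brackets nor whitespace.
WORD_OR_BRACKET = re.compile(r'\[|\]|[^\[\]\s]+')

s = "[ the/DT dog/NN ] barked/VBD ./."
tokens = WORD_OR_BRACKET.findall(s)

# Stand-in for the word/tag split: break each word token at its last '/'.
pairs = []
for tok in tokens:
    if tok in ('[', ']'):
        continue
    word, _, tag = tok.rpartition('/')
    pairs.append((word, tag))
```

Splitting at the last separator is what keeps a token such as ``./.`` intact as the word ``.`` with the tag ``.``.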
||||
|
||||
### CONLL
|
||||
|
||||
_LINE_RE = re.compile(r'(\S+)\s+(\S+)\s+([IOB])-?(\S+)?')
|
||||
|
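To see what ``_LINE_RE`` captures, the sketch below applies the same pattern to two made-up CoNLL lines. The name ``LINE_RE`` is a local copy used only for illustration.

```python
import re

# Local copy of the pattern: word, POS tag, IOB state, optional chunk type.
LINE_RE = re.compile(r'(\S+)\s+(\S+)\s+([IOB])-?(\S+)?')

# A token inside a noun phrase: the state and the chunk type are split apart.
inside = LINE_RE.match("research NN I-NP").groups()

# A token outside any chunk: the chunk type group does not participate.
outside = LINE_RE.match(". . O").groups()
```

For an ``O`` line the fourth group is ``None``, which is why ``conllstr2tree`` has to cope with a missing chunk type.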
||||
|
||||
def conllstr2tree(s, chunk_types=('NP', 'PP', 'VP'), root_label="S"):
|
||||
"""
|
||||
Return a chunk structure for a single sentence
|
||||
encoded in the given CONLL 2000 style string.
|
||||
This function converts a CoNLL IOB string into a tree.
|
||||
It uses the specified chunk types
|
||||
(defaults to NP, PP and VP), and creates a tree rooted at a node
|
||||
labeled S (by default).
|
||||
|
||||
:param s: The CoNLL string to be converted.
|
||||
:type s: str
|
||||
:param chunk_types: The chunk types to be converted.
|
||||
:type chunk_types: tuple
|
||||
:param root_label: The node label to use for the root.
|
||||
:type root_label: str
|
||||
:rtype: Tree
|
||||
"""
|
||||
|
||||
stack = [Tree(root_label, [])]
|
||||
|
||||
for lineno, line in enumerate(s.split('\n')):
|
||||
if not line.strip():
|
||||
continue
|
||||
|
||||
# Decode the line.
|
||||
match = _LINE_RE.match(line)
|
||||
if match is None:
|
||||
raise ValueError('Error on line {:d}'.format(lineno))
|
||||
(word, tag, state, chunk_type) = match.groups()
|
||||
|
||||
# If it's a chunk type we don't care about, treat it as O.
|
||||
if chunk_types is not None and chunk_type not in chunk_types:
|
||||
state = 'O'
|
||||
|
||||
# For "Begin"/"Outside", finish any completed chunks -
|
||||
# also do so for "Inside" tags whose chunk type doesn't match the open chunk.
|
||||
mismatch_I = state == 'I' and chunk_type != stack[-1].label()
|
||||
if state in 'BO' or mismatch_I:
|
||||
if len(stack) == 2:
|
||||
stack.pop()
|
||||
|
||||
# For "Begin", start a new chunk.
|
||||
if state == 'B' or mismatch_I:
|
||||
chunk = Tree(chunk_type, [])
|
||||
stack[-1].append(chunk)
|
||||
stack.append(chunk)
|
||||
|
||||
# Add the new word token.
|
||||
stack[-1].append((word, tag))
|
||||
|
||||
return stack[0]
|
||||
|
||||
|
||||
def tree2conlltags(t):
|
||||
"""
|
||||
Return a list of 3-tuples containing ``(word, tag, IOB-tag)``.
|
||||
Convert a tree to the CoNLL IOB tag format.
|
||||
|
||||
:param t: The tree to be converted.
|
||||
:type t: Tree
|
||||
:rtype: list(tuple)
|
||||
"""
|
||||
|
||||
tags = []
|
||||
for child in t:
|
||||
try:
|
||||
category = child.label()
|
||||
prefix = "B-"
|
||||
for contents in child:
|
||||
if isinstance(contents, Tree):
|
||||
raise ValueError(
|
||||
"Tree is too deeply nested to be printed in CoNLL format"
|
||||
)
|
||||
tags.append((contents[0], contents[1], prefix + category))
|
||||
prefix = "I-"
|
||||
except AttributeError:
|
||||
tags.append((child[0], child[1], "O"))
|
||||
return tags
|
||||
|
||||
|
||||
def conlltags2tree(
|
||||
sentence, chunk_types=('NP', 'PP', 'VP'), root_label='S', strict=False
|
||||
):
|
||||
"""
|
||||
Convert a sequence of ``(word, tag, IOB-tag)`` triples in CoNLL IOB format to a tree.
|
||||
"""
|
||||
tree = Tree(root_label, [])
|
||||
for (word, postag, chunktag) in sentence:
|
||||
if chunktag is None:
|
||||
if strict:
|
||||
raise ValueError("Bad conll tag sequence")
|
||||
else:
|
||||
# Treat as O
|
||||
tree.append((word, postag))
|
||||
elif chunktag.startswith('B-'):
|
||||
tree.append(Tree(chunktag[2:], [(word, postag)]))
|
||||
elif chunktag.startswith('I-'):
|
||||
if (
|
||||
len(tree) == 0
|
||||
or not isinstance(tree[-1], Tree)
|
||||
or tree[-1].label() != chunktag[2:]
|
||||
):
|
||||
if strict:
|
||||
raise ValueError("Bad conll tag sequence")
|
||||
else:
|
||||
# Treat as B-*
|
||||
tree.append(Tree(chunktag[2:], [(word, postag)]))
|
||||
else:
|
||||
tree[-1].append((word, postag))
|
||||
elif chunktag == 'O':
|
||||
tree.append((word, postag))
|
||||
else:
|
||||
raise ValueError("Bad conll tag {0!r}".format(chunktag))
|
||||
return tree
|
||||
|
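The grouping logic of ``conlltags2tree`` in its default non-strict mode can be sketched without ``Tree`` objects. The helper ``group_iob`` below is hypothetical: it represents each chunk as a ``(label, tokens)`` pair and each outside token as a pair with the label ``None``, and it leaves out the handling of a ``None`` chunk tag for brevity.

```python
def group_iob(sentence):
    """Group (word, pos, iob) triples into (label, tokens) pairs (hypothetical helper)."""
    groups = []
    for word, pos, iob in sentence:
        if iob == 'O':
            # Outside any chunk: keep the token on its own, with no label.
            groups.append((None, [(word, pos)]))
        elif iob.startswith('B-'):
            # Begin: always open a new chunk.
            groups.append((iob[2:], [(word, pos)]))
        elif iob.startswith('I-'):
            label = iob[2:]
            if groups and groups[-1][0] == label:
                # Inside: extend the chunk that is currently open.
                groups[-1][1].append((word, pos))
            else:
                # Non-strict mode: an unexpected I- tag is treated as B-.
                groups.append((label, [(word, pos)]))
        else:
            raise ValueError("Bad conll tag {0!r}".format(iob))
    return groups


sentence = [
    ("the", "DT", "B-NP"),
    ("dog", "NN", "I-NP"),
    ("barked", "VBD", "O"),
    ("loudly", "RB", "I-ADVP"),
]
groups = group_iob(sentence)
```

The last token carries an ``I-`` tag with no chunk open before it, so the sketch starts a new chunk, just as the non-strict branch above does.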
||||
|
||||
def tree2conllstr(t):
|
||||
"""
|
||||
Return a multiline string where each line contains a word, tag and IOB tag.
|
||||
Convert a tree to the CoNLL IOB string format.
|
||||
|
||||
:param t: The tree to be converted.
|
||||
:type t: Tree
|
||||
:rtype: str
|
||||
"""
|
||||
lines = [" ".join(token) for token in tree2conlltags(t)]
|
||||
return '\n'.join(lines)
|
||||
|
||||
|
||||
### IEER
|
||||
|
||||
_IEER_DOC_RE = re.compile(
|
||||
r'<DOC>\s*'
|
||||
r'(<DOCNO>\s*(?P<docno>.+?)\s*</DOCNO>\s*)?'
|
||||
r'(<DOCTYPE>\s*(?P<doctype>.+?)\s*</DOCTYPE>\s*)?'
|
||||
r'(<DATE_TIME>\s*(?P<date_time>.+?)\s*</DATE_TIME>\s*)?'
|
||||
r'<BODY>\s*'
|
||||
r'(<HEADLINE>\s*(?P<headline>.+?)\s*</HEADLINE>\s*)?'
|
||||
r'<TEXT>(?P<text>.*?)</TEXT>\s*'
|
||||
r'</BODY>\s*</DOC>\s*',
|
||||
re.DOTALL,
|
||||
)
|
||||
|
||||
_IEER_TYPE_RE = re.compile(r'<b_\w+\s+[^>]*?type="(?P<type>\w+)"')
|
||||
|
||||
|
||||
def _ieer_read_text(s, root_label):
|
||||
stack = [Tree(root_label, [])]
|
||||
# s will be None if there is no headline in the text
|
||||
# return the empty list in place of a Tree
|
||||
if s is None:
|
||||
return []
|
||||
for piece_m in re.finditer(r'<[^>]+>|[^\s<]+', s):
|
||||
piece = piece_m.group()
|
||||
try:
|
||||
if piece.startswith('<b_'):
|
||||
m = _IEER_TYPE_RE.match(piece)
|
||||
if m is None:
|
||||
raise ValueError('Unrecognized IEER tag: {!r}'.format(piece))
|
||||
chunk = Tree(m.group('type'), [])
|
||||
stack[-1].append(chunk)
|
||||
stack.append(chunk)
|
||||
elif piece.startswith('<e_'):
|
||||
stack.pop()
|
||||
# elif piece.startswith('<'):
|
||||
# print "ERROR:", piece
|
||||
# raise ValueError # Unexpected HTML
|
||||
else:
|
||||
stack[-1].append(piece)
|
||||
except (IndexError, ValueError):
|
||||
raise ValueError(
|
||||
'Bad IEER string (error at character {:d})'.format(piece_m.start())
|
||||
)
|
||||
if len(stack) != 1:
|
||||
raise ValueError('Bad IEER string')
|
||||
return stack[0]
|
||||
|
||||
|
||||
def ieerstr2tree(
|
||||
s,
|
||||
chunk_types=[
|
||||
'LOCATION',
|
||||
'ORGANIZATION',
|
||||
'PERSON',
|
||||
'DURATION',
|
||||
'DATE',
|
||||
'CARDINAL',
|
||||
'PERCENT',
|
||||
'MONEY',
|
||||
'MEASURE',
|
||||
],
|
||||
root_label="S",
|
||||
):
|
||||
"""
|
||||
Return a chunk structure containing the chunked tagged text that is
|
||||
encoded in the given IEER style string.
|
||||
Convert a string of chunked tagged text in the IEER named
|
||||
entity format into a chunk structure. Chunks are of several
|
||||
types, LOCATION, ORGANIZATION, PERSON, DURATION, DATE, CARDINAL,
|
||||
PERCENT, MONEY, and MEASURE.
|
||||
|
||||
:rtype: Tree, or a dict of document fields when ``s`` is a complete ``<DOC>``
|
||||
"""
|
||||
|
||||
# Try looking for a single document. If that doesn't work, then just
|
||||
# treat everything as if it was within the <TEXT>...</TEXT>.
|
||||
m = _IEER_DOC_RE.match(s)
|
||||
if m:
|
||||
return {
|
||||
'text': _ieer_read_text(m.group('text'), root_label),
|
||||
'docno': m.group('docno'),
|
||||
'doctype': m.group('doctype'),
|
||||
'date_time': m.group('date_time'),
|
||||
#'headline': m.group('headline')
|
||||
# we want to capture NEs in the headline too!
|
||||
'headline': _ieer_read_text(m.group('headline'), root_label),
|
||||
}
|
||||
else:
|
||||
return _ieer_read_text(s, root_label)
|
||||
|
||||
|
||||
def demo():
|
||||
|
||||
s = "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] ./."
|
||||
import nltk
|
||||
|
||||
t = nltk.chunk.tagstr2tree(s, chunk_label='NP')
|
||||
t.pprint()
|
||||
print()
|
||||
|
||||
s = """
|
||||
These DT B-NP
|
||||
research NN I-NP
|
||||
protocols NNS I-NP
|
||||
offer VBP B-VP
|
||||
to TO B-PP
|
||||
the DT B-NP
|
||||
patient NN I-NP
|
||||
not RB O
|
||||
only RB O
|
||||
the DT B-NP
|
||||
very RB I-NP
|
||||
best JJS I-NP
|
||||
therapy NN I-NP
|
||||
which WDT B-NP
|
||||
we PRP B-NP
|
||||
have VBP B-VP
|
||||
established VBN I-VP
|
||||
today NN B-NP
|
||||
but CC B-NP
|
||||
also RB I-NP
|
||||
the DT B-NP
|
||||
hope NN I-NP
|
||||
of IN B-PP
|
||||
something NN B-NP
|
||||
still RB B-ADJP
|
||||
better JJR I-ADJP
|
||||
. . O
|
||||
"""
|
||||
|
||||
conll_tree = conllstr2tree(s, chunk_types=('NP', 'PP'))
|
||||
conll_tree.pprint()
|
||||
|
||||
# Demonstrate CoNLL output
|
||||
print("CoNLL output:")
|
||||
print(nltk.chunk.tree2conllstr(conll_tree))
|
||||
print()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
demo()
|
||||