class8.dh.bgu.2017

Lesson 8: Named entity recognition
TEI/Named entities
Sources:
David Nadeau, Satoshi Sekine: A survey of named entity recognition and classification
Christopher Manning's class, Information extraction and NER
Magdalena Turska's presentation on Encoding names and named entities
James Cummings' slides, Oxford Summer School 2010

Why
to facilitate a more detailed and explicit encoding of source documents (historical materials, for example) which are primarily of interest because they concern objects in the real world
to support the encoding of "data-centric" documents, such as authority files, biographical or geographical dictionaries, gazetteers etc.
to represent and model in a uniform way data which is only implicit in readings of many different documents

Reference Theory
Reference is a fundamental semiotic concept
We can talk about the real world using natural languages because we know that some types of word are closely associated with real, specific objects
Proper names and technical terms are canonical examples of this kind of word
Matt Elliott → a real-world entity; Lyon and River Thames → a specific place and a specific river respectively
When we translate between natural languages, the proper names usually don't change, or are conventionally equivalent

Avigur-Rotem, Gabriela (1946). Writer, poet and editor, born in Buenos Aires, Argentina. Immigrated to Israel in 1950. Holds a bachelor's degree in Hebrew literature and English literature and a teaching certificate from Tel Aviv University. During the 1990s she held various editorial positions: manuscript reader and editor of debut books at the Am Oved publishing house, and later manuscript reader of children's and young adult books for the Institute for the Translation of Hebrew Literature. She currently works as a lecturer and as an editor of non-fiction books at the University of Haifa Press. Gershon Shaked (2006), one of whose books Avigur-Rotem edited, praised her warmly: "I have had many editors in my lifetime, and she is certainly one of the finest among them: linguistic sensitivity, literary understanding and human wisdom merged in her and made her an editor... like the midwife who takes our baby at its birth, cleanses and cleans it and returns it to us cleansed and clean".

Background
Information extraction
Conference: Message Understanding Conference (MUC)
Named entities: identifying and classifying entities in texts
Uses:
translation
identifying organizations / commercial entities in a text as part of a sentiment analysis task

Named Entities
Definition (MUC-6, 1996): the subset of entities referred to by a rigid designator
Rigid designator: an expression that always refers to the same thing in all possible worlds
Task:
Identify named entities
Classify named entities

Shoshana Arbeli-Almozlino, Kiryat Anavim, "עשן הזמן" (Ashan HaZman), the automotive company created by Henry Ford in 1903: these are rigid designators.
This building, he, the tree over there are not, because they can stand for different entities, even within the same text.

Types of named entities
Personal names, Locations, Institutions (enamex)
Dates (2010, 14/3/12, a Hebrew-calendar date such as "ב בשבט"; but what about "in February it is worth buying elephants"?)
Currency
Genes, Medicines, Rivers, Mountains, ...
Wars, Events: the Second Aliyah
Whatever is in Wikipedia? Whatever is in DBpedia?
Named entity disambiguation

Locations: city, state, country
Person: politician, entertainer, editor, writer
Emails
In the literature lexicon, perhaps names of books? Or of journals? Publications?

Identifying name expressions with rules...
personal names, places, etc.
identifying phone numbers
identifying dates
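The rule-based idea can be sketched with regular expressions. The two patterns below (a numeric day/month/year date, and a phone number written as an area code, a hyphen and seven digits) and the labels DATE and PHONE are assumptions made for illustration, not rules taken from the lecture; a real rule set would be far larger.

```python
import re

# Hypothetical patterns, for illustration only.
DATE_RE = re.compile(r"\b(\d{1,2})[/.](\d{1,2})[/.](\d{2}|\d{4})\b")
PHONE_RE = re.compile(r"\b0\d{1,2}-\d{7}\b")


def find_rule_entities(text):
    """Return (label, matched_text) pairs found by the two rules, in text order."""
    hits = []
    for label, pattern in (("DATE", DATE_RE), ("PHONE", PHONE_RE)):
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group(0)))
    return [(label, matched) for _, label, matched in sorted(hits)]


print(find_rule_entities("Call 08-6461111 before 14/3/12."))
```

Rules like these are precise on the formats they anticipate and silent on everything else, which is one motivation for the learning approaches on the following slides.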

Supervised learning
Train with a corpus of labeled text (labels are entity types)
Annotated corpus
Influenced by the choice of features
Strongly influenced by the domain of the corpus
Expensive: a human needs to annotate

Encoding classes for sequence labeling
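As a minimal sketch, assuming the encodings meant here are the common IO and IOB (also called BIO) schemes: in IO every token inside an entity gets the entity type, while IOB marks the first token of each entity with B- and the rest with I-. The function name `encode_tags`, the span format and the example sentence are illustrative choices.

```python
def encode_tags(tokens, spans, scheme="IOB"):
    """Turn entity spans into one tag per token.

    spans: list of (start, end, entity_type) over token indices, end exclusive.
    scheme: "IO" (type only) or "IOB" (B- on the first token, I- on the rest).
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        for i in range(start, end):
            if scheme == "IO":
                tags[i] = entity_type
            else:
                prefix = "B-" if i == start else "I-"
                tags[i] = prefix + entity_type
    return tags


tokens = ["Fred", "showed", "Sue", "Mengqiu", "Huang", "'s", "new", "painting"]
spans = [(0, 1, "PER"), (2, 3, "PER"), (3, 5, "PER")]
print(encode_tags(tokens, spans, scheme="IO"))
print(encode_tags(tokens, spans, scheme="IOB"))
```

In the IO output the two adjacent names "Sue" and "Mengqiu Huang" merge into one run of PER tags; IOB keeps them apart at the cost of roughly twice as many labels.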

The ML sequence model approach to NER
Training
Collect a set of representative training documents
Label each token for its entity class or other (O)
Design feature extractors appropriate to the text and classes
Train a sequence classifier to predict the labels from the data
Testing
Receive a set of testing documents
Run sequence model inference to label each token
Appropriately output the recognized entities
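The last testing step, turning per-token labels back into entities, can be sketched as follows. The sketch assumes the labels use B-/I- prefixes with O for "other"; the function name `decode_entities` and the example tokens are hypothetical.

```python
def decode_entities(tokens, tags):
    """Group tokens labeled B-X / I-X into (entity_type, text) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # A B- tag, or an I- tag of a different type, starts a new entity.
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if current_type is not None and (prefix == "O" or starts_new):
            entities.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):
            if current_type is None:
                current_type = entity_type
            current_tokens.append(token)
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["Gershon", "Shaked", "praised", "Tel", "Aviv", "University"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG"]
print(decode_entities(tokens, tags))
```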

Independent Classifiers
Classify each word in isolation
Naïve Bayes model
Logistic regression
Decision tree
Support vector machine

Correlated Classifiers
Jointly classify all words while taking into account correlations between some labels
Hidden Markov Model
Conditional Random Field
Adjacent words (phrases) often have correlated labels
Identical words often have the same label
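To make "jointly classify all words" concrete, here is a minimal sketch of Viterbi decoding for a Hidden Markov Model. All probabilities below are hand-set, hypothetical numbers for a two-label toy model (O and PER); in practice they would be estimated from an annotated corpus.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, unknown_p=1e-6):
    """Return the most probable label sequence for tokens under an HMM."""
    # best[i][s] = (log-probability of the best path ending in s at i, previous state)
    first = tokens[0]
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s].get(first, unknown_p)), None)
        for s in states
    }]
    for token in tokens[1:]:
        row = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            row[s] = (score + math.log(emit_p[s].get(token, unknown_p)), prev)
        best.append(row)
    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy model, for illustration only.
STATES = ["O", "PER"]
START_P = {"O": 0.8, "PER": 0.2}
TRANS_P = {"O": {"O": 0.9, "PER": 0.1}, "PER": {"O": 0.5, "PER": 0.5}}
EMIT_P = {
    "O": {"praised": 0.3, "her": 0.3},
    "PER": {"Gershon": 0.4, "Shaked": 0.4},
}

print(viterbi(["Gershon", "Shaked", "praised", "her"], STATES, START_P, TRANS_P, EMIT_P))
```

An independent classifier would score each token on its own; here the transition probabilities let the label of one token influence its neighbours.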

Semi-supervised learning
Bootstrapping: a small degree of supervision
Seeds are searched for in a corpus, then contextual information is derived from them
The contextual information is then used to find different words in similar contexts, which are considered to be NEs of the same type
The process is then repeated multiple times
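The bootstrapping loop can be sketched in a few lines. This is a deliberately naive version under stated assumptions: a "context" is just the pair (previous word, next word), every word found in a known context is accepted, and the sentences and seed are made up for illustration. Real systems score and filter both contexts and candidates to limit drift.

```python
def bootstrap(sentences, seeds, rounds=2):
    """Grow a set of names from seeds by alternating context and name discovery."""
    known = set(seeds)
    for _ in range(rounds):
        # Step 1: collect the contexts in which known names occur.
        contexts = set()
        for sent in sentences:
            for i in range(1, len(sent) - 1):
                if sent[i] in known:
                    contexts.add((sent[i - 1], sent[i + 1]))
        # Step 2: collect every word that occurs in one of those contexts.
        found = set()
        for sent in sentences:
            for i in range(1, len(sent) - 1):
                if (sent[i - 1], sent[i + 1]) in contexts:
                    found.add(sent[i])
        if found <= known:
            break  # nothing new was learned, so stop early
        known |= found
    return known


SENTENCES = [
    ["born", "in", "Haifa", "and", "raised"],
    ["born", "in", "Lyon", "and", "educated"],
    ["lived", "near", "Lyon", "for", "years"],
    ["lived", "near", "Calais", "for", "years"],
]
print(sorted(bootstrap(SENTENCES, {"Haifa"}, rounds=2)))
```

The first round learns "Lyon" from the context (in, and); the second round learns the new context (near, for) from "Lyon" and uses it to find "Calais".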

Unsupervised learning
Techniques that rely on lexical resources (e.g., WordNet), on lexical patterns and on statistics computed on a large unannotated corpus
Might use simple heuristics, e.g., if a type is followed by the phrase "such as", the next word will probably be an NE of this type ("countries such as Germany")
The typical approach in unsupervised learning is clustering
For example, NEs can be gathered from clustered groups based on the similarity of their contexts
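The "such as" heuristic can be sketched with one regular expression. The pattern, the requirement that the candidate starts with a capital letter and the example sentence are assumptions for illustration.

```python
import re

# "<type word> such as <Capitalized word>", e.g. "countries such as Germany".
SUCH_AS_RE = re.compile(r"\b(\w+) such as ([A-Z]\w+)")


def such_as_candidates(text):
    """Return (candidate_entity, type_word) pairs suggested by the pattern."""
    return [(m.group(2), m.group(1)) for m in SUCH_AS_RE.finditer(text)]


print(such_as_candidates("He visited countries such as Germany, and rivers such as Thames."))
```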

Features
Characteristic attributes of words/phrases
The more features two words share, the more likely they are to have the same type
The choice of features is important for a system's performance
There are three types of features:
word-level features
list lookup features
document and corpus features

Word Level

Features for sequence labeling
Words
Current word (essentially like a learned dictionary)
Previous/next word (context)
Other kinds of inferred linguistic classification
Part-of-speech tags
Label context
Previous (and perhaps next) label
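A feature extractor for one token might look like the sketch below. It covers the current word, the previous and next words and the previous label; the capitalization and digit flags are added here as typical word-level features, which is an assumption, and part-of-speech tags are left out because they would require a tagger. The feature names are hypothetical.

```python
def token_features(tokens, i, prev_label):
    """Return a feature dictionary for tokens[i], given the previous token's label."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
        "is_capitalized": word[:1].isupper(),       # word-level feature (assumed)
        "has_digit": any(c.isdigit() for c in word),  # word-level feature (assumed)
        "prev_label": prev_label,
    }


print(token_features(["Gershon", "Shaked", "praised", "her"], 1, "B-PER"))
```

Dictionaries of this shape are what a sequence classifier is trained on, one per token.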

List lookup features gazetteer, lexicon and dictionary are often used interchangeably with the term list

exact match: the easiest way (a word is on the list or not), often too strict
stemming or lemmatizing: words are stripped of affixes
edit-distance: if a word is similar enough to one on the list, it counts
Soundex algorithm: words are compared by how they sound rather than how they are spelled
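Two of these matching strategies, exact match and edit distance, can be sketched together: with a distance threshold of 0 the lookup is an exact (case-insensitive) match, and a larger threshold tolerates small spelling differences. The gazetteer entries and the function names are illustrative assumptions; stemming and Soundex are not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                       # delete char_a
                cur[j - 1] + 1,                    # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def gazetteer_lookup(word, gazetteer, max_distance=0):
    """Return the first list entry within max_distance of word, or None."""
    lowered = word.lower()
    for entry in gazetteer:
        if edit_distance(lowered, entry.lower()) <= max_distance:
            return entry
    return None


PLACES = ["Beer-Sheva", "Haifa"]  # hypothetical gazetteer
print(gazetteer_lookup("Beer Sheva", PLACES))                  # exact match only
print(gazetteer_lookup("Beer Sheva", PLACES, max_distance=1))  # tolerate one edit
```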

Document and corpus features

Evaluations
Types of errors:
Wrong identification
Non-identification
Partial identification
Wrong label
MUC: correct type, exact text

The Named Entity Recognition Task

A combined measure: F
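Assuming the F meant here is the usual F-measure, it is the weighted harmonic mean of precision P and recall R: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with beta = 1 giving the balanced F1. The sketch below computes it from counts of true positives, false positives and false negatives; the function name and the example counts are illustrative.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positive, false positive and false negative counts."""
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# 8 correct entities, 2 spurious ones, 8 missed: P = 0.8, R = 0.5.
print(round(f_measure(tp=8, fp=2, fn=8), 3))
```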

Exact Match
Only NEs whose type and boundaries are both recognized correctly are counted
Systems are compared using the F-score (or F-measure)
This does not take into account that partially recognized NEs can already be useful; for a query in information retrieval, for example, it can be enough to find an NE in a sentence, and its exact boundaries are not required
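Exact-match counting can be sketched by representing each entity as a (start, end, type) triple and intersecting the gold and predicted sets. The representation and the example spans are assumptions for illustration.

```python
def exact_match_counts(gold, predicted):
    """Return (true_positives, false_positives, false_negatives) under exact match."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # same boundaries and same type
    return tp, len(predicted) - tp, len(gold) - tp


GOLD = {(0, 2, "PER"), (5, 8, "ORG"), (10, 11, "LOC")}
PREDICTED = {(0, 2, "PER"), (5, 7, "ORG"), (10, 11, "PER")}
print(exact_match_counts(GOLD, PREDICTED))
```

The ORG prediction with a wrong right boundary and the LOC entity labeled PER each count as both a false positive and a false negative, which is exactly the strictness the slide points out.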

ACE evaluation
Each entity type has its own worth; for example, a correct NE of the type person might be worth as much as two NEs of the type organization
This makes it possible to compensate for frequency effects (rare types are harder to detect, so giving them a high value rewards systems that can find them)
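The weighting idea alone can be sketched as a type-weighted recall. This is not the official ACE scoring procedure, which is considerably more involved; the weights, the function name and the example spans below are hypothetical.

```python
def weighted_recall(gold, predicted, weights):
    """Share of the total gold 'worth' that the system recovered exactly."""
    gold, predicted = set(gold), set(predicted)
    total = sum(weights[entity_type] for _, _, entity_type in gold)
    found = sum(weights[entity_type] for _, _, entity_type in gold & predicted)
    return found / total if total else 0.0


WEIGHTS = {"PER": 2.0, "ORG": 1.0}  # hypothetical: one person is worth two organizations
GOLD_NES = {(0, 2, "PER"), (5, 8, "ORG"), (9, 10, "ORG")}
FOUND_NES = {(0, 2, "PER")}
print(weighted_recall(GOLD_NES, FOUND_NES, WEIGHTS))
```

Finding one of three entities gives an unweighted recall of one third, but because the entity found is the higher-valued person, the weighted score is one half.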

http://dh2016.adho.org/abstracts/296 Identifying proper names in comics

TEI ways of marking up names and nominal expressions:
<rs> ("referring string"): any phrase which refers to a person or place, e.g. the girl you mentioned, my husband...
<name>: any lexical item recognized as a proper name, e.g. Siegfried Sassoon, Calais, John Doe...
<persName>, <placeName>, <orgName>: syntactic sugar for <name type="person"> etc.
A rich set of elements for the components of such nominal expressions, e.g. <surname>, <forename>, <geogName>, <geogFeat> etc.
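As a small illustration of how these elements nest, the sketch below builds one marked-up sentence with Python's standard xml.etree.ElementTree. The sentence itself is made up, and the TEI namespace and the surrounding document structure are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def pers_name(forename, surname):
    """Build <persName><forename>..</forename> <surname>..</surname></persName>."""
    element = ET.Element("persName")
    first = ET.SubElement(element, "forename")
    first.text = forename
    first.tail = " "  # the space between the two name parts
    last = ET.SubElement(element, "surname")
    last.text = surname
    return element


def tei_paragraph():
    """Return a paragraph with a personal name and a place name marked up."""
    p = ET.Element("p")
    p.text = "A letter from "
    person = pers_name("Siegfried", "Sassoon")
    person.tail = " sent from "
    p.append(person)
    place = ET.SubElement(p, "placeName")
    place.text = "Calais"
    place.tail = "."
    return ET.tostring(p, encoding="unicode")


print(tei_paragraph())
```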

http://dbpedia.org/page/nathan_alterman

<subfield code="a">university of the Negev (Beer-Sheva)</subfield>
<subfield code="9">lat</subfield>
</datafield>
<datafield tag="410" ind1="2" ind2=" ">
<subfield code="a">universitat Ha-Negev (Beer Sheva)</subfield>
<subfield code="9">lat</subfield>
</datafield>
<datafield tag="410" ind1="2" ind2=" ">
<subfield code="a">bgun</subfield>
<subfield code="9">lat</subfield>
</datafield>
<datafield tag="410" ind1="2" ind2=" ">
<subfield code="a">b.g.u.n.</subfield>
<subfield code="9">lat</subfield>
</datafield>
<datafield tag="410" ind1="1" ind2=" ">
<subfield code="a">באר שבע.</subfield>
<subfield code="b">בן גוריון בנגב</subfield>
<subfield code="9">heb</subfield>
</datafield>
<datafield tag="410" ind1="2" ind2=" ">
<subfield code="a"><<ה>> בנגב</subfield>
<subfield code="9">heb</subfield>
</datafield>
<datafield tag="410" ind1="1" ind2=" ">
מכון</subfield>
אוניברסיטת</subfield>
להשכלה גבוהה