Building and Exploring Web Corpora (WAC3 - 2007)

Proceedings of the 3rd Web as Corpus Workshop, incorporating Cleaneval
First edition

WAC

More and more people are using Web data for linguistic and NLP research. The Web as Corpus workshop (WAC) provides a venue for exploring how we can use it effectively and the advancements to which this could lead. This book is a collection of the talks presented at the 3rd WAC in Louvain-la-Neuve (Belgium). The focus is on the description of Web corpus collection projects, the exploration of Web data characteristics from a linguistics/NLP perspective, and on the use of crawled Web data for NLP purposes.

CLEANEVAL

Any use of Web data requires that it be cleaned in order to get rid of unwanted material including, for example, HTML markup, navigation bars, and advertisements. To date there has been no sharing of resources or expertise in this particular domain and the cleaning has often been done minimally. Cleaneval was an exercise aimed at promoting collaboration and improving our understanding of the issues. Results and perspectives are presented in this book.


PDF - In English: 9.00 €
Paperback - In English: 19.70 €

Specifications


Publisher
Presses universitaires de Louvain
Title part
Number 4
Edited by
Cédrick Fairon, Hubert Naets, Adam Kilgarriff, Gilles-Maurice de Schryver
Series
Cahiers du CENTAL
Language
English
BISAC Subject Heading
LAN009000 LANGUAGE ARTS & DISCIPLINES / Linguistics
ONIX audience code
06 Professional and scholarly
CLIL (Version 2013-2019)
3147 Linguistics, Language sciences
Date of first publication of the title
01 January 2007
Type of work
Monograph
Original language
English

Paperback


Publication date
01 January 2007
ISBN-13
9782874630828
Extent
Number of main content pages: 182
Internal code
76399
Format
16 x 24 x 1 cm
Weight
510 grams
Price
19.70 €
ONIX XML
Version 2.1, Version 3

PDF


Publication date
01 January 2007
ISBN-13
9782874635045
Extent
Number of main content pages: 182
Internal code
76399PDF
ONIX XML
Version 2.1, Version 3

Contents


Table of Contents

Preface

WAC3

Kevin P. SCANNELL, The Crúbadán Project: Corpus building for underresourced languages

Sebastian BLOHM, Philipp CIMIANO, A Human Evaluation of Filtering Functions for Pattern-based Extraction of Arbitrary Relations from the Web

Emmanuel CARTIER, TextBox, a Written Corpus Tool for Linguistic Analysis

William H. FLETCHER, Implementing a BNC-Compare-able Web Corpus

Fabrice ISSAC, Yet Another Web Crawler

Igor LETURIA, Antton GURRUTXAGA, Iñaki ALEGRIA, Aitzol EZEIZA, CorpEus, a 'web as corpus' tool designed for the agglutinative nature of Basque

Serge SHAROFF, Classifying Web corpora into domain and genre using automatic feature identification

Anil Kumar SINGH, Jagadeesh GORLA, Identification of Languages and Encodings in a Multilingual Document

CLEANEVAL

Daniel BAUER, Judith DEGEN, Xiaoye DENG, Priska HERGER, Jan GASTHAUS, Eugenie GIESBRECHT, Lina JANSEN, Christin KALINA, Thorben KRÜGER, Robert MÄRTIN, Martin SCHMIDT, Simon SCHOLLER, Johannes STEGER, Egon STEMLE, Stefan EVERT, FIASCO: Filtering the Internet by Automatic Subtree Classification, Osnabrück

Stefan EVERT, StupidOS: A high-precision approach to boilerplate removal

Weizheng GAO, Tony ABOU-ASSALEH, GenieKnows Web Page Cleaning System

Christian GIRARDI, Htmcleaner: Extracting the Relevant Text from the Web Pages

Katja HOFMANN, Wouter WEERKAMP, Web Corpus Cleaning using Content and Structure

Michal MAREK, Pavel PECINA, Miroslav SPOUSTA, Web Page Cleaning with Conditional Random Fields

Xabier SARALEGI, Igor LETURIA, Kimatu, a tool for cleaning non-content text parts from HTML docs