The Annotator

This report started out as a README that would accompany the published feature of pattern.en.paternalism, an addition we wanted to propose to the Pattern natural language processing Python toolkit. The feature would detect if and to what extent a text could be considered ‘paternalist’.

Machine-learning algorithms that partially automate data processing still need to be trained for every new form, or every new kind of topic the algorithm might deal with. (…) Such work of alignment is not a bug — it is the condition of possibility for keeping humans and automation working in the same world. http://www.publicbooks.org/nonfiction/justice-for-data-janitor

Motivation

We developed the pattern.en.paternalism feature during Cqrrelations, a worksession that offered poetry to the statistician, science to the dissident and detox to the data-addict. Artists, academics, programmers and designers worked on impure, missing, invisible, broken or suspicious data. As we slowly got to grips with the practice of data-mining, and more specifically with the text-mining software package Pattern, we understood that this practice typically depends on the following interrelated elements:

→ The availability of abundant machine-readable data or sources
→ A corpus of pre-analysed and pre-parsed data, for testing and training purposes
→ A ‘Gold Standard’ derived from manually annotated corpora (validated by humans)
→ Standard parsing algorithms that can pre-process texts for more efficient analysis
→ A pattern recognition algorithm
→ Training software (Machine learning?) that allows the algorithms to be optimized against the Gold Standard

Pattern

Pattern is a popular text mining module for the Python programming language. The module is developed by CLiPS (Computational Linguistics & Psycholinguistics research center), associated with the Linguistics department of the faculty of Arts of the University of Antwerp. CLiPS states on their website that ‘Most of the CLiPS research is based on competitively acquired research funding’ and that its goal is ‘to produce internationally recognized top research’.

Pattern offers tools to process on-line sources such as Google search results, Tweets and Wikipedia pages. It includes tools for data mining, natural language processing, machine learning, network analysis and visualization. The module is licensed under a BSD license and comes with many examples, such as ‘Summarization’, ‘Style detection’ and ‘Finding negation and speculation’, which allowed us to interrogate many of the elements involved in the practice of text-mining.

The Annotator

From the start we were interested in how a Gold Standard is established, a paradoxical situation where human input is considered a source of truth, but made invisible. Annotation here means the manual work of ‘scoring’ large amounts of data that can then be used for ‘training’ algorithms. This scored data becomes a reference against which the algorithm is trained and tested. The Annotator is typically a student or Mechanical Turk worker. Sometimes the work has already been done for another reason, as in the case of the sentiment analysis algorithm, where the Gold Standard for deciding between positive or negative language patterns is based on a large corpus of movie reviews along with explicit ratings of the described movies.

In between the solution-oriented and mystifying descriptions of several algorithms for text-mining that we looked at, the actual conditions, context and work of annotation felt surprisingly undervalued and under-documented. Only in a few cases, and often hidden far away in software sources, did we find descriptions of the actual methods of annotation.

It seems that annotation always implies a contextual perspective. Scoring sources is also time consuming and boring; it can only speed up when the annotator does not doubt her opinions. Through the development of pattern.en.paternalism we wanted to both experience and challenge this practice. Our decision to work with a contested ‘polarity’ such as paternalism, was of course deliberate.

We wanted to:

→ Understand the work of annotation
→ Expose the place/role of the annotator
→ See what place dissent could have in text-mining

pattern.en.paternalism

The Annotators decided to work on a controversial topic, one that produces disagreement and would force each of the Annotators to question his or her understanding of the ‘common sense’ around it. The Annotators therefore took into account that the sources had to be interesting enough to spend time on, realizing that the desired outcome related to the subject we chose and the sources we selected. The data itself had to trigger discussion and debate. Paternalism had different connotations in each of the Annotators’ native languages, but the fact that the subject worked so well with the library’s name was the deciding factor.

The Annotators selected 20 sources for their dataset (see appendix). From these sources, 600 paragraphs were selected, meaning we did not use the ingestion tools available through Pattern because we were interested in specific data. For Gutenberg sources, paragraphs were automatically scraped. For Wikipedia sources, Annotators copy-pasted the paragraphs into a spreadsheet by hand. Paragraph titles and graphic elements were ignored.

We did not anonymize the dataset, and relied on the fact that Annotators could link individual records with specific authors. We decided to ask the Annotators to take the context of the source (date of publication, author etc.) into account while scoring the data for paternalism.

Instructions for The Annotators

The annotation process was guided by two desires. First of all we wanted to produce enough results in one day that would be usable in Pattern, but also to create space for discussing the interpretation of every paragraph and documenting these discussions. To this end we came up with the following guidelines:

→ Read each paragraph, and rate it on the level of paternalism. If you consider the statement not paternalist (this could mean the statement is considered: liberal, anarchist, anti-colonial, feminist …), score it with ‘-1’. If you think it is a neutral statement or a statement that can not be identified in relation to paternalism, score it ‘0’. A statement you consider paternalist, you mark ‘1’.
→ Mark paragraphs that contain ‘noise’ with a [lowercase] x in column G.
→ Each paragraph will be scored by at least two annotators. You are invited to discuss amongst each other.
→ Decide whether you want to take into account title, date of publishing, author or not. Report on this decision in the designated column.
→ Try to keep a log of the annotation process: At what time the score was given, in what circumstance/context, with what sentiment, what were your considerations.
→ Try to score at least 15 paragraphs per hour, but if you need more time that is OK.

About The Annotators

We made an attempt to anonymize The Annotators without ignoring their specific cultural backgrounds and particular interests. This means that we wanted to link individual annotations with additional information about the annotator’s point of view, to provide context to their scores.

Annotator 001 (f, 1982) is a French author living in Belgium. She is currently involved in research into the life of Anna Kavan, and is interested in digital writing.
Annotator 002 (f, 1969) is a Dutch designer/artist living in Belgium. She is a feminist and interested in tools, practice and Free Software.
Annotator 003 (m, 1990) is a Dutch artist living in The Netherlands. He is interested in infrastructures and networks.
Annotator 004 (f, 1988) is a French artist living in The Netherlands. She is interested in the physical location of the web, and enjoys the act of making web pages.
Annotator 005 (f, 1989) is a Dutch designer living in The Netherlands. She is interested in language philosophy and computational linguistics.
Annotator 006 (f, 1991) is a Romanian curator living in The Netherlands. She is interested in the conditional aspect of interfaces.
Annotator 007 (m, 1982) is a Hungarian researcher living in Spain. He is interested in collaborative production practices and cybernetics as an ideological formation.
Annotator 008 (m, 1984) wishes that he was 007. Why? Because 7, 8, 9… No 008 has an English mother-tongue and has been annotating considerably in professional and non-professional contexts (including with building ontological frameworks since 2008; fuck me, that was a long time ago). Besides, the reading material themes in 008’s remit are close to ‘natural reading environment’. He has been permitted to retire at 250, unless somebody is able to catch up with him!
Annotator 009 (f, 1975) is a French researcher/teacher living in France. She is interested in bots.

Method

Once The Annotators settled on the selected sources, guidelines and make-up of the annotation team, we started scoring the dataset. Some meta-notes on the annotation process:

When annotator 008 reached 50, he went back to the beginning and decided to make some of the earlier annotations more neutral. Some paragraphs can not be definitive without wider context – therefore he took a more conservative approach.

Annotator 003: Made most of my decisions based on ‘keywords’ that I thought the algorithm should learn to ‘flag’ as paternalist. Found it difficult to gauge the level of more factual texts.

Annotator 005: Wasn’t that familiar with the term ‘paternalism’; her understanding of it evolved during the process of classifying. Another strategy was applied later in the process: to focus more on writing style than on content, although this was not always possible. In the later classifications, comments are marked with a capital “W” for a focus on the writing style, and a capital “C” for a focus on the content.

Annotator 006: Took context (year of publication) into account.

Annotator 002: Felt it was most difficult to decide whether style or content should be taken into account.

The Removal of Pascal

Once the dataset was scored we could start establishing a classifier for detecting pat(t)ernalism.

While training our K-Nearest Neighbour algorithm, results seemed skewed towards a few French terms in the sources, most notably ‘autre’. Closer scrutiny revealed the term was part of a quote of Blaise Pascal used in ‘How to observe morals and manners’ by Harriet Martineau, one of the sources used. Since our algorithm was not performing according to our expectations, we decided to remove the paragraph that created the unwanted result. This is the sentence that was removed:

“Une différente coutume donnera d’autres principes naturels. Cela se voit par expérience; et s’il y en a d’ineffaçables à la coutume, il y en a aussi de la coutume ineffaçables à la nature.” (A different custom will cause different natural principles. This is seen in experience; and if there are some natural principles ineradicable by custom, there are also some customs opposed to nature, ineradicable by nature, or by a second custom.) http://www.gutenberg.org/files/18269/18269-h/18269-h.htm#p_92

Unfortunately The Removal of Pascal did not improve the performance of our algorithm.
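
To make the mechanics of such a classifier more tangible, here is a minimal sketch of a K-Nearest Neighbour text classifier in plain Python. It is not the code used during the worksession and it does not use Pattern; the function names, the cosine similarity measure and the toy paragraphs are all assumptions made for illustration. It does show why a single unusual paragraph can attract attention: every new text is labelled by a vote among the few training paragraphs whose words it shares most.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count whitespace-separated words."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(text, training, k=3):
    """Return the majority label among the k most similar training paragraphs."""
    query = bag_of_words(text)
    ranked = sorted(
        training,
        key=lambda item: cosine_similarity(query, bag_of_words(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy paragraphs (hypothetical, NOT taken from the annotated dataset).
# Labels follow the annotation scale: 1 paternalist, 0 neutral, -1 not paternalist.
training = [
    ("the state must protect workers from their own choices", 1),
    ("children and workers need guidance from the master", 1),
    ("every person is free to decide for herself", -1),
    ("workers organise themselves without any master", -1),
    ("the mine produced coal and iron", 0),
]

print(knn_classify("the master must protect the workers", training, k=3))
```

With only a handful of neighbours deciding each outcome, a paragraph with rare vocabulary, such as a French quotation in an English corpus, becomes the nearest neighbour of anything that happens to share one of its words.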

A process of normalization

When writing up this report a few months later, we remember how many times we were told that also in text-mining, ‘there is no free lunch’. Even when algorithms promise universal and indisputable outcomes, there is always a need to tailor data and its treatment to achieve it. Put differently, while the practice of text-mining seems full of normalizing processes, out there is supposedly a treasure trove of discoveries that we could not have dreamt up on our own.

Looking back on our modest experiment, we start to see how the process of creating an algorithm feeds into a self-fulfilling narrative of necessity and relevance in relation to desired, possible and applicable results. The removal of Pascal is just one example in our own process that included many moments of normalization.

For text-mining to work, normalization needs to happen on many interconnected levels. The available dataset needs to be aligned with the desired outcome (or the desired outcome needs to be aligned with the available sources). The Gold Standard needs to validate the training data, while the training data needs to validate the Gold Standard. For example, available sources include online reviews of goods, and desired outcomes include sentiment analysis of what people think of products.

Text-mining is an industry aimed at producing predictable, conventional and plausible results. In other words it is about avoiding exceptions, uncertainties and surprises. At the same time it promises to have overcome ideology and the need for models, but relies on the extrapolation of the common sense of The Annotator.

pattern.en.paternalism was developed by Catherine Lenoble, Anne Laforet, Femke Snelting, Roel Roscam Abbing, Manetta Berends, Julie Boschat Thorez, Cristina Cochior, Maxigas and Johnny xxx

Report delivered by: Roel Roscam Abbing and Femke Snelting

APPENDIX

Definitions of paternalism

From: https://en.wikipedia.org/wiki/Paternalism
Paternalism (or parentalism) is behavior, by a person, organization or state, which limits some person or group’s liberty or autonomy for that person’s or group’s own good. Paternalism can also imply that the behavior is against or regardless of the will of a person, or also that the behavior expresses an attitude of superiority.
The word paternalism is from the Latin pater for father, though paternalism should be distinguished from patriarchy. Some, such as John Stuart Mill, think paternalism to be appropriate towards children: “It is, perhaps, hardly necessary to say that this doctrine is meant to apply only to human beings in the maturity of their faculties. We are not speaking of children, or of young persons below the age which the law may fix as that of manhood or womanhood.” Paternalism towards adults is sometimes thought to treat them as if they were children.
Examples of paternalism include laws requiring the use of motorcycle helmets, a parent forbidding their children to engage in dangerous activities, and a psychiatrist confiscating sharp objects from someone who is suicidally depressed.

From: https://fr.wikipedia.org/wiki/Paternalisme
Le paternalisme est une doctrine politique qui définit comme moralement souhaitable qu’un agent privé ou public puisse décider à la place d’un autre pour son bien propre. Cette doctrine s’oppose au libéralisme.
Par exemple, quand l’État interdit aux agents de fumer ou de boire, il mène une politique paternaliste. D’un point de vue libéral, on ne peut pas chercher à faire le bien d’un individu contre son gré.
Le paternalisme est une attitude qui consiste à se conduire comme un père envers d’autres personnes sur lesquelles on exerce ou tente d’exercer une autorité. Cette attitude peut être volontaire, comme involontaire et inconsciente.
Ce terme est notamment utilisé dans des domaines comme l’économie, la morale ou la politique. On parle alors de paternalisme économique, moral, politique, social etc.
L’attitude paternaliste revient à considérer des adultes comme des enfants. Un paternaliste infantilise ceux sur qui il exerce, ou cherche à exercer, une autorité. À l’inverse que c’est parce que ceux-ci sont déjà infantiles que cela suscite en retour une tendance paternaliste.
(Translation: Paternalism is a political doctrine that holds it to be morally desirable that a private or public agent can decide in place of another for that other’s own good. This doctrine is opposed to liberalism. For example, when the State forbids agents to smoke or drink, it pursues a paternalist policy. From a liberal point of view, one cannot seek to do good to an individual against their will. Paternalism is an attitude that consists in behaving like a father towards other people over whom one exercises or tries to exercise authority. This attitude can be deliberate, or involuntary and unconscious. The term is used notably in fields such as economics, morality or politics. One then speaks of economic, moral, political, social paternalism, etc. The paternalist attitude amounts to regarding adults as children. A paternalist infantilizes those over whom he exercises, or seeks to exercise, authority. Conversely, that it is because they are already infantile that this provokes a paternalist tendency in return.)

From: https://nl.wikipedia.org/wiki/Paternalisme
Paternalisme verwijst naar een houding of beleid vergelijkbaar met het hiërarchische familiepatroon waarbij de vader (pater in het Latijn) aan het hoofd van de familie staat en de vader beslissingen neemt voor de andere familieleden (vrouw en kinderen), ook als die beslissing niet in overeenstemming is met wat zij wensen.
Paternalisme is het optreden van de overheid tegenover het volk, of van een overheersend volk in vreemd gebied (kolonie of vroegere kolonie) of van een gezaghebber als een vader of voogd die het goede met het volk, zijn kinderen of pupillen voorheeft, maar hen geen invloed van belang geeft op hun eigen aangelegenheden.
(Translation: Paternalism refers to an attitude or policy comparable to the hierarchical family pattern in which the father (pater in Latin) is at the head of the family and takes decisions for the other family members (wife and children), even when that decision does not accord with what they wish. Paternalism is the conduct of the government towards the people, or of a dominant people in foreign territory (a colony or former colony), or of a person in authority acting as a father or guardian who means well by the people, his children or wards, but gives them no influence of any significance over their own affairs.)

From: http://dexonline.ro/definitie/paternalism (there is no entry for Paternalism in the Romanian Wikipedia)
Paternalism s. n. 1. (Ec. pol.) Concepție care desemnează interesul pe care îl manifestă patronii pentru bunăstarea muncitorilor sau pentru atmosfera familială din întreprindere, raporturile dintre patroni și muncitori caracterizate prin afecțiune reciprocă, autoritate și respect. 2. Protecție, protejare, tutelare excesivă a propriului copil. – Din fr. Paternalisme.
(Translation: Paternalism, neuter noun. 1. (Political economy) A conception denoting the interest that employers show in the well-being of workers or in the family atmosphere within the enterprise, with relations between employers and workers characterized by mutual affection, authority and respect. 2. Excessive protection or tutelage of one’s own child. From French paternalisme.)

Meta-mining

After one day, 244 paragraphs were classified and ready for training.
Annotators disagreed on whether a paragraph was paternalist on 49 occasions, bringing the annotator disagreement rate to roughly 20.08% (49 of 244).

Group A (001, 004, 007):
Paragraphs scored: 174. 20 of those paragraphs were annotated by 1 person (and not taken into account)
Disagreements: 18
Noise: 7

Group B (002, 005, 008):
Paragraphs scored: 55. 5 of those paragraphs were annotated by 1 person (and not taken into account)
Disagreements: 21
Noise: 2

Group C (003, 006, 009):
Paragraphs scored: 61. 10 of those paragraphs were annotated by 1 person (and not taken into account)
Disagreements: 10
Noise: 2
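
The totals can be retraced from the three group summaries. The short calculation below copies the figures listed for Groups A, B and C; the interpretation that the 244 classified paragraphs are those scored by at least two annotators, minus the ones marked as noise, is our assumption, although it does reproduce the reported total.

```python
# Figures copied from the group summaries in this appendix.
groups = {
    "A": {"scored": 174, "single": 20, "disagreements": 18, "noise": 7},
    "B": {"scored": 55, "single": 5, "disagreements": 21, "noise": 2},
    "C": {"scored": 61, "single": 10, "disagreements": 10, "noise": 2},
}

# Assumption: "classified" means scored by at least two annotators
# and not marked as noise.
double_scored = sum(g["scored"] - g["single"] for g in groups.values())
noise = sum(g["noise"] for g in groups.values())
classified = double_scored - noise

disagreements = sum(g["disagreements"] for g in groups.values())
rate = 100 * disagreements / classified

print(classified, disagreements, round(rate, 2))  # → 244 49 20.08
```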

Annotation files

A : unique ID
B : url of the source
C : title of the source
D : year of publication
E : paragraph (content)
F : the ID number of the annotator
G : classifier (-1/0/1/x)
H : comment

name: main-the-annotator-paragraphs-[ID-number].ods
example: main-the-annotator-paragraphs-005.ods
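
For anyone who wants to process these files, the column layout could be mirrored in a small record type. This is only a sketch: the class, the helper functions and the decision to accept an empty classifier cell for paragraphs that were not yet annotated are hypothetical, and reading the .ods spreadsheet itself is left out.

```python
from dataclasses import dataclass

# Assumption: an empty cell in column G means "not annotated yet".
ALLOWED_CLASSIFIERS = {"-1", "0", "1", "x", ""}


@dataclass
class AnnotationRow:
    unique_id: str      # column A
    url: str            # column B
    title: str          # column C
    year: str           # column D (kept as text, some sources give a range)
    paragraph: str      # column E
    annotator_id: str   # column F
    classifier: str     # column G: "-1", "0", "1" or "x"
    comment: str        # column H


def row_from_cells(cells):
    """Build a row from a list of eight cell values in column order A to H."""
    if len(cells) != 8:
        raise ValueError("expected 8 cells (columns A-H), got %d" % len(cells))
    row = AnnotationRow(*[str(c).strip() for c in cells])
    if row.classifier not in ALLOWED_CLASSIFIERS:
        raise ValueError("unexpected classifier value: %r" % row.classifier)
    return row


def annotation_filename(annotator_number):
    """File name for one annotator, zero-padded to three digits."""
    return "main-the-annotator-paragraphs-%03d.ods" % annotator_number


print(annotation_filename(5))
```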

Annotation results

x = noise
d = disagreement
n = not annotated
p = annotated by 1 person
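
The legend above can be read as a small decision procedure applied to each paragraph. The function below is one possible reading, not the script that generated the result pages; in particular, the order of the checks (noise first, then single annotation, then disagreement) is an assumption.

```python
def combine_scores(scores):
    """Reduce the scores given to one paragraph to a single result code.

    `scores` is a list of strings, one per annotator: "-1", "0", "1" or "x".
    """
    if not scores:
        return "n"        # not annotated
    if "x" in scores:
        return "x"        # at least one annotator flagged noise (assumed priority)
    if len(scores) == 1:
        return "p"        # annotated by 1 person, not taken into account
    if len(set(scores)) > 1:
        return "d"        # annotators disagree, not used for training
    return scores[0]      # all annotators gave the same score


print(combine_scores(["1", "1"]))         # → 1
print(combine_scores(["1", "1", "0"]))    # → d
```

Only paragraphs that come out with a shared score end up in the training set; every other code marks a paragraph that drops out of it.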

All annotations:

https://gitorious.org/cqrrelations/cqrrelations/source/f86b6aec968a58103f59a931a31939d92906897f:share/the-annotator/all-annotations-abc.html

Comments

List of paragraphs that are classified as paternalistic, combined with the notes that were taken during the annotation process:
https://gitorious.org/cqrrelations/cqrrelations/source/f86b6aec968a58103f59a931a31939d92906897f:share/the-annotator/paternalism-classifications.html

Examples:

annotator 004 on paragraph #442 : “What gives Bernard Shaw the aptitude to reveal the deep nature of men and woman?”
annotator 007 on paragraph #549 : “Reduces questions of political agency to physiological problems.”
annotator 008 on paragraph #334 : “Analysis of individual’s history and philosophical outlook.”
annotator 003 on paragraph #416 : “strenously controlling sex”
annotator 006 on paragraph #416 : “For 1908 it raises feminist issues”
annotator 009 on paragraph #416 : “I write 0 not because it’s neutral but as a kind of balance as i couldn’t choose between -1 and 1. there are elements that can be considered emancipatory, against paternalism (the text is from 1908), but there are also elements which are paternalist as well.”

Disagreements

List of paragraphs that were disagreed on by The Annotators, and so are not taken into account in the training, combined with the notes that were taken during the annotation process:
https://gitorious.org/cqrrelations/cqrrelations/source/f86b6aec968a58103f59a931a31939d92906897f:share/the-annotator/disagreement-list-selection.html

Data

Gutenberg project

J. B. Bury, The Idea Of Progress, 1920, http://www.gutenberg.org/cache/epub/4557/pg4557.txt
Maud Churton Braby, Modern Marriage and How To Bear It, 1908, https://www.gutenberg.org/files/31529/31529-0.txt
Harriet Martineau, How to Observe Morals and Manners, 1838, http://www.gutenberg.org/cache/epub/33944/pg33944.txt
Irwin Edman, Human Traits and their Social Significance, 1920, http://www.gutenberg.org/cache/epub/22306/pg22306.txt
James Hayden Tufts, The Ethics of Cooperation, 1918, http://www.gutenberg.org/cache/epub/29508/pg29508.txt
James Harvey Robinson, The Mind in the Making: The Relation of Intelligence to Social Reform, 1921, http://www.gutenberg.org/cache/epub/8077/pg8077.txt
Helen Kendrick Johnson, Woman And The Republic, 1897, https://www.gutenberg.org/cache/epub/7300/pg7300.txt
Charles Darwin, On the Origin of species, 1859, http://www.gutenberg.org/cache/epub/1228/pg1228.txt
Emma Goldman, Anarchism and other essays, 1910, http://www.gutenberg.org/cache/epub/2162/pg2162.txt
John F. Hume, The Abolitionists (Together With Personal Memories Of The Struggle For Human Rights), 1830-1864, http://www.gutenberg.org/cache/epub/13176/pg13176.txt

Wikipedia

Mining: https://en.wikipedia.org/wiki/Mining
Textile Industry: https://en.wikipedia.org/wiki/Textile_industry
History of computing hardware: https://en.wikipedia.org/wiki/History_of_computing_hardware
Marissa Mayer: https://en.wikipedia.org/wiki/Marissa_Mayer
Larry Page: https://en.wikipedia.org/wiki/Larry_Page
Liberty: https://en.wikipedia.org/wiki/Liberty
Choice: https://en.wikipedia.org/wiki/Choice
Sabotage: http://en.wikipedia.org/wiki/Sabotage
Social Darwinism: http://en.wikipedia.org/wiki/Social_Darwinism
Anarchism: https://en.wikipedia.org/wiki/Anarchism