SlopeQ for the BNC

The first version of SlopeQ is currently available here: http://nkjp.uni.lodz.pl/PPHome/corpora/bnc.jsp.

The new version of SlopeQ for the BNC is available here: SlopeQ BNC 2.

You can also use the SlopeQ Desktop to access the BNC and other corpora.

SlopeQ 1

The good old SlopeQ /’sləʊpək/ for the BNC is available here.

Introduction

This is a help page for the British National Corpus (BNC) version of the SlopeQ concordancer. SlopeQ supports lexical queries for words, phrases and their variants in large language corpora. A similar concordancer is available for the National Corpus of Polish.

The BNC SlopeQ available here also supports part-of-speech sensitive queries and it’s particularly useful when searching for variable-order English idioms and multiword expressions.

Why yet another concordancer? I wrote this concordancer mainly for my students and colleagues because I needed a lightweight Web browser-based concordancer with a simple layout, which makes it easy to copy the results right into a spreadsheet, or better still share them instantly encoded in a compressed URL.

SlopeQ syntax

Single words

To find the occurrences of a single word or an exact phrase, simply type it in the search box and click the Search button or hit the Enter key. Don’t use any punctuation. Don’t use quotes for phrases like you would in Google.

You can specify how many examples you want per page using the Paging drop-down box, and whether you want to sort the results by the left or right context, which can be useful when looking for simple collocations of the search word. If you get more results than specified by the paging option, use the arrow buttons (» and «) to navigate through them.

Here is how you would search for: preponderance

Register analysis

In addition to sorting by the match and by the left and right contexts, you can sort the results by medium, genre and domain. I have used David Lee’s BNC File Descriptors (notes) to implement this feature. A local copy of those files is also available in the Attach tab of this page, just in case you can’t download them from the original address.

Here is how you would check the distribution of the word preponderance across different registers. Turns out it’s a rather fancy word used mainly by academics trying to impress the reader with their vocabulary range.

Better still, you can always click on the Profile button in the search form to get an overview of the BNC distribution of the matching results computed from up to 50 000 occurrences.

You can also apply the Medium, Genre and Domain filters to search only a specific subcorpus of the BNC. For example, you may restrict your search for occurrences of the word ludicrous to the spoken register only. Notice that the number of matched sentences drops from 424 to 28 after applying this filter.

And now for something completely different… You may find it very useful to know that by clicking on the URL button you can generate a direct link to the results of your current query with all the settings preserved. You can save this link, shorten it using services such as tinyurl.com, and share the results, or just keep them for future reference.

Variants

If you want to search for a number of orthographic, morphological or lexical variants in one go, you can use the pipe symbol to enumerate them, as in:

call|calls|calling|called
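If you keep your word forms in a list, a query like the one above can be assembled programmatically. The Python sketch below only builds the query string to paste into the search box. The helper name variant_query is made up for this illustration and is not part of SlopeQ.

```python
def variant_query(forms):
    """Join word forms into a SlopeQ-style variant term, e.g. 'call|calls'."""
    cleaned = []
    for form in forms:
        form = form.strip()
        if not form:
            continue  # ignore empty entries
        # A variant term is a single token, so spaces and pipes are rejected.
        if " " in form or "|" in form:
            raise ValueError(f"not a single word form: {form!r}")
        if form not in cleaned:  # drop duplicates, keep first-seen order
            cleaned.append(form)
    if not cleaned:
        raise ValueError("at least one word form is required")
    return "|".join(cleaned)


print(variant_query(["call", "calls", "calling", "called"]))
# prints: call|calls|calling|called
```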

Wild cards

The two wild cards ? and * can be used to represent variants, e.g.

c?ll

This query will fetch occurrences of both call and cell, and possibly other matches (e.g. cull), since the question mark stands for any single character.

The asterisk wild card, on the other hand, stands for zero or more characters. Thus,

cull*

will fetch matches like cull, culled, culls, culling, but also potential false positives such as Cullam or Culler (unless they are exactly what you wanted).
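To get a feel for what the two wild cards match, here is a small Python sketch that applies the same ? and * semantics to a toy word list using the standard fnmatch module. The word list is invented, and the case-insensitive comparison is an assumption based on the Cullam and Culler examples above. It illustrates the matching rules only, not how SlopeQ itself expands terms.

```python
from fnmatch import fnmatchcase

# Toy vocabulary, invented for this illustration (not BNC data).
VOCAB = ["call", "cell", "cool", "cull", "culled", "culling", "culls", "Culler"]


def expand(pattern, vocab=VOCAB):
    """Return the vocabulary items matched by a pattern with ? and * wild cards."""
    # ? matches exactly one character, * matches zero or more characters.
    # Both sides are lower-cased, assuming a case-insensitive index.
    return [term for term in vocab if fnmatchcase(term.lower(), pattern.lower())]


print(expand("c?ll"))   # ['call', 'cell', 'cull']
print(expand("cull*"))  # ['cull', 'culled', 'culling', 'culls', 'Culler']
```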

Unfortunately, you cannot currently use wild cards at the beginning of a string, as in “*ing”. Enabling this kind of search for the whole BNC (especially for single-word queries) would mean that if you type in “*ing”, the retrieval engine has to check every term in the index to see if it ends in ‘ing’, which could compromise the whole idea of having a fast index-based retrieval system. You can run this type of query with brute-force concordancers, such as WordSmith, but it’s not the smartest thing to enable for online searching.[#1]

What you can do for now is use some initial characters followed by a wild card, followed by whatever you like, e.g. free*ing. If the wild card produces too many possible term expansions, you will get a strange error message in Polish. Try reducing the ambiguity by adding literal characters, e.g. instead of using ‘c*ing’, use ‘ch*ing’.
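Why does a literal prefix help so much? The Python sketch below is a simplified model of a sorted term index, not SlopeQ’s or Lucene’s actual code. With a literal prefix, a binary search narrows the candidates to one slice of the index. With a leading wild card, every term has to be checked. The term list and the function name expand_with_prefix are assumptions made for this example.

```python
from bisect import bisect_left
from fnmatch import fnmatchcase


def expand_with_prefix(sorted_terms, pattern):
    """Return (matching terms, number of terms that had to be checked)."""
    # The literal prefix is everything before the first wild card.
    cut = min((i for i, ch in enumerate(pattern) if ch in "?*"), default=len(pattern))
    prefix = pattern[:cut]
    # Binary search for the slice of terms that start with the prefix.
    # Assumes no term contains a character at or above U+FFFF.
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, prefix + "\uffff")
    candidates = sorted_terms[lo:hi]
    matches = [t for t in candidates if fnmatchcase(t, pattern)]
    return matches, len(candidates)


# Invented mini-index; a real index holds hundreds of thousands of terms.
terms = sorted(["bring", "freeing", "freeze", "freezing", "king", "ring", "sing"])

print(expand_with_prefix(terms, "free*ing"))  # checks only the 3 terms starting with 'free'
print(expand_with_prefix(terms, "*ing"))      # empty prefix: checks all 7 terms
```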

POS Tagset

This feature is still a bit experimental, but you can make your queries part-of-speech sensitive. Let’s say you are interested in the occurrences of call as a noun in phrases like make a call, pick up a call, etc. Here is how you could do it:

call<NN1>

In order to indicate the part-of-speech of the search word, you need to put the appropriate POS symbol in angle brackets and append it right at the end of the word.

How do you know the right POS symbol? Simply look it up in the BNC Tagset.

Here is a better tip. Use wild cards. Most of the time, you won’t need to delve into the depths of the C5 tagset. Just remember that <N*> stands for any noun, <V*> stands for any verb, <AJ*> stands for any adjective, <AV*> for any adverb, <PRP> for any preposition, etc. So to search for call as a noun, you only need to type this much:

call<N*>

A more precise query for call as a noun would combine the POS syntax with the variant syntax introduced above:

call<N*>|calls<N*>

Needless to say, POS queries are only as good as the annotation the BNC comes with (allegedly > 95% accurate), so you will see occasional mistakes.
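The POS syntax can be read as a pair of patterns, one for the word and one for the tag. The Python sketch below parses a term such as call<N*>|calls<N*> and tests it against a (word, tag) pair. It is only an illustration of the syntax described here, under the assumption that a term with no word part (e.g. <AJ*>) accepts any word. The function names are invented and say nothing about SlopeQ’s internals.

```python
import re
from fnmatch import fnmatchcase

# word part (may be empty), optionally followed by <TAG>
TERM_RE = re.compile(r"^([^<>|]*)(?:<([^<>|]+)>)?$")


def parse_term(term):
    """Split 'call<N*>|calls<N*>' into [(word_pattern, tag_pattern), ...]."""
    alternatives = []
    for alt in term.split("|"):
        m = TERM_RE.match(alt)
        if not alt or m is None:
            raise ValueError(f"malformed term: {alt!r}")
        word, tag = m.group(1), m.group(2)
        # A missing word or tag is treated as 'anything' (assumption).
        alternatives.append((word or "*", tag or "*"))
    return alternatives


def term_matches(term, word, tag):
    """True if any alternative accepts this word form and C5 tag."""
    return any(
        fnmatchcase(word.lower(), w.lower()) and fnmatchcase(tag, t)
        for w, t in parse_term(term)
    )


print(term_matches("call<N*>", "call", "NN1"))             # True: singular noun
print(term_matches("call<N*>", "call", "VVB"))             # False: verb reading
print(term_matches("call<N*>|calls<N*>", "calls", "NN2"))  # True: plural noun
```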

Slop factor

To search for a phrase, simply type it in the query box. Do not use any punctuation or quotation marks. Here is where things get interesting. By default the slop factor is set to 0. This means that an exact match of the phrase is required. So if you type in something like: finish up, you will get matches with these two words following each other immediately in the order specified. However, you can relax these conditions to increase recall. For example, you can allow up to one ‘intruding’ word between finish and up by setting the slop value to 1. This will indeed increase the recall of your result set, because you will get examples like finished that up, at the expense of its precision, due to false positives such as finished pulling up. Check how it works in combination with the variant syntax you are already familiar with:

finish|finished|finishes|finishing up (slop=1)

More examples:

keep|kept|keeps|keeping in the dark about (slop=2)

give|gives|giving|given|gave the sack (slop=3)
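The effect of the slop factor can be modelled in a few lines of Python. The sketch below is a deliberately simplified model: it keeps the word order and treats the slop as the total number of intruding words allowed across the whole phrase. The real engine may count distances differently, and the function names are invented for this example.

```python
def parse_slots(query):
    """Turn 'finish|finished up' into [{'finish', 'finished'}, {'up'}]."""
    return [set(term.split("|")) for term in query.split()]


def find_matches(tokens, slots, slop=0):
    """Return (start, end) spans, end exclusive, of in-order matches."""
    spans = []

    def extend(slot_idx, pos, start, budget):
        if slot_idx == len(slots):
            spans.append((start, pos))
            return
        # Skip up to `budget` intruding tokens before the next slot.
        for gap in range(budget + 1):
            i = pos + gap
            if i >= len(tokens):
                break
            if tokens[i] in slots[slot_idx]:
                extend(slot_idx + 1, i + 1, start, budget - gap)

    for start, token in enumerate(tokens):
        if token in slots[0]:
            extend(1, start + 1, start, slop)
    return spans


slots = parse_slots("finish|finished|finishes|finishing up")
tokens = "she finished that up quickly".split()

print(find_matches(tokens, slots, slop=0))  # []
print(find_matches(tokens, slots, slop=1))  # [(1, 4)]
```

Note that the same setting also lets through a false positive like finished pulling up, which is the precision cost mentioned above.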

Relaxing the word order

To further increase the recall of variable-order multiword idioms, you can try unchecking the Preserve order box. As an example, to find the potential collocations of the words high and hopes:

high|higher|highest hope<N*>|hopes<N*> (slop=2, uncheck Preserve order)

Interestingly, this query results in matches such as: VW has high hopes for the Polo…, …hopes are too high… or hopes sky high, but also …high but distant hopes… and high as their hopes had been…. The last two occurrences wouldn’t have been matched, had the word order condition not been relaxed.
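Relaxed word order can be modelled in the same spirit. In the Python sketch below, a sentence matches if every term of the query is found, in any order, within a window of as many tokens as there are terms plus the slop. This window-based rule is an assumption made for illustration and may not agree with SlopeQ’s engine in every case. POS tags are left out to keep the example short.

```python
from itertools import permutations


def matches_unordered(tokens, slots, slop=0):
    """True if all slots are filled, in any order, within len(slots) + slop tokens."""
    n = len(slots)
    width = n + slop
    for start in range(len(tokens)):
        window = tokens[start:start + width]
        if len(window) < n:
            break  # too few tokens left to fill every slot
        # Try every assignment of distinct window positions to the slots.
        for positions in permutations(range(len(window)), n):
            if all(window[p] in slot for p, slot in zip(positions, slots)):
                return True
    return False


slots = [{"high", "higher", "highest"}, {"hope", "hopes"}]

print(matches_unordered("hopes are too high".split(), slots, slop=2))  # True
print(matches_unordered("hopes are too high".split(), slots, slop=1))  # False
```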

POS place-holders

Rather than just increasing the slop, you can specify a POS search term in order to strike a better balance between precision and recall in the search result sets. Let’s start with a simple example. To find the occurrences of the noun face premodified by any adjective or an -ing form of a lexical verb, use the following query:

<AJ*>|<VVG> face|faces (paging=200)

To get some occurrences of verbs followed by the phrase “on the dole” use:

<V*> on the dole

Here is how you can search for hate + to infinitive:

hate|hates|hating|hated to <V*>

To find occurrences of hate followed by a gerund, use:

hate|hates|hating|hated <VVG>

Finally, try these examples of queries with intruding POS place-holders:

strike|struck|striking|strikes a <AJ*> balance

face<N*> turn* <AJ*> (face as a noun + turn + adjective, set the slop to 1)

Grouping

Grouping is a really helpful feature when analysing large result sets. Let’s say you want to analyse the different senses of the adjective tricky by looking at a sample of its frequent nominal collocates. With this query, you can get the different nominal collocations of the adjective tricky and sort them by descending frequency. The frequency of a given collocation is given next to each grouped result. Remember to set the sort option to GroupCount if you want to order the results by descending frequency.

Due to performance issues, grouping is only applied on the sample limited by the paging factor set by the user, so you may have to page the results if your query returns more than 10,000 ungrouped hits.
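If you copy the ungrouped matches into a file or spreadsheet, the same kind of grouping is easy to reproduce offline. The Python sketch below counts identical matches and orders them by descending frequency, which is roughly what the GroupCount sort option shows. The sample matches are invented, not real BNC results, and the function name is made up for this example.

```python
from collections import Counter


def group_matches(matches):
    """Count identical matches (case-insensitively), most frequent first."""
    counts = Counter(m.lower() for m in matches)
    # Sort by descending count, then alphabetically to keep ties stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented sample of matched strings for a query on 'tricky' + noun.
sample = [
    "tricky question",
    "tricky business",
    "Tricky question",
    "tricky situation",
    "tricky question",
]

for match, count in group_matches(sample):
    print(count, match)
```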

Implementation

SlopeQ is powered by Apache Lucene with special adaptations in the query syntax and term highlighting.

Known issues

SlopeQ may choke if you give it a silly query which is likely to match almost every sentence in the corpus, such as <N*>. That’s because unlike other concordancers which only check the first n examples in a corpus, SlopeQ always attempts to calculate the total number of sentences containing a match.

Match highlighting needs some further work for variable-order queries (results are generally correct, but they could be rendered better in some cases).

slopeq.txt · Last modified: 2015/12/10 06:34 by pezik