Continuing a series of “test drives” of freely available English-language linguistic search engines, I turn to Lengusa, pointed out to me by a reader of Searching for Ludwig. Lengusa promotes itself as the first machine-learning-powered sentence search engine online, so I am curious to see whether the machine learning element gives it the edge over Ludwig and Sentence Stack.

(1) Search for a phrase or idiom in context

(2) Translate phrases to English (using Google Translate)

(3) Look up definition and synonyms for a single word

(4) Compare frequency of phrases

(5) Fill in the blank * search

(6)

(7) Compare frequency of single words

The numbering of the above list matches, in (1)–(5) and (7), the list given in Searching with Ludwig. Item (6) is left blank because it corresponds to a feature offered by Ludwig but not by Lengusa (paraphrase a sentence by substituting a synonym). For the shared features I will test the same examples as in the Ludwig and Sentence Stack posts, to help compare the resources.

(1) Searching for the idiom “up to here” as in “had enough” produces a list of 20 results for this sequence of words in the corpus of texts: 8 exact matches under the heading Sentence examples for up to here from high quality English sources; 6 exact matches under the heading Use up to here in a sentence together with 2 examples featuring the words up to here separated or in a different order; and 4 near examples featuring one or more of the words under the heading up to here sentence examples (one of these is the odd choice But like most a Jack’s ideas it never come to pass, from the New Yorker – expanding the context reveals this to be an extract from a short story written in imitation of dialect). The significance of the three headings (sentence examples for X, use X in a sentence, X sentence examples) remains opaque to me.

At the bottom of the page, clicking on the next-page icon brings up a further set of 20 examples, none of which has the words up to here contiguous. Proceeding to the next page of 20 sentences likewise brings no useful examples. I gave up going any further, as it seems probable that the initial pages display the best matches.

For each sentence there is an option to expand the context to give the surrounding sentences, obviating the need to locate the original source; an option to turn the “sentence to fragments”, which links each word of the sentence, enabling a rapid lookup of any unfamiliar words; and a loudspeaker icon that, should you so desire, lets you hear the sentence read aloud by a robotic voice.

As well as the sought-after idiom “(have it) up to here” (In the same speech, Mr Chavez said angrily that he had “had it up to here” with corruption and was contemplating “extraordinary measures” to deal with it, from the Economist, and I’ve had it up to here with you people! from the New Yorker), sentences are given in which this word combination bears a different meaning. For example, the sentence Sure, we are for Islamic self-esteem, but what on earth was Obama up to here? combines “to be up to” (be doing something) and “here”. In other words, the examples give all sentences with the given combination of words, with no division according to meaning. (This remark also applies to Ludwig and Sentence Stack.) Remaining meanings include “up to here” as in up to this point (“This success here has set an amazing platform for me,” Froome said yesterday, “going forward the experience of everyting I’ve done building up to here has really been a massive learning curve, as much as this Tour itself has been”, from the Independent – the vapid language of sports commentary here punctured by a typo to boot…).

Making the more restrictive search “have it up to here” gives 20 sentences, all near matches only; none are among those obtained with the search “up to here”, and none actually feature the idiom sought. Instead, a seemingly random array of near and not-so-near hits is presented in which (some of) the constituent words are sprinkled. (How ‘similar’ really is Suppose that, as incompatibilists might hold, had the causes of Elena’s decision determined that outcome, it would not have been up to Elena whether she decided to A?)

(2) Upon typing the Danish word hygge into the search box, 20 sentence examples are given in English featuring this recently adopted word (the Collins English Dictionary in 2016 declared hygge the runner-up, after Brexit, for English word of the year). There appears to be no way to signal that a translation of the Danish word is being sought. Even asking for a definition of the word as understood in English gives a blank entry. The pronunciation of hygge offered by the robot (the first syllable rhyming with high) differs from the various English pronunciations in actual use (try e.g. the entry for hygge in the Cambridge English Dictionary to hear these). That one of the sentence examples is Hygge (pronounced HOO-gah, like a football cheer in a Scandinavian accent) is the Danish word for cozy, from the New York Times, makes this only more unfortunate. The timeline of frequency of usage for hygge (as provided by Google) clearly does not distinguish in this case between English and Danish occurrences.

(3) Entering fun into the search box produces four definitions of fun as a noun (one as an attribute, functioning as an adjective), including a couple of synonyms for each meaning, and the derivation funny. (The informal Americanism to fun that featured in Sentence Stack and Ludwig does not appear here; the number of synonyms offered is fewer than in Ludwig’s dictionary and in the Google/Lexico dictionary used by Sentence Stack.) The WordNet database of lexical connections and in-context definitions appears to offer a succinct dictionary usefully different to those of Lexico and Ludwig.
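
For readers curious what this kind of WordNet lookup involves under the hood, here is a minimal sketch using Python’s NLTK interface to WordNet (my own illustration of the kind of resource Lengusa appears to draw on, not its actual code): it lists the noun senses of fun with their in-context definitions and synonyms.

```python
# Minimal sketch of a WordNet lookup for "fun" via NLTK, purely illustrative.
# Assumes NLTK is installed and the WordNet data has been fetched with
# nltk.download("wordnet").
from nltk.corpus import wordnet as wn

for synset in wn.synsets("fun", pos=wn.NOUN):
    definition = synset.definition()                        # in-context definition
    synonyms = sorted(set(synset.lemma_names()) - {"fun"})  # other lemmas in the same sense
    print(f"{synset.name()}: {definition}")
    print(f"  synonyms: {', '.join(synonyms) if synonyms else '(none)'}")
```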

(4) Under item (4) of Searching with Ludwig I discuss the question of whether to hyphenate such adverb-plus-participle phrases as “well known” and “well defined”. Entering well known VS well-known in the search box reports a frequency ratio of 50% to 50%. A list of 20 sentences covering both variants is given, of which 8 feature the unhyphenated form; the subsequent pages have 11 and 2 unhyphenated occurrences out of 20 each (so unhyphenated 21 times out of 60, or 35% of the time). At the bottom of the page, Lengusa’s machine-learned declaration is that “well known is more popular than well-known across all sources. The frequency of well known is 50.00%, while the frequency of well-known is 50.00%.” Inscrutable…
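
For transparency, the 35% figure above is nothing more than the pooled count across the three pages of results; the arithmetic, spelled out:

```python
# Pooled share of unhyphenated "well known" across three result pages,
# using the counts reported above (8, 11 and 2 out of 20 sentences per page).
unhyphenated_per_page = [8, 11, 2]
page_size = 20

total_unhyphenated = sum(unhyphenated_per_page)           # 21
total_sentences = page_size * len(unhyphenated_per_page)  # 60
print(f"{total_unhyphenated}/{total_sentences} = {total_unhyphenated / total_sentences:.0%}")
# 21/60 = 35%
```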

The search well defined VS well-defined reports a ratio of 50% to 50% for frequencies in the corpus, but only 12 of the first 100 example sentences feature the unhyphenated version, which seems to go against this statistic. Similarly to the case of well(-)known, the machine learner opines that “well defined is more popular than well-defined across all sources. The frequency of well defined is 50.00% , while the frequency of well-defined is 50.00%.”

The other frequency comparison I tested on Ludwig and Sentence Stack was of high interest vs of great interest. Lengusa gives 82% for the former and 18% for the latter, but the first page of 20 sentences all have the sense of “high interest rates”. Entering to be of high interest VS to be of great interest, in order to eliminate these irrelevant examples, then produces the statistic of 0% vs 100%, as opposed to Sentence Stack’s and Ludwig’s approx. 10% vs 90%. However, a look at the sentences offered as examples shows that only the first in fact features the phrase “to be of high/great interest” (with appropriate grammatical form of the verb); the rest are only near matches, and ‘near’ is putting it generously (e.g. “I will be out of the hospital and back home soon.”). The search be of high interest VS be of great interest does at least produce half a dozen relevant sentence examples (and the same 0% vs 100% statistic). As remarked for Sentence Stack and Ludwig, it would be helpful to be able to force the search to limit itself to contiguous occurrences of the search words forming a phrase (ideally with provision for grammatical variation in the words, e.g. “is” for “to be”).
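
The kind of search I am wishing for here (contiguous occurrences only, with some allowance for the verb to vary in form) is easy enough to sketch over a plain text corpus. The regular expression below is my own illustration, nothing Lengusa exposes, and the sample sentences are made up for the purpose:

```python
# Sketch: count contiguous occurrences of "<form of be> of high/great interest",
# allowing the verb "to be" to vary in form. Purely illustrative; the sample
# sentences stand in for a real corpus.
import re

PATTERN = re.compile(
    r"\b(?:be|is|are|was|were|been|being)\s+of\s+(high|great)\s+interest\b",
    re.IGNORECASE,
)

def compare(sentences):
    counts = {"high": 0, "great": 0}
    for sentence in sentences:
        for match in PATTERN.finditer(sentence):
            counts[match.group(1).lower()] += 1
    return counts

sample = [
    "The results are of great interest to linguists.",
    "This case was of high interest to the court.",
    "I will be out of the hospital and back home soon.",  # near match only: not counted
]
print(compare(sample))  # {'high': 1, 'great': 1}
```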

(5) Entering “come * with”, in which * could for instance be “up” (“come up with” = produce something, especially when pressured or challenged), “down” (“come down with” = to begin to suffer from [an illness]), “out” (“come out with” = say something in a sudden, rude, or incautious way) or “back” (“come back with” = to make a reply or response), produces on the first page a list of 20 example sentences, 13 of which feature come up with, 1 come back with, 1 come along with, and 3 Come Dine With (Me). All the example sentences have the exact word “come” (not e.g. comes or coming). Searching for “comes * with” produces on the first page a much more varied set of possibilities (comes across with, comes up with, comes down with, comes along with, comes through with, comes together with, comes out with, just to mention those where the missing word is a preposition; the others are not phrasal verbs but collocations such as comes complete with and comes equipped with, along with such contingent combinations as comes fortified with). Searching for “coming * with” produces example sentences featuring coming up with (12 out of 20), coming out with, coming down with, and coming forward with. As in Ludwig and Sentence Stack, multiple searches with variant grammatical forms are needed to obtain a representative sample of frequent combinations.
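
To make concrete what a fill-in-the-blank search is doing, here is a rough equivalent over a plain text sample: capture whatever single word fills the * slot in “come * with” and tally the results. Again, this is my own sketch (the text is invented), not Lengusa’s machinery.

```python
# Sketch: tally the words filling the * slot in "come * with" over a text sample.
# The sample text is made up for illustration; a real corpus would be far larger.
import re
from collections import Counter

text = (
    "She had to come up with a plan. He might come down with a cold. "
    "Did they come out with a statement, or come back with questions?"
)

# \w+ stands in for the single-word wildcard; only the exact form "come" is
# matched, mirroring Lengusa's behaviour of not covering "comes" or "coming".
slot_words = re.findall(r"\bcome (\w+) with\b", text, flags=re.IGNORECASE)
print(Counter(slot_words))  # Counter({'up': 1, 'down': 1, 'out': 1, 'back': 1})
```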

(7) Comparing (the frequency of) words is a special case of using VS – see (4) above. In Searching with Ludwig, item (7), I discuss the distinctions among the three variants different to/than/from. Searching for “different [to than]” in the corpus used by Lengusa gives a proportion of 54:46 in favour of “different to” (compared to Ludwig’s proportion of 3:1 and Sentence Stack’s proportion of 2:1 in favour of “different than”). All the example sentences in the first five pages (and more?) feature different to alone. (A separate search has to be made for different than.) You can also only search for two alternatives at a time (just as for the VS function): searching for “different [from to than]” is the same as searching for “different [from to]”. The latter search shows “different from” to be predominant, appearing over five times as often as “different to”. What conclusions you can reach from such statistics is unclear: looking at the example sentences may, or may not, lend support to the account given in Merriam-Webster’s discussion.
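
The two-alternatives-at-a-time limitation is a pity, since a comparison of all three variants in a single pass is straightforward in principle. A hedged sketch, again over an invented text sample rather than Lengusa’s corpus, of what a three-way report might look like:

```python
# Sketch: proportions of "different to/than/from" counted in one pass over a
# plain-text sample (made up here; any corpus of sentences would do).
import re
from collections import Counter

text = (
    "The result is different from what we expected, and rather different to "
    "last year's figures; some say it is no different than before."
)

counts = Counter(
    match.group(1).lower()
    for match in re.finditer(r"\bdifferent (to|than|from)\b", text, re.IGNORECASE)
)
total = sum(counts.values())
for variant, n in counts.most_common():
    print(f"different {variant}: {n}/{total} ({n / total:.0%})")
```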

Conclusion: Lengusa, Ludwig and Sentence Stack can all produce a helpful selection of sentences from a generally reliable corpus of English texts featuring a given word or phrase, but, for sometimes unfathomable reasons, they can also splurge out an unhelpful scattering of irrelevant sentences, and it can be hard to find illuminating examples for a given sense of the word or phrase in question. (While the dictionary definitions discriminate between meanings, the example sentences are not grouped according to these meanings; traditional dictionaries such as the Cambridge online dictionary offer more in this respect.)
Lengusa can potentially produce, for free, a limitless list of sentence examples (as opposed to Ludwig, which caps its free offerings) and does seem to restrict itself to a corpus of texts from more reliable sources than Sentence Stack allows. All three search engines suffer from ultimately treating words in isolation: ambiguity is then given free play, as only in context do words have a clear meaning. None of the three caters for simple grammatical variation in a word (e.g. the tense of a verb) when searching for occurrences in the corpus.
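
Catering for grammatical variation would essentially mean lemmatising both the query and the corpus before matching. A minimal sketch, assuming NLTK’s WordNet lemmatiser (again my own illustration, not anything the three engines do), of how come, comes, coming and came could be collapsed to a single search form:

```python
# Minimal sketch: collapse inflected verb forms to a lemma before matching, so
# that a search for "come" would also hit "comes", "coming" and "came".
# Assumes NLTK with the WordNet data downloaded (nltk.download("wordnet")).
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def verb_lemma(word):
    return lemmatizer.lemmatize(word.lower(), pos="v")

for form in ("come", "comes", "coming", "came"):
    print(form, "->", verb_lemma(form))  # all four print "come"
```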

Andrew Goodall
