Engine/Trustwords
A trustword list consists of 65536 words per ISO 639-1 language code, sorted in Unicode collation order.
Nota bene: the current “de” wordlist contains only 45324 words, and the “en” wordlist contains 51903 words. The entropy per trustword hence drops from the 16 bits of a full list to about 15.47 bits (de) or 15.66 bits (en).
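The per-word entropy is the base-2 logarithm of the list size, assuming every word in a list is equally likely to be chosen. A minimal Python sketch that recomputes the figures from the word counts quoted above (the function name and the dictionary are illustrative and not part of the engine):

```python
import math

# Word counts as quoted in the note above; "full" is the intended list size.
WORDLIST_SIZES = {"full": 65536, "de": 45324, "en": 51903}


def bits_per_word(size: int) -> float:
    """Entropy in bits of one word drawn uniformly from a list of `size` words."""
    return math.log2(size)


for name, size in WORDLIST_SIZES.items():
    print(f"{name}: {bits_per_word(size):.2f} bits per trustword")
```

The uniform-choice assumption matters: if some words are selected more often than others, the effective entropy is lower than this estimate.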
Trustwords are taken from dictionary files; which language subtag a list is built from is decided manually. For example, en.csv is built from en-US, not from en-GB. The choice falls on the variant that most people will accept without issues. It is a psychological decision.
The process to build a trustword list is the following:
- Use dic2csv.py --full to create a first list
- A native speaker has to manually edit this list, removing all swearwords
- The list
Questions:
- Why are these wordlists shorter than 64k words? Why do they contain hard-to-distinguish words like PENICILLIN and PENIZILLIN, BOXCALF and BOXKALF, etc.?
- “de.csv” contains uncommon words and derivations where simpler words are missing (BUCHBINDERHANDWERK is in the list, but BUCH is missing; BUCHUNGS is in the list, but BUCHUNG is missing; etc.)
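Both kinds of problem raised in these questions could be looked for mechanically before a native speaker reviews a list. The following Python sketch is only an illustration: the function names are hypothetical, the words are passed in as a plain list (how they are read from the CSV files is not shown), and the suffix check is a crude heuristic that will also flag legitimate words ending in the given suffix.

```python
from collections import defaultdict
from itertools import combinations


def one_letter_variants(words):
    """Return sorted pairs of same-length words that differ in exactly one position."""
    buckets = defaultdict(set)
    for word in set(words):
        for i in range(len(word)):
            # Mask position i; words sharing this key differ only at that position.
            buckets[(i, word[:i], word[i + 1:])].add(word)
    pairs = set()
    for group in buckets.values():
        pairs.update(combinations(sorted(group), 2))
    return sorted(pairs)


def missing_base_forms(words, suffixes=("S",)):
    """Return (derived, base) pairs where the derived word is listed but its base is not.

    Assumption: stripping a suffix yields the base form, which is only roughly true.
    """
    present = set(words)
    return sorted(
        (word, word[: -len(suffix)])
        for word in present
        for suffix in suffixes
        if word.endswith(suffix)
        and len(word) > len(suffix)
        and word[: -len(suffix)] not in present
    )


# Sample taken from the words mentioned in the questions above.
sample = ["PENICILLIN", "PENIZILLIN", "BOXCALF", "BOXKALF",
          "BUCHUNGS", "BUCHBINDERHANDWERK"]
print(one_letter_variants(sample))
print(missing_base_forms(sample))
```

On this sample the first check reports the PENICILLIN/PENIZILLIN and BOXCALF/BOXKALF pairs, and the second reports BUCHUNGS as listed without BUCHUNG. Anything either check finds still needs human judgement about which word to keep.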