- EDict is now used to find readings when there is a conversion error
- Better detection for non-kanji hints
master

7 changed files with 259863 additions and 38 deletions
@@ -0,0 +1,54 @@
import re

entryRegex = r'(.*)\s\[(.*)\]\s\/(.*)'
entryKatakanaRegex = r'(.*)\s\/(.*)'

edict = []

class EdictEntry:
    def __init__(self, word, reading, restOfEntry):
        self.word = word
        self.reading = reading
        self.restOfEntry = restOfEntry

    def __str__(self):
        return "{} = {} {}".format(self.word, self.reading, self.restOfEntry)

def findEntries(inputText):
    possibleEntries = []
    for entry in edict:
        if entry.word == inputText:
            possibleEntries.append(entry)

    if not possibleEntries:
        print('Failed to find "{}" in edict'.format(inputText))
    return possibleEntries

def loadEdict():
    global edict
    if not edict:
        print('Loading Edict...')
        with open('edict/edict', 'r', encoding='euc-jp') as edictFile:
            for line in edictFile:
                isKatakana = False
                entryMatch = re.search(entryRegex, line)
                if not entryMatch:
                    # Kana-only entries (e.g. loan words) have no bracketed reading
                    entryMatch = re.search(entryKatakanaRegex, line)
                    if not entryMatch:
                        print("Error: could not parse dictionary line: \n\t{}".format(line))
                        continue
                    else:
                        isKatakana = True
                if isKatakana:
                    edict.append(EdictEntry(entryMatch.group(1), entryMatch.group(1), entryMatch.group(2)))
                else:
                    edict.append(EdictEntry(entryMatch.group(1), entryMatch.group(2), entryMatch.group(3)))
        print('Loading Edict complete. {} entries found'.format(len(edict)))


loadEdict()

if __name__ == '__main__':
    # print(findEntries('一回目')[0].reading)
    print(findEntries('日中')[0].reading)
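Review note: findEntries walks the whole edict list on every call, so each lookup costs time proportional to the dictionary size. A minimal sketch of one alternative follows, assuming the entries can be grouped by headword once after loading. The names buildIndex and findEntriesIndexed are hypothetical and not part of this commit, the EdictEntry class is a stand-in for the one in the diff, and the sample entries are hand-made rather than read from the EDICT file.

```python
from collections import defaultdict


class EdictEntry:
    """Minimal stand-in for the EdictEntry class added in this commit."""

    def __init__(self, word, reading, restOfEntry):
        self.word = word
        self.reading = reading
        self.restOfEntry = restOfEntry


def buildIndex(entries):
    """Group entries by headword so that a lookup is a single dict access."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.word].append(entry)
    return index


def findEntriesIndexed(index, inputText):
    """Return every entry whose headword equals inputText (possibly none)."""
    # .get() avoids creating an empty list for words that are not present.
    return index.get(inputText, [])


# Hand-made sample data for illustration only; not loaded from the EDICT file.
sampleEntries = [
    EdictEntry('日中', 'にっちゅう', '(n) daytime/'),
    EdictEntry('日中', 'ひなか', '(n) daytime/'),
    EdictEntry('コンクール', 'コンクール', '(n) competition/'),
]
sampleIndex = buildIndex(sampleEntries)
```

Building the index is a one-off pass over the list; whether that trade is worthwhile depends on how many lookups the converter performs per run.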
File diff suppressed because it is too large
Binary file not shown.
@@ -0,0 +1,616 @@
|||
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
|||
<HTML> |
|||
<HEAD> |
|||
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=euc-jp"> |
|||
<META NAME="Generator" CONTENT="Jim's Markup Program - V0.99"> |
|||
<TITLE> JMdict/EDICT Project</TITLE> |
|||
</HEAD> |
|||
<BODY BGCOLOR="white"> |
|||
<!-- DO NOT EDIT!! |
|||
This HTML document was generated by the "markup" program. |
|||
Edit the original file instead. --> |
|||
<H1 ALIGN=CENTER> JMdict/EDICT </H1> |
|||
<P> |
|||
</P> |
|||
<H2 ALIGN=CENTER> JAPANESE/ENGLISH DICTIONARY PROJECT</H2> |
|||
<BASEFONT SIZE="3"> |
|||
<P> |
|||
<I>Copyright (C) 2017 </I> |
|||
<a HREF="http://www.edrdg.org/">The Electronic Dictionary Research and Development Group. </a> |
|||
</P> |
|||
<P> |
|||
<h2>Contents</h2> |
|||
<a href="#IREF00">INTRODUCTION</a> |
|||
<a href="#IREF01">CURRENT VERSION & DOWNLOAD</a> |
|||
<a href="#IREF01a">PROJECT FORUM </a> |
|||
<a href="#IREF01B">DATABASE and UPDATING </a> |
|||
<a href="#IREF02">FORMAT</a> |
|||
<a href="#IREF03">PROJECT HISTORY</a> |
|||
<a href="#IREF04">COPYRIGHT</a> |
|||
<a href="#IREF05">LEXICOGRAPHICAL DETAILS</a> |
|||
<a href="#IREF06">OTHER LANGUAGES</a> |
|||
<a href="#IREF08a">RELATED PROJECTS</a> |
|||
<a href="#IREF09">ACKNOWLEDGEMENTS</a> |
|||
<a href="#IREF10">PUBLICATIONS</a> |
|||
</P> |
|||
<P> |
|||
<a name="IREF00"><h2>INTRODUCTION</h2></a> |
|||
</P> |
|||
<P> |
|||
The JMdict/EDICT project has as its goal the production of a freely |
|||
available Japanese/English Dictionary in machine-readable form. |
|||
</P> |
|||
<P> |
|||
The project began in 1991 with the expansion of the "EDICT" simple |
|||
Japanese-English dictionary file. (See below under History) |
|||
</P> |
|||
<P> |
|||
At present the project has the following dictionary files available: |
|||
</P> |
|||
<UL> |
|||
<P> |
|||
</P> |
|||
<LI>the full JMdict file in XML format. The JMdict file is aimed at |
|||
being a multilingual lexical database with Japanese as the pivot language |
|||
and also includes |
|||
translations of words and phrases in a number of languages other |
|||
than English. More information is available from the |
|||
<a HREF="j_jmdict.html">JMdict overview page. </a> |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>the EDICT file, which contains a reduced amount of information, and |
|||
is provided to maintain support for software which uses the original |
|||
EDICT file format. |
|||
Note that this form of the dictionary is |
|||
now obsolete and is only being made available for legacy systems. All new |
|||
projects, apps, etc. are advised to use either the JMdict or EDICT2 formats; |
|||
<P> |
|||
A short |
|||
<a HREF="http://www.edrdg.org/jmdict/edict.html">EDICT overview page </a> |
|||
is available which lists some of the software which uses this file; |
|||
</P> |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>the EDICT2 file, which is in an expanded format and contains almost |
|||
all the information in the JMdict file; |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>the EDICT_SUB file, which contains about 20% of the most common |
|||
entries in the EDICT file. |
|||
</LI> |
|||
</UL> |
|||
<P> |
|||
The dictionary data is held in a database (details below) and new |
|||
editions of the JMdict and EDICT files are generated and distributed daily. |
|||
</P> |
|||
<P> |
|||
The files are copyright, and distributed in accordance with the |
|||
Licence Statement, which can be found at the WWW site of the
|||
<a HREF="http://www.edrdg.org/">Electronic Dictionary Research and Development Group </a> |
|||
who are the owners of the copyright. |
|||
</P> |
|||
<P> |
|||
<a name="IREF01"><h2>CURRENT VERSION & DOWNLOAD</h2> </a> |
|||
</P> |
|||
<P> |
|||
The project's master database is continuously being updated and new |
|||
versions of the files are generated daily. The date of generation is |
|||
included in the header of the files. |
|||
</P> |
|||
<P> |
|||
The files are currently distributed via the Monash University |
|||
<a HREF="http://ftp.monash.edu/pub/nihongo/00INDEX.html">ftp server, </a> |
|||
which also provides an rsync service. The main files available are: |
|||
</P> |
|||
<UL> |
|||
<LI> |
|||
<a HREF="http://ftp.monash.edu/pub/nihongo/JMdict.gz">JMdict.gz </a> |
|||
- the full JMdict file, including English, German, French, Russian and Dutch glosses; |
|||
</LI> |
|||
<LI> |
|||
<a HREF="http://ftp.monash.edu/pub/nihongo/JMdict_e.gz">JMdict_e.gz </a> |
|||
- the JMdict file with only English glosses; |
|||
</LI> |
|||
<LI> |
|||
<a HREF="http://ftp.monash.edu/pub/nihongo/edict.gz">edict.gz </a> |
|||
- the "traditional" EDICT file. |
|||
</LI> |
|||
<LI> |
|||
<a HREF="http://ftp.monash.edu/pub/nihongo/edict2.gz">edict2.gz </a> |
|||
- the extended EDICT2 file. |
|||
</LI> |
|||
</UL> |
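The downloads listed above are gzip-compressed. The following is a minimal sketch of reading such a file line by line from a local copy, assuming the EDICT file is EUC-JP encoded, as the loader in this commit also assumes. The function name iterEdictLines and the sample file name are hypothetical, and the demo writes its own one-line sample instead of downloading anything.

```python
import gzip
import os
import tempfile


def iterEdictLines(path):
    """Yield decoded lines from a gzip-compressed, EUC-JP encoded file."""
    with gzip.open(path, 'rt', encoding='euc-jp') as handle:
        for line in handle:
            yield line.rstrip('\n')


# Self-contained demo: write a one-line EUC-JP sample, then read it back.
sampleLine = 'コンクール /(n) competition (fre: concours)/contest/'
with tempfile.TemporaryDirectory() as tmpDir:
    samplePath = os.path.join(tmpDir, 'edict_sample.gz')
    with gzip.open(samplePath, 'wt', encoding='euc-jp') as out:
        out.write(sampleLine + '\n')
    decodedLines = list(iterEdictLines(samplePath))
```

Reading the compressed file directly avoids keeping an unpacked copy on disk; the JMdict files are UTF-8 rather than EUC-JP, so the encoding argument would need to change for them.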
|||
<P> |
|||
<a name="IREF01a"><h2>PROJECT FORUM</h2></a> |
|||
</P> |
|||
<P> |
|||
There are several forums where this project is actively discussed.
|||
</P> |
|||
<P> |
|||
The original forum was the |
|||
<TT> sci.lang.japan</TT> |
|||
<a HREF="http://groups.google.com/group/sci.lang.japan">Usenet newsgroup. </a> |
|||
More recently a |
|||
<a HREF="http://groups.yahoo.com/group/edict-jmdict/">mailing list </a> |
|||
specifically for project discussion has begun. (Mail to |
|||
<TT> edict-jmdict-subscribe@yahoogroups.com</TT> |
|||
to initiate subscription.) |
|||
</P> |
|||
<P> |
|||
<a name="IREF01B"><h2>DATABASE and UPDATING</h2></a> |
|||
</P> |
|||
<P> |
|||
The dictionary data is all held in a PostgreSQL database and maintained |
|||
using the |
|||
<a HREF="http://www.edrdg.org/wiki/index.php/JMdictDB_Project">JMdictDB online system. </a> |
|||
The JMdict version is generated directly from the database. From this |
|||
the EDICT/EDICT2 versions are generated using utility software. |
|||
You can explore the database and propose |
|||
edits and new entries via its |
|||
<a HREF="http://www.edrdg.org/jmdictdb/cgi-bin/srchform.py?svc=jmdict&sid=">Search Form. </a> |
|||
</P> |
|||
<P> |
|||
The |
|||
<a HREF="http://www.edrdg.org/wiki/index.php/Main_Page#The_JMdict.2FEDICT_Project">EDRDG Wiki </a> |
|||
has a wealth of information about the dictionary database, including suggestions about
|||
<a HREF="http://www.edrdg.org/wiki/index.php/JMdict:_Getting_Started">getting started, </a> |
|||
the detailed |
|||
<a HREF="http://www.edrdg.org/wiki/index.php/Editorial_policy">editorial policy and guidelines, </a> |
|||
etc. etc. |
|||
</P> |
|||
<P> |
|||
<a name="IREF02"><h2>FORMAT</h2></a> |
|||
</P> |
|||
<P> |
|||
The basic format of the entries in the dictionary files can be seen in |
|||
detail by examining the |
|||
<a HREF="http://www.edrdg.org/jmdict/jmdict_dtd_h.html">DTD </a> |
|||
(Document Type Declaration) of the XML-format JMdict file. The DTD is |
|||
heavily annotated with content and structural information. |
|||
<a HREF="dtd-jmdict.xml">(download) </a> |
|||
</P> |
|||
<P> |
|||
In summary, each dictionary entry is independent, although there may |
|||
be cross-reference fields pointing to other entries. Each entry consists of |
|||
</P> |
|||
<OL type="a"> |
|||
<P> |
|||
</P> |
|||
<LI>kanji elements, i.e. headwords containing at least one kanji character, |
|||
plus associated tags indicating some status or characteristic of the |
|||
headword. Where there are multiple headwords, they have been ordered |
|||
according to frequency of usage, as far as this can be determined; |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>reading elements, containing either the reading in kana of the headword, |
|||
or the headword itself in the case of headwords only in kana. The elements |
|||
also include tags indicating some status or characteristics. As with the |
|||
kanji headwords, where there are multiple readings they have been ordered |
|||
according to frequency of usage, as far as this can be determined; |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>general coded information relating to the entry as a whole, such as |
|||
original language, date-of-creation, etc. |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>sense elements, containing the translational equivalents or glosses of |
|||
the headword(s). As Japanese is not highly polysemous, there is often only |
|||
one sense. Associated with the sense elements is other coded data indicating |
|||
the part-of-speech, field of application, miscellaneous information, etc. |
|||
As with headwords and readings, the glosses are ordered with the most common |
|||
appearing first. |
|||
</LI> |
|||
</OL> |
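The entry structure described above can be sketched with the Python standard library as follows. This is an illustration under stated assumptions, not the project's own tooling: the XML fragment is hand-written, its sequence number and contents are made up for the example, and summarizeEntry is a hypothetical helper. The element names (ent_seq, k_ele/keb, r_ele/reb, ke_pri, re_pri, sense, gloss) follow the JMdict DTD.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration; values are not copied from JMdict.
SAMPLE_XML = """<JMdict>
<entry>
<ent_seq>1234567</ent_seq>
<k_ele><keb>日中</keb><ke_pri>news1</ke_pri></k_ele>
<r_ele><reb>にっちゅう</reb><re_pri>news1</re_pri></r_ele>
<r_ele><reb>ひなか</reb></r_ele>
<sense><gloss>daytime</gloss><gloss>during the day</gloss></sense>
</entry>
</JMdict>"""


def summarizeEntry(entryElement):
    """Collect kanji headwords, readings and glosses for one <entry>."""
    return {
        'seq': entryElement.findtext('ent_seq'),
        'kanji': [keb.text for keb in entryElement.findall('k_ele/keb')],
        'readings': [reb.text for reb in entryElement.findall('r_ele/reb')],
        # One inner list per sense, each holding that sense's glosses.
        'senses': [
            [gloss.text for gloss in sense.findall('gloss')]
            for sense in entryElement.findall('sense')
        ],
    }


root = ET.fromstring(SAMPLE_XML)
summaries = [summarizeEntry(entry) for entry in root.findall('entry')]
```

Because findall returns elements in document order, the headword and reading lists keep the ordering of the file, which the text says reflects frequency of usage where it can be determined.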
|||
<P> |
|||
The format and coding of the distributed files is as follows: |
|||
</P> |
|||
<OL type="a"> |
|||
<P> |
|||
</P> |
|||
<LI>the JMdict file contains the complete dictionary information |
|||
in XML format as per the |
|||
<a HREF="http://www.edrdg.org/jmdict/jmdict_dtd_h.html">DTD. </a> |
|||
This file is in Unicode/ISO-10646 coding using UTF-8 encapsulation. |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>the EDICT file is in a relatively simple format based on the text data |
|||
file of the SKK input-method. Each entry is in the form: |
|||
<P> |
|||
</P> |
|||
<DL><DD> |
|||
KANJI [KANA] /(general information) gloss/gloss/.../ |
|||
</DL> |
|||
<P> |
|||
or |
|||
</P> |
|||
<P> |
|||
</P> |
|||
<DL><DD> |
|||
KANA /(general information) gloss/gloss/.../ |
|||
</DL> |
|||
<P> |
|||
Where there are multiple senses, these are indicated by (1), (2), etc. |
|||
before the first gloss in each sense. As this format only allows a single |
|||
kanji headword and reading, entries are generated for each possible |
|||
headword/reading combination. As the format restricts Japanese characters |
|||
to the kanji and kana fields, any cross-reference data and other |
|||
informational fields are omitted. |
|||
</P> |
|||
<P> |
|||
The EDICT file is distributed in JIS X 0208 coding in EUC-JP encapsulation; |
|||
</P> |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>the EDICT2 file is in an expanded form of the original EDICT format. |
|||
The main differences are the inclusion of multiple kanji headwords and |
|||
readings, and the inclusion of cross-reference and other information |
|||
fields, e.g.: |
|||
<P> |
|||
</P> |
|||
<DL><DD> |
|||
KANJI-1;KANJI-2 [KANA-1;KANA-2] /(general information) (see xxxx) gloss/gloss/.../ |
|||
</DL> |
|||
<P> |
|||
In addition, the EDICT2 has as its last field the sequence number of the |
|||
entry. This matches the "ent_seq" entity value in the XML edition. The |
|||
field has the format: EntLnnnnnnnnX. The EntL is a unique string to help |
|||
identify the field. The "X", if present, indicates that an audio clip |
|||
of the entry reading is available from the JapanesePod101.com site. |
|||
</P> |
|||
<P> |
|||
The EDICT2 file is distributed in JIS X 0208 and JIS X 0212 codings in EUC-JP |
|||
encapsulation; |
|||
</P> |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>the EDICT_SUB file is in the same format as the EDICT file. |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
</OL> |
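The two line layouts above, plus the EDICT2 sequence field, can be parsed roughly as follows. This is a simplified sketch: parseLine and the patterns are hypothetical names, the patterns assume headwords contain no spaces, any tags attached to EDICT2 headwords or readings are left in place, and the second sample line with its sequence number is made up for the example.

```python
import re

# "KANJI [KANA] /gloss/gloss/.../" and "KANA /gloss/gloss/.../"
KANJI_LINE = re.compile(r'^(?P<head>\S+)\s\[(?P<kana>[^\]]*)\]\s/(?P<body>.*)/\s*$')
KANA_LINE = re.compile(r'^(?P<head>\S+)\s/(?P<body>.*)/\s*$')
# EDICT2 only: last field "EntLnnnnnnnn", with an optional trailing "X".
ENT_SEQ = re.compile(r'^EntL(?P<seq>\d+)(?P<audio>X?)$')


def parseLine(line):
    """Split one EDICT or EDICT2 line into its fields, or return None."""
    match = KANJI_LINE.match(line)
    if match:
        kana = match.group('kana')
    else:
        match = KANA_LINE.match(line)
        if not match:
            return None
        # Kana-only entries: the headword is its own reading.
        kana = match.group('head')
    fields = match.group('body').split('/')
    entry = {
        'headwords': match.group('head').split(';'),
        'readings': kana.split(';'),
        'glosses': fields,
        'seq': None,
        'hasAudio': False,
    }
    seqMatch = ENT_SEQ.match(fields[-1])
    if seqMatch:
        entry['glosses'] = fields[:-1]
        entry['seq'] = int(seqMatch.group('seq'))
        entry['hasAudio'] = seqMatch.group('audio') == 'X'
    return entry


kanaEntry = parseLine('コンクール /(n) competition (fre: concours)/contest/')
edict2Entry = parseLine('日中 [にっちゅう;ひなか] /(n) daytime/EntL1234567X/')
```

Splitting the headword and reading fields on ";" is harmless for plain EDICT lines, which carry a single headword and reading, so one function can serve both formats.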
|||
<P> |
|||
None of the files have the entries in any particular order. |
|||
</P> |
|||
<P> |
|||
<a name="IREF03"><h2>PROJECT HISTORY</h2></a> |
|||
</P> |
|||
<P> |
|||
The project was begun in 1991 by the current editor |
|||
<a HREF="http://nihongo.monash.edu/">(Jim Breen) </a> |
|||
when an early DOS-based Japanese word-processor |
|||
(MOKE - Mark's Own Kanji Editor) was released, containing an initial |
|||
small version of the EDICT file. This was progressively expanded and edited over |
|||
the following years. In 1999 the EDICT, which by this time contained |
|||
about 60,000 entries, was converted into an expanded format and the first |
|||
XML-format JMdict file released. From that point both JMdict and EDICT |
|||
have been generated from the same source data. |
|||
</P> |
|||
<P> |
|||
The EDICT2 format was created in 2003, primarily for use with the |
|||
<a HREF="http://nihongo.monash.edu/cgi-bin/wwwjdic.cgi?1C">WWWJDIC </a> |
|||
dictionary server. |
|||
</P> |
|||
<P> |
|||
The growth in entries in the file is largely due to the efforts of Jim and the |
|||
many people who contributed entries to it over the years. The increase in entry |
|||
numbers has slowed as the file has achieved coverage of a large proportion |
|||
of the Japanese lexicon. Much of the editorial work in recent years has |
|||
concentrated on amendments and expansion to existing entries. |
|||
</P> |
|||
<P> |
|||
A more expanded explanation of the early developments in the EDICT file |
|||
can be found in the |
|||
<a HREF="http://www.edrdg.org/jmdict/edict_doc_old.html">original documentation. </a> |
|||
</P> |
|||
<P> |
|||
<a name="IREF04"><h2>COPYRIGHT</h2></a> |
|||
</P> |
|||
<P> |
|||
Dictionary copyright is a difficult point, because clearly the first |
|||
lexicographer who published "inu means dog" could not claim a copyright |
|||
violation over all subsequent Japanese dictionaries. While it is usual to |
|||
consult other dictionaries for "accurate lexicographic information", as |
|||
Nelson put it, wholesale copying is, of course, not permissible, and |
|||
contributors have been advised to avoid direct copying from other sources. |
|||
What makes |
|||
each dictionary unique (and copyright-able) is the particular selection of |
|||
words, the phrasing of the meanings, the presentation of the contents (a very |
|||
important point in the case of this project), and the means of publication. |
|||
</P> |
|||
<P> |
|||
The files of the project are copyright, and distributed in accordance with the |
|||
Licence Statement, which can be found at the WWW site of the
|||
<a HREF="http://www.edrdg.org/">Electronic Dictionary Research and Development Group </a> |
|||
who are the current owners of the copyright. As explained in the licence, the |
|||
files are available for use for most purposes provided acknowledgement |
|||
and distribution of the documentation is made. |
|||
</P> |
|||
<P> |
|||
<a name="IREF05"><h2>LEXICOGRAPHICAL DETAILS</h2></a> |
|||
</P> |
|||
<P> |
|||
</P> |
|||
<OL type="A"> |
|||
<LI>Inflections, etc. |
|||
<P> |
|||
In general no inflections of verbs or adjectives have been included, |
|||
except in idiomatic expressions. Adverbs |
|||
formed from adjectives (e.g., -ku or -ni) are generally not included. |
|||
Verbs are, of course, in the plain or "dictionary" form. |
|||
</P> |
|||
<P> |
|||
Composed forms, such as adverbs taking the "to" particle, keiyoudoushi |
|||
adjectives, etc. are only included in their root form; however, the
|||
part-of-speech (POS) marker is used to indicate their status. |
|||
</P> |
|||
<P> |
|||
Nouns which can form a verb with the auxiliary verb "suru" only appear
|||
in their noun form, but have a POS marker: "vs", to indicate the existence |
|||
of a verbal form. In general the gloss only relates to the noun itself, but |
|||
entries are being progressively expanded to include the verbal glosses as well. |
|||
</P> |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>Part of Speech Marking |
|||
<P> |
|||
The dictionary includes one or more Part of Speech (POS) markings on almost |
|||
every entry. Examples include: "adj-i" (adjective - 形容詞), "n" (noun - |
|||
名詞), "prt" (particle - 助詞), etc. |
|||
<a HREF="http://www.edrdg.org/jmdictdb/cgi-bin/edhelp.py?svc=jmdict&sid=#kw_pos">(Full POS list) </a> |
|||
</P> |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>Field of Application |
|||
<P> |
|||
A number of entries are marked with a specific field of application, e.g. |
|||
"chem" (chemistry), "math" (mathematics), etc. |
|||
<a HREF="http://www.edrdg.org/jmdictdb/cgi-bin/edhelp.py?svc=jmdict&sid=#kw_fld">(Full field list) </a> |
|||
</P> |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>Miscellaneous Markings |
|||
<P> |
|||
A number of miscellaneous tags are included in entries to provide |
|||
additional information in a standardized form, e.g. "col" (colloquialism),
|||
"sl" (slang), "uk" (term usually in kana), etc. |
|||
<a HREF="http://www.edrdg.org/jmdictdb/cgi-bin/edhelp.py?svc=jmdict&sid=#kw_misc">(Full list) </a> |
|||
</P> |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>Word Priority Marking |
|||
<P> |
|||
The ke_pri and equivalent re_pri fields in the JMdict file |
|||
are provided to record |
|||
information about the relative commonness or priority of the entry, and consist |
|||
of codes indicating the word appears in various references which |
|||
can be taken as an indication of the frequency with which the word |
|||
is used. This field is intended for use either by applications which |
|||
want to concentrate on entries of a particular priority, or to |
|||
generate subset files. |
|||
The current values in this field are: |
|||
</P> |
|||
<OL type="a"> |
|||
<LI>news1/2: appears in the "wordfreq" file compiled by Alexandre Girardi |
|||
from the Mainichi Shimbun. (See the Monash ftp archive for a copy.) |
|||
Words in the first 12,000 in that file are marked "news1" and words |
|||
in the second 12,000 are marked "news2". |
|||
</LI> |
|||
<LI>ichi1/2: appears in the "Ichimango goi bunruishuu", Senmon Kyouiku |
|||
Publishing, Tokyo, 1998. (The entries marked "ichi2" were |
|||
demoted from ichi1 because they were observed to have low |
|||
frequencies in the WWW and newspapers.) |
|||
</LI> |
|||
<LI>spec1 and spec2: a small number of words use this marker when they |
|||
are detected as being common, but are not included in other lists. |
|||
</LI> |
|||
<LI>gai1/2: common loanwords, also based on the wordfreq file. |
|||
</LI> |
|||
<LI>nfxx: this is an indicator of frequency-of-use ranking in the |
|||
wordfreq file. "xx" is the number of the set of 500 words in which |
|||
the entry can be found, with "01" assigned to the first 500, "02" |
|||
to the second, and so on. |
|||
</LI> |
|||
</OL> |
|||
<P> |
|||
Entries with news1, ichi1, spec1/2 and gai1 values are marked with |
|||
a "(P)" in the EDICT and EDICT2 files. |
|||
</P> |
|||
<P> |
|||
While the priority markings accurately reflect the status of entries with |
|||
regard to the various sources, they must be seen as |
|||
only providing a crude indication of how common a word or expression actually |
|||
is in Japanese. The "(P)" markings in the EDICT and EDICT2 files appear to |
|||
identify a useful subset of "common" words, but there are clearly some |
|||
marked entries which are not very common, and there are clearly unmarked |
|||
entries which are in common use, particularly in the spoken language. |
|||
</P> |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>Okurigana Variants |
|||
<P> |
|||
Okurigana variants in headwords are handled by including each variant form |
|||
as a headword. This is to enable software to match with variant forms. |
|||
</P> |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>Spellings |
|||
<P> |
|||
As far as possible variants of English translation and spelling are included. |
|||
Where appropriate different translations are included for |
|||
national variants (e.g. autumn/fall, tap/faucet, etc.). Common spelling |
|||
variations such as -our/-or and -ize/-ise are handled either by repeating |
|||
the gloss in both spellings or appending spelling variants in parentheses. |
|||
No attempt is made to tag English spellings according to country of usage. |
|||
</P> |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>Loanwords and Regional Words |
|||
<P> |
|||
For loanwords (gairaigo) which have not been derived from English words, |
|||
the source language and the word in that language are included. Languages have |
|||
been coded in the three-letter codes from the ISO 639-2:1998 "Codes for the |
|||
representation of names of languages" standard, e.g. "(fre: avec)" in the |
|||
EDICT/EDICT2 files and |
|||
&lt;lsource xml:lang="fre"&gt;avec&lt;/lsource&gt; in the JMdict
|||
file. |
|||
<a HREF="http://www.edrdg.org/jmdictdb/cgi-bin/edhelp.py?svc=jmdict&sid=#kw_lang">(Full list </a> |
|||
of language tags) |
|||
</P> |
|||
<P> |
|||
In the case of gairaigo which have a meaning which is not apparent from the |
|||
original (usually English) words, the words in the source language are |
|||
included as: "lang: original words", e.g. |
|||
</P> |
|||
<P> |
|||
</P> |
|||
<DL><DD> |
|||
コンクール /(n) competition (fre: concours)/contest/ |
|||
</DL> |
|||
<P> |
|||
In some cases the entries are pseudo-loanwords that have been constructed |
|||
in Japan from foreign (usually English) words or word fragments |
|||
(e.g. 和製英語 - waseieigo). These are tagged with "wasei" in EDICT/EDICT2 |
|||
entries, e.g. |
|||
</P> |
|||
<DL><DD> |
|||
アゲンストウィンド /(n) head wind (wasei: against wind)/adverse wind/ |
|||
</DL> |
|||
<P> |
|||
and in JMdict with the "ls_wasei" attribute |
|||
e.g. &lt;lsource ls_wasei="y"&gt;against wind&lt;/lsource&gt;
|||
</P> |
|||
<P> |
|||
A number of tags |
|||
are used to indicate that a word or phrase is associated with a particular |
|||
regional language variant within Japan, e.g. "ksb" (Kansai-ben). |
|||
<a HREF="http://www.edrdg.org/jmdictdb/cgi-bin/edhelp.py?svc=jmdict&sid=#kw_dial">(Full list) </a> |
|||
</P> |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
</OL> |
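Two of the conventions above lend themselves to a short sketch: deciding whether a set of priority codes earns the "(P)" marker, converting an "nfxx" code into a rank range in the wordfreq file, and pulling a source-language tag such as "(fre: concours)" out of an EDICT gloss. The function names and patterns are hypothetical, and the tag pattern assumes a lowercase three-letter language code or "wasei" followed by a colon.

```python
import re

# Codes that, per the text, earn a "(P)" marker in EDICT and EDICT2.
COMMON_CODES = {'news1', 'ichi1', 'spec1', 'spec2', 'gai1'}
NF_CODE = re.compile(r'^nf(\d{2})$')
# Matches "(fre: concours)" or "(wasei: against wind)" inside a gloss.
SOURCE_TAG = re.compile(r'\((?P<lang>[a-z]{3}|wasei):\s*(?P<words>[^)]*)\)')


def isMarkedCommon(priCodes):
    """True if any ke_pri/re_pri code is one that triggers "(P)"."""
    return any(code in COMMON_CODES for code in priCodes)


def nfRankRange(code):
    """Map "nfxx" to the (first, last) rank of its set of 500 words."""
    match = NF_CODE.match(code)
    if not match:
        return None
    bucket = int(match.group(1))
    if bucket < 1:
        return None
    return ((bucket - 1) * 500 + 1, bucket * 500)


def findSourceTag(gloss):
    """Return (language, original words) from a gloss, or None."""
    match = SOURCE_TAG.search(gloss)
    if not match:
        return None
    return match.group('lang'), match.group('words')
```

For example, nfRankRange('nf02') covers ranks 501 to 1000, and a part-of-speech marker such as "(n)" is ignored by findSourceTag because it has no colon.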
|||
<P> |
|||
<a name="IREF06"><h2>OTHER LANGUAGES</h2></a> |
|||
</P> |
|||
<P> |
|||
The JMdict file has the capacity to record glosses for Japanese headwords in |
|||
many languages. As part of the daily build of the file, the Japanese |
|||
headwords are matched against a number of other dictionary files and |
|||
glosses included for those languages. JMdict is currently distributed in |
|||
two versions: a basic version |
|||
in which there are only English glosses, and a full version in which there are |
|||
glosses included in German (111,000 entries), Russian (77,000), |
|||
Hungarian (51,000), Spanish (39,000), Italian (38,000), Dutch (29,000), |
|||
Swedish (16,000), French (15,000) and Slovenian (9,000). |
|||
</P> |
|||
<P> |
|||
Details of the dictionary files used for the non-English glosses |
|||
in JMdict can be found in the |
|||
<a HREF="http://www.edrdg.org/wwwjdic/wwwjdicinf.html#dicfilf_tag">WWWJDIC documentation. </a> |
|||
</P> |
|||
<P> |
|||
<a name="IREF08a"><h2>RELATED PROJECTS</h2></a> |
|||
</P> |
|||
<P> |
|||
A number of other Japanese dictionary projects are closely related to this |
|||
one. Among them are: |
|||
</P> |
|||
<OL type="a"> |
|||
<P> |
|||
</P> |
|||
<LI>the |
|||
<a HREF="http://www.edrdg.org/enamdict/enamdict_doc.html">ENAMDICT/JMnedict </a> |
|||
Japanese Proper Names Dictionary project, which currently has nearly |
|||
740,000 named entities. The files are available in EDICT or XML formats. |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>the |
|||
<a HREF="http://www.edrdg.org/kanjidic/kanjidic.html">KANJIDIC </a> |
|||
and |
|||
<a HREF="http://www.edrdg.org/kanjidic/kanjd2index.html">KANJIDIC2 </a> |
|||
project, which maintains and distributes databases of information about |
|||
kanji. |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>the |
|||
<a HREF="http://www.edrdg.org/jmdict/compdic_doc.html">COMPDIC </a> |
|||
file in EDICT format of computing and telecomms terminology. In 2008 the |
|||
COMPDIC material was included in the main EDICT/JMdict database with tagging |
|||
indicating that the entries relate to ICT. A separate "COMPDIC" file is extracted
|||
for distribution. |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
<LI>the |
|||
<a HREF="http://www.edrdg.org/krad/kradinf.html">RADKFILE/KRADFILE </a> |
|||
file of visual elements in kanji, which can be used for finding kanji |
|||
in dictionaries. |
|||
<P> |
|||
</P> |
|||
</LI> |
|||
</OL> |
|||
<P> |
|||
<a name="IREF09"><h2>ACKNOWLEDGEMENTS</h2></a> |
|||
</P> |
|||
<P> |
|||
Since 1991 a large number of people have contributed to this project; far too |
|||
many to list here. All their contributions have been most welcome, indeed |
|||
without the assistance of speakers and students of Japanese this |
|||
project would not have achieved as much. |
|||
</P> |
|||
<P> |
|||
The EDICT/JMdict has been granted approval to use material from the |
|||
<a HREF="http://compling.hss.ntu.edu.sg/wnja/index.en.html">Japanese WordNet. </a> |
|||
This approval is most welcome. |
|||
</P> |
|||
<P> |
|||
<a name="IREF10"><h2>PUBLICATIONS</h2></a> |
|||
</P> |
|||
<P> |
|||
Some publications by Jim Breen about the EDICT/JMdict project: |
|||
</P> |
|||
<UL> |
|||
<LI>paper about JMdict presented at the COLING Multilingual |
|||
Linguistic Resources Workshop in Geneva in August 2004. |
|||
<a HREF="http://www.edrdg.org/~jwb/paperdir/jmdictart.html">(html) </a> |
|||
<a HREF="http://www.edrdg.org/~jwb/paperdir/jmdictart.pdf">(pdf) </a> |
|||
<BR> |
|||
<I>(This paper should be referenced when citing the dictionary in a publication.)</I> |
|||
</LI> |
|||
<LI>an earlier |
|||
<a HREF="ws2002_paper.html">JMdict paper </a> |
|||
about some of the practical issues, presented at the Papillon |
|||
Project workshop in Tokyo in July 2002. |
|||
</LI> |
|||
<LI>a 1999 workshop paper about WWWJDIC; |
|||
<a HREF="http://www.edrdg.org/~jwb/paperdir/wwwjdic_article2.html">(updated 2003 version) </a> |
|||
<a HREF="http://nihongo.monash.edu/wwwjdic_article/wwwjdic_article.html">(1999 version). </a> |
|||
</LI> |
|||
<LI>an overview paper about EDICT presented at the JSAA conference in 1995; |
|||
<a HREF="http://www.edrdg.org/~jwb/paperdir/hpaper.html">(html) </a> |
|||
<!-- #url ftp://ftp.monash.edu/pub/nihongo/elec_dic.ps.gz (postscript) --> |
|||
</LI> |
|||
<LI>An early technical report from 1993; |
|||
<a HREF="http://www.edrdg.org/~jwb/paperdir/ejdic_report1.pdf">(pdf) </a> |
|||
<a HREF="ftp://ftp.monash.edu/pub/nihongo/ejdic_report1.ps.gz">(postscript) </a> |
|||
</LI> |
|||
</UL> |
|||
</BODY></HTML> |