Projects | NLP Laboratory

NEGAR is a Persian proofreading system and an add-on for the MS Word editor. Its purpose is to correct Persian texts and bring them into line with the standards of the Academy of Persian Language and Literature. The system is written in C#. It is installed as an extension to Microsoft Word and is available to ordinary users for editing Persian texts. The plugin has the following main sections:
       1- Standardization
       2- Correcting the spacing between words
       3- Correcting punctuation
       4- Converting non-Persian digits to Persian digits
       5- Converting numbers
       6- Converting numbers written in digits to words
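
A few of the corrections listed above can be illustrated with a minimal Python sketch. It is not NEGAR's code (NEGAR is written in C#), and the function names and the exact rules are assumptions chosen for illustration: digits are mapped to Persian digits, two Arabic letter forms are replaced by their Persian counterparts, and spacing around punctuation is tidied.

```python
import re

# All names below are hypothetical; the rules are illustrative, not NEGAR's.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))  # Persian 0-9
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))   # Arabic-Indic 0-9
LATIN_DIGITS = "0123456789"

_DIGIT_TABLE = str.maketrans(LATIN_DIGITS + ARABIC_DIGITS, PERSIAN_DIGITS * 2)
# Arabic yeh and kaf mapped to Persian yeh and keheh.
_LETTER_TABLE = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})
# Persian comma, semicolon, question mark, plus the exclamation mark.
_PUNCT = "\u060c\u061b\u061f!"


def to_persian_digits(text):
    """Replace Latin and Arabic-Indic digits with Persian digits."""
    return text.translate(_DIGIT_TABLE)


def standardize_letters(text):
    """Replace Arabic letter forms with their Persian counterparts."""
    return text.translate(_LETTER_TABLE)


def fix_punctuation_spacing(text):
    """Drop spaces before punctuation and add one after it when missing."""
    text = re.sub(r"\s+([" + _PUNCT + r".:])", r"\1", text)
    text = re.sub("([" + _PUNCT + r"])(?=\w)", r"\1 ", text)
    return text


def normalize(text):
    """Apply the three illustrative corrections in sequence."""
    return fix_punctuation_spacing(to_persian_digits(standardize_letters(text)))
```

A real proofreader needs many more rules (for example for half-spaces and for writing numbers out in words); the sketch only shows the general shape of rule-based text correction.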

Word plugin: 32-bit version, 64-bit version

FarsVazhe is a collection of Persian words built by combining several existing resources and dictionaries with a number of manually entered and edited words. The collection contains about 72,000 words and was gathered by students of the Natural Language Processing Laboratory of Shahid Beheshti University. Each word is recorded with its written form, phonetic form, syntactic category tag, frequency, and morphological attributes such as whether it is compound, derived, or simple.

With the growing number of text documents on the Web, finding the right information in a limited time is difficult. With tools such as summarizers, this massive amount of information can be managed by generating draft summaries. The proposed summarization method for news texts consists of three stages: preprocessing, processing, and summary generation.

1- Preprocessing Step : This stage includes segmentation (detecting sentence and word boundaries), the elimination of some expressions or verbs, the identification of numerical values and proper names, stemming with STeP-1, and the extraction of the required semantic information from FarsNet.

2- Processing Step : In this stage, each input sentence receives a feature score based on eight surface features of the text, and a relatedness and similarity score is calculated for each pair of sentences using the data extracted from FarsNet. The sentences are then grouped into three main clusters: similar sentences, related sentences, and the remaining sentences.

3- Final Step : In this stage, the summary is generated by selecting sentences from the clusters according to either their feature score or the number of sentences that are similar and related to them.
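
The three stages can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the laboratory's implementation: the eight features are not listed in the text, so two placeholder features (sentence position and length) stand in for them; plain word overlap (Jaccard similarity) replaces the FarsNet-based semantic score; and all function names and thresholds are hypothetical.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Preprocessing: split on whitespace that follows ., !, ? or the Persian ?."""
    parts = re.split(r"(?<=[.!?\u061f])\s+", text.strip())
    return [p for p in parts if p]


def tokens(sentence):
    """Preprocessing: lower-cased set of word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity; a stand-in for a FarsNet-based score."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def feature_score(index, sentence, total):
    """Placeholder surface features: earlier and longer sentences score higher."""
    position = 1.0 - index / total
    length = min(len(tokens(sentence)) / 10.0, 1.0)
    return position + length


def summarize(text, max_sentences=2, similar_at=0.6, related_at=0.2):
    sentences = split_sentences(text)
    toks = [tokens(s) for s in sentences]
    n = len(sentences)
    scores = [feature_score(i, s, n) for i, s in enumerate(sentences)]

    # Processing: count similar/related neighbours of every sentence.
    support = [0] * n
    similar_to = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        sim = jaccard(toks[i], toks[j])
        if sim >= related_at:
            support[i] += 1
            support[j] += 1
        if sim >= similar_at:
            similar_to[i].add(j)
            similar_to[j].add(i)

    # Summary generation: prefer well-supported, high-scoring sentences and
    # skip any sentence that is similar to one already chosen.
    ranked = sorted(range(n), key=lambda i: (support[i], scores[i]), reverse=True)
    chosen = []
    for i in ranked:
        if any(j in similar_to[i] for j in chosen):
            continue
        chosen.append(i)
        if len(chosen) == max_sentences:
            break
    return [sentences[i] for i in sorted(chosen)]
```

Skipping sentences that are similar to an already selected one keeps the summary from repeating itself, which is the practical reason for separating similar sentences from merely related ones.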

In the era of the World Wide Web, searching for information is easily done using search engines and online databases. Although this has played a significant role in sharing and disseminating knowledge, it also makes it harder to protect intellectual property rights against abuse. Plagiarism detection systems, also called document similarity detection systems, try to discover these kinds of abuse. The MAHTAB system is one of the projects defined in the field of plagiarism detection in scientific documents at the Natural Language Processing Laboratory of Shahid Beheshti University.

The MAHTAB project is a similarity detection system for scientific documents in the field of electrical and computer engineering. The system compares a query document with a database of twenty thousand papers and theses in this field, ranks the database documents by their similarity to the query document, and displays them to the user. In addition, the system determines the overall percentage of similarity between each query document and a source, and can also display the exact location of the similar passages in the two documents and report the percentage of similarity for each passage separately. Images in the documents are also compared and contribute to the overall similarity percentage. The MAHTAB system can currently identify exact copies, copies with changes, and some text manipulation techniques such as inserting and deleting sentences, splitting and merging sentences, moving words, and replacing words with their synonyms. The system is based on information retrieval methods, which enables it to run on massive databases. It currently supports Persian and English, and cross-lingual similarity detection between Persian and English is one of the planned extensions.
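
How a retrieval-based similarity check of this kind can work is sketched below in Python. It is only an illustration, not MAHTAB's actual method: it assumes word trigram "shingles" as the unit of comparison and an inverted index from shingles to documents, and the class and function names are hypothetical. The reported percentage is the share of the query's shingles that also occur in a source document.

```python
import re
from collections import defaultdict


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


class SimilarityIndex:
    """Inverted index from shingles to the documents that contain them."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for shingle in shingles(text, self.n):
            self.postings[shingle].add(doc_id)

    def query(self, text):
        """Rank indexed documents by the percentage of query shingles they share."""
        query_shingles = shingles(text, self.n)
        if not query_shingles:
            return []
        hits = defaultdict(int)
        for shingle in query_shingles:
            for doc_id in self.postings.get(shingle, ()):
                hits[doc_id] += 1
        ranked = [(doc_id, 100.0 * count / len(query_shingles))
                  for doc_id, count in hits.items()]
        return sorted(ranked, key=lambda pair: pair[1], reverse=True)


index = SimilarityIndex()
index.add("paper-a", "natural language processing is fun and useful")
index.add("paper-b", "completely different text about power systems")
ranking = index.query("natural language processing is fun")
```

Because only documents that share at least one shingle with the query are touched, the cost of a query depends on the overlap rather than on the size of the whole database. Exact shingle matching does not catch synonym replacement, which would need an additional lexical resource.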

Machine translation is one of the most widely used areas of natural language processing, and it faces many problems because of the ambiguity and complexity of the rules of natural language in both the source and the target language. The quality of machine translation also depends on the performance of the basic natural language processing components it uses. Machine translation typically refers to translating a text from a source language to a target language. Speech-to-speech translation combines speech-to-text and text-to-speech components with text translation.

There are two basic approaches to machine translation: rule-based and corpus-based. In the rule-based approach, a base of translation rules is built from linguistic studies; such rules normally do not have enough coverage, and the translated sentences may not be fluent. In the corpus-based approach, the linguistic knowledge the machine needs for translation is derived from a parallel corpus. A parallel corpus contains millions of equivalent sentences in the source and target languages.

Within the corpus-based approach, the statistical translation method has received attention since the early 1990s, and most recent research concerns this method. In this method, the various probabilities of the language are learned, and the output sentence with the highest probability is produced. The product is a Persian-English translator that has been trained in the domain of classical literature on a corpus of about one million sentences. The statistical model of this translator attempts to learn the difference in word order between Persian and English.

‌ STeP-1 : Standard Text preparation for Persian language ‌

Many natural language processing programs require a good deal of preprocessing on the input text so that the text is in a suitable format for higher-level processing. Such preprocessing includes segmentation, stemming, and syntactic category (part-of-speech) tagging. Users of natural language processing need a simple interface for basic text processing. STeP-1 is a software package that provides basic processing for the Persian language. The package includes a Persian tokenizer and text normalizer, a stemmer and morphological analyzer, and a syntactic category tagger. The software is written in C#. The subsystems of STeP-1 are described below.

Segmentation Subsystem : This subsystem divides the text into its constituent words and sentences. It corrects the spaces and half-spaces between Persian words. It also adjusts the text, to a degree, according to the orthographic rules of the Academy of Persian Language and Literature.

Stemming Subsystem : This subsystem finds the root of words, including a number of derived words, and analyzes their structure.

Syntactic Tag Subsystem : This subsystem specifies the syntactic category (part of speech) of the words in a sentence. A tool called TnT is used for this.

STeP-1 is provided as an API to users working on Persian language processing.
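
The kind of work the segmentation subsystem does can be sketched in Python. This is not the STeP-1 API (STeP-1 is written in C#, and its interface is not shown here); the function names and the two half-space rules are assumptions chosen for illustration. A half-space is represented by the zero-width non-joiner (U+200C), and the Persian letters are written as Unicode escapes to keep the code unambiguous.

```python
import re

ZWNJ = "\u200c"                 # the "half-space" used in Persian orthography
MI = "\u0645\u06cc"             # verbal prefix "mi"
NA = "\u0646"                   # negative marker "n" in "nemi"
HA = "\u0647\u0627"             # plural suffix "ha"
YE = "\u06cc"                   # optional "ye" after the plural suffix


def fix_half_spaces(text):
    """Heuristically join the prefix mi/nemi and the suffix ha/haye with a half-space."""
    text = re.sub(rf"\b({NA}?{MI}) (?=\w)", r"\1" + ZWNJ, text)
    text = re.sub(rf"(?<=\w) ({HA}{YE}?)\b", ZWNJ + r"\1", text)
    return text


def split_sentences(text):
    """Split on whitespace that follows ., !, ? or the Persian question mark."""
    parts = re.split(r"(?<=[.!?\u061f])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Words (keeping half-spaces inside a word) and single punctuation marks."""
    return re.findall(rf"[\w{ZWNJ}]+|[^\w\s]", sentence)


def segment(text):
    """Return one list of tokens per sentence after half-space correction."""
    return [tokenize(s) for s in split_sentences(fix_half_spaces(text))]
```

The two rules are heuristics and will occasionally join words that should stay separate; a production segmenter would check candidate joins against a lexicon.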

FarhangYar is a tool for developing a comprehensive dictionary of the Persian language, which is being pursued as a national plan by the Academy of Persian Language and Literature. The dictionary is being prepared from excerpts chosen from selected Persian texts of different periods. Besides storing and searching this large collection of selected texts, the tool supports compiling the dictionary according to the detailed and complex lexicographic style guide prepared in the Academy's Department of Lexicography. It provides for writing definitions of entries, managing their editorial workflow, managing and searching entries, managing users, and exporting a version of the dictionary in MS Word format.

Also, since this software is web-based, academics can contribute to the dictionary from anywhere in the world. As far as can be judged, this software with its implemented features is a unique example of a dictionary-writing tool created for the Persian language.

FarsNet is the first and most accurate Persian WordNet; it has been developed at Shahid Beheshti University with the support of the Iran Telecommunication Research Center. The latest released version of FarsNet is version 2, and version 3 has been prepared with 100,000 entries, which can be browsed from the bottom of the page. FarsNet version 2 has over 30,000 lexical entries (words or phrases) organized into about 20,000 synsets. For each entry at least one sense is defined, and each sense belongs to one and only one synset. Every synset either has a place in the hierarchy or is the head of a hierarchy. In addition, each synset, or at least one of its members, participates in at least one non-hierarchical relation. Where possible, each synset is also mapped to its equivalent synset in the Princeton English WordNet.

Finding Persian equivalents for foreign words is one of the concerns of researchers. Because a Persian equivalent should be expressive, logical, and as widely accepted as possible, the search for equivalents and their registration and approval need to be crowdsourced in a single system.
VazheYar is a system in which researchers can search for foreign words using domain filters and various regular expressions, and can vote for or against the proposed equivalents; these votes influence whether an equivalent is put forward as the approved equivalent of the Academy of Persian Language and Literature. The administration panel of the system, intended for users with special access, is designed and implemented so that such a user can define user groups with different approval weights, manage the access of registered users and the domains, view the history of unsuccessful searches, see the words that have no approved equivalent, and, if needed, revise the approved equivalent in light of the votes. A registered user with normal access can add word lists to the system and define equivalents for existing words. Words defined by these users become available for general use after they are confirmed by a user with special access.

Tools Demo

STep-1 - Tokenizer

STep-1 - Stemming

STep-1 - POS Tagger

STep-1 - Chunker

STep-1 - NER

Colloquial-to-formal conversion

Natural Language Processing Laboratory

Faculty of Computer Science and Engineering


Copyright © 2018 Beheshti Natural Language Processing Lab