
Natural Language Processing Notes – Class 10 AI (417)

Master Class 10 AI Natural Language Processing!
Simplified notes on NLP Applications, Chatbots, Text Processing (Sentence Segmentation, Tokenization, Stemming, Lemmatization), Bag of Words and TFIDF. Includes diagrams for easy and quick exam revision. These Natural Language Processing notes are prepared as per the latest CBSE curriculum.

Natural Language Processing

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that enables computers to understand, interpret, and process human language. It helps machines analyze text or speech and extract meaningful information, allowing them to communicate with humans more effectively.

Why is NLP important?

  • It enables communication between humans and computer systems.
  • It helps computers understand intent and context in language.
  • It supports the development of tools and techniques for better interaction with machines.

Applications of Natural Language Processing (NLP)

  • Autogenerated Captions: Converts speech into text in real-time, improving accessibility of videos.
    Example: YouTube captions, Google Meet subtitles
  • Voice Assistants: Understand and process spoken language to perform tasks.
    Example: Google Assistant, Alexa, Siri
  • Language Translation: Translates text or speech from one language to another, enabling global communication.
    Example: Google Translate
  • Sentiment Analysis: Identifies whether a text expresses positive, negative, or neutral feelings.
    Use: Understanding customer opinions and feedback
  • Keyword Extraction: Automatically finds important words or phrases from text.
    Use: Analyzing trends and improving customer service
  • Text Classification: Categorizes text into predefined groups.
    Example: News articles classified as Sports, Food, Politics

Stages of Natural Language Processing (NLP)

The stages of Natural Language Processing (NLP) typically involve the following:

  1. Lexical Analysis
  2. Syntactic Analysis
  3. Semantic Analysis
  4. Discourse Integration
  5. Pragmatic Analysis

Lexical Analysis

  • It’s the process of breaking down the input text into tokens such as words, sentences, or structured paragraphs.
  • A lexicon is the collection of words and phrases used in a language.

Syntactic Analysis

  • It’s the process of checking the grammar of sentences and phrases.
  • It forms relations among words and eliminates logically incorrect sentences.

Semantic Analysis

  • In this stage, input text is checked for meaning, and every word and phrase is checked for meaningfulness.

Discourse Integration

  • It refers to how sentences in a conversation or text connect to create meaningful and clear communication.
  • It ensures that each sentence is logically related to the sentences before and after it.

Pragmatic Analysis

  • It focuses on understanding the intended meaning of a sentence based on context, tone, and real-world knowledge rather than just its literal meaning.

Chatbots

  • One of the most common applications of Natural Language Processing (NLP) is a chatbot.
  • A chatbot is a computer program designed to simulate human conversation through text or voice.
  • It can interact with users, understand their queries, and respond accordingly.
  • Over time, chatbots can learn and improve their interactions.
  • Chatbots are widely used to answer questions, solve customer problems, provide support, generate leads, and boost sales on e-commerce platforms.
  • They help businesses communicate with users quickly and efficiently.
  • Some popular chatbots are: Eliza, Mitsuku, Cleverbot, Singtel

Types of Chatbots

  • Script-bot
  • Smart-bot
| Script Bot | Smart Bot |
| --- | --- |
| Easy to make | Flexible and powerful |
| Works around a pre-defined script programmed into it | Works on bigger databases and other resources directly |
| Mostly free and easy to integrate with messaging platforms | Learns with more data |
| Little or no language processing skills | Requires coding to develop |
| Limited functionality | Wide functionality |

Text Processing

Text processing is the process in NLP where raw human language is cleaned, simplified, and prepared so that computers can understand and work with it effectively.

Steps of Text Processing:

  1. Text Normalisation:
    The raw text is cleaned and simplified by removing unnecessary elements and converting it into a standard form.

  2. Bag of Words:
    After cleaning, the text is converted into numerical form by representing it as words and their frequency, so that computers can analyze it easily.

Text Normalization

In text normalization, the text is cleaned and simplified by removing unnecessary elements like punctuation, symbols, and converting text into a standard form.

In this process, we often work with a collection of text from multiple documents, which is called a corpus. All the text from these documents together forms the corpus on which text normalisation techniques are applied.

Steps in Text Normalization

  • Sentence Segmentation
  • Tokenization
  • Removing Stop words, Special Characters, Numbers
  • Converting Text to Common Case
  • Stemming
  • Lemmatization

Sentence Segmentation

In sentence segmentation, the entire corpus is divided into sentences.

The textual data from multiple documents put together is called a corpus. For example, the corpus below segments into three sentences:

Natural Language Processing is interesting. It helps computers understand human language. It is widely used in chatbots.
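The segmentation step can be sketched in Python with a simple rule that splits the corpus after sentence-ending punctuation. This is a toy rule for illustration; real NLP libraries handle cases like abbreviations ("Dr.") more carefully.

```python
import re

corpus = ("Natural Language Processing is interesting. "
          "It helps computers understand human language. "
          "It is widely used in chatbots.")

# Split wherever a sentence-ending mark (. ! ?) is followed by whitespace.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", corpus) if s.strip()]
for s in sentences:
    print(s)
```

Running this prints the three sentences of the corpus on separate lines.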

Tokenization

  • After segmenting the sentences, each sentence is then further divided into tokens.
  • Token is a term used for any word, number, or special character occurring in a sentence.
Example: "Natural Language Processing is interesting." → [Natural, Language, Processing, is, interesting, .]
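A minimal tokenizer can be sketched with a regular expression that pulls out words and punctuation as separate tokens (library tokenizers use more robust rules):

```python
import re

sentence = "Natural Language Processing is interesting."

# \w+ grabs runs of letters/digits; [^\w\s] grabs single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# → ['Natural', 'Language', 'Processing', 'is', 'interesting', '.']
```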

Removing Stop words, Special Characters and Numbers

Stop words (such as articles, pronouns, and prepositions) are removed from the token list because they carry little meaningful information and are not important for analysis.

Special characters and numbers may also be removed if they are not useful, but they can be kept when they carry important information (like in email IDs).

Converting Text to a Common Case

After removing stop words, the text is converted into a single case (usually lowercase). This prevents the computer from treating the same words as different just because of capital letters.
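Both cleaning steps (dropping stop words and lowercasing) can be sketched together; the stop-word list below is a small illustrative subset, not a complete one:

```python
# Small illustrative stop-word list (real lists contain far more words).
stop_words = {"is", "it", "in", "a", "an", "the", "and", "to"}

tokens = ["Natural", "Language", "Processing", "is", "interesting"]

# Lowercase each token, then drop stop words and non-alphanumeric tokens.
cleaned = [t.lower() for t in tokens
           if t.lower() not in stop_words and t.isalnum()]
print(cleaned)  # → ['natural', 'language', 'processing', 'interesting']
```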

Stemming

Stemming is the process in which the affixes of words are removed and the words are converted to their base form.

In stemming, words are reduced to their base form by removing prefixes or suffixes. However, the resulting word may not always be meaningful (e.g., studies → studi). Stemming does not check meaning—it simply removes affixes, which makes it a fast process.

| Word | Affix | Stem |
| --- | --- | --- |
| going | -ing | go |
| goes | -es | go |
| studies | -es | studi |
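The suffix-stripping idea can be illustrated with a toy stemmer. Real stemmers (such as NLTK's PorterStemmer) apply many more rules, but the core behaviour is the same: strip a suffix without checking whether the result is a real word.

```python
def stem(word):
    # Toy stemmer: remove the first matching suffix; no dictionary check,
    # so outputs like "studi" may not be meaningful words.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

for w in ("going", "goes", "studies"):
    print(w, "->", stem(w))
# going -> go, goes -> go, studies -> studi
```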

Lemmatization

  • Stemming and lemmatization are alternative processes, as both are used to remove affixes.
  • But in lemmatization, the word we get after affix removal is always a meaningful one.
| Word | Affix | Lemma |
| --- | --- | --- |
| going | -ing | go |
| goes | -es | go |
| studies | -es | study |
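A lemmatizer maps each word form to its dictionary form. The sketch below stands in for a real lemmatizer (such as NLTK's WordNetLemmatizer) with a tiny hand-written lemma dictionary; unlike stemming, every output is a real word.

```python
# Toy lemma dictionary covering only the words in the table above.
lemmas = {"going": "go", "goes": "go", "studies": "study"}

def lemmatize(word):
    # Fall back to the word itself if it is not in the dictionary.
    return lemmas.get(word, word)

for w in ("going", "goes", "studies"):
    print(w, "->", lemmatize(w))
# going -> go, goes -> go, studies -> study
```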

Bag of Words

  • Bag of Words is a Natural Language Processing model which helps in extracting features out of the text which can be helpful in machine learning algorithms.
  • In the bag of words, we get the occurrences of each word and construct the vocabulary for the corpus.
  • Bag of words gives us two things:
    • Vocabulary of words for corpus
    • Frequency of these words

Steps to implement Bag of Words

  • Text Processing
  • Create a Dictionary
  • Create a Document Vector
  • Create Document Vectors for all the documents

Step 1: Collect Data & Pre-process

We start with documents and apply text normalisation.

Original Documents:

  • Document 1: Aman and Avni are stressed
  • Document 2: Aman went to a therapist
  • Document 3: Avni went to download a health chatbot

After Text Normalisation (Tokens):

  • Document 1: [aman, and, avni, are, stressed]
  • Document 2: [aman, went, to, a, therapist]
  • Document 3: [avni, went, to, download, a, health, chatbot]

Step 2: Create Dictionary (Vocabulary)

List all unique words from all documents.

Dictionary:

  • aman, and, avni, are, stressed
  • went, to, a, therapist
  • download, health, chatbot

(Each word is written only once, even if repeated in documents.)

Step 3: Create Document Vector (Example: Document 1)

For each word in the dictionary:

  • Put 1 if the word is present
  • Put 0 if the word is absent
| Word | aman | and | avni | are | stressed | went | to | a | therapist | download | health | chatbot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Document 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Step 4: Create Vectors for All Documents

| Document | aman | and | avni | are | stressed | went | to | a | therapist | download | health | chatbot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Document 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Document 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Document 3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |

Final Output

  • The table above is called the Document Vector Table
  • Rows = Documents
  • Columns = Words (Vocabulary)
  • Values = Frequency (0 or 1 here)
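The four steps above can be sketched in a few lines of Python. (Libraries like scikit-learn provide this as `CountVectorizer`; the version below is a from-scratch illustration using the three documents of the example.)

```python
# Tokenised documents after text normalisation (Step 1).
documents = [
    ["aman", "and", "avni", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["avni", "went", "to", "download", "a", "health", "chatbot"],
]

# Step 2: build the dictionary (vocabulary) of unique words, in first-seen order.
vocabulary = []
for doc in documents:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3-4: one vector per document, counting how often each word occurs.
vectors = [[doc.count(word) for word in vocabulary] for doc in documents]

print(vocabulary)
for v in vectors:
    print(v)
```

The printed vectors match the Document Vector Table row by row.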

TFIDF

  • It stands for Term Frequency – Inverse Document Frequency.
  • It helps us identify the value of each word.
  • The purpose of TFIDF is to find those rare and valuable words which occur the least but add the most value to the corpus.

Term Frequency

  • Term frequency is the frequency of a word in one document.
  • Term frequency can easily be found from the document vector table.

| Document | aman | and | avni | are | stressed | went | to | a | therapist | download | health | chatbot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Document 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Document 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Document 3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |

Inverse Document Frequency

  • Document frequency is the number of documents in which a word occurs, irrespective of how many times it occurs in those documents.
  • For inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator.

Document Frequency

| Word | aman | and | avni | are | stressed | went | to | a | therapist | download | health | chatbot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DF | 2 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 1 |

Inverse Document Frequency

| Word | aman | and | avni | are | stressed | went | to | a | therapist | download | health | chatbot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IDF | 3/2 | 3/1 | 3/2 | 3/1 | 3/1 | 3/2 | 3/2 | 3/2 | 3/1 | 3/1 | 3/1 | 3/1 |

Final TFIDF Formula

TFIDF(W) = TF(W) × log(IDF(W))

Example: TFIDF(aman) for Document 1 = 1 × log(3/2) ≈ 0.176

| Document | aman | and | avni | are | stressed | went | to | a | therapist | download | health | chatbot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Document 1 | 0.176 | 0.477 | 0.176 | 0.477 | 0.477 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Document 2 | 0.176 | 0 | 0 | 0 | 0 | 0.176 | 0.176 | 0.176 | 0.477 | 0 | 0 | 0 |
| Document 3 | 0 | 0 | 0.176 | 0 | 0 | 0.176 | 0.176 | 0.176 | 0 | 0.477 | 0.477 | 0.477 |
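The whole TFIDF calculation for this example can be reproduced in Python. The sketch below uses log base 10 (which gives the 0.176 and 0.477 values in the table) and the document vectors from the Bag of Words step:

```python
import math

vocabulary = ["aman", "and", "avni", "are", "stressed", "went",
              "to", "a", "therapist", "download", "health", "chatbot"]
# Document vectors (term frequencies) from the Bag of Words step.
vectors = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1],
]
N = len(vectors)  # total number of documents

tfidf = []
for vec in vectors:
    row = []
    for j, tf in enumerate(vec):
        # Document frequency: how many documents contain word j.
        df = sum(1 for v in vectors if v[j] > 0)
        # TFIDF(W) = TF(W) * log10(N / DF(W))
        row.append(round(tf * math.log10(N / df), 3))
    tfidf.append(row)

for row in tfidf:
    print(row)
```

Note that log10(3/2) ≈ 0.176 and log10(3/1) ≈ 0.477, matching the table above.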

TFIDF – Key Points

  • Words that appear frequently in all documents get low TF-IDF values and are usually treated as stop words.
  • A word gets a high TF-IDF value when it appears often in one document but rarely in others, showing it is important for that document.
  • Higher TF-IDF values indicate more important words for understanding and processing the text.

Applications of TFIDF

  • Document Classification:
    Helps in identifying the type or category of a document.
  • Topic Modelling:
    Helps in predicting the main topic of a group of documents (corpus).
  • Information Retrieval System:
    Helps in extracting important and relevant information from large text data.
  • Stop Word Filtering:
    Helps in removing unnecessary and less meaningful words from text.
