Natural Language Processing Notes – Class 10 AI (417)
Master Class 10 AI Natural Language Processing!
Simplified notes on NLP Applications, Chatbots, Text Processing (Sentence Segmentation, Tokenization, Stemming, Lemmatization), Bag of Words and TFIDF. Includes diagrams for easy and quick exam revision. These Natural Language Processing notes are prepared as per the latest CBSE curriculum.
Natural Language Processing
Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that enables computers to understand, interpret, and process human language. It helps machines analyze text or speech and extract meaningful information, allowing them to communicate with humans more effectively.
Why is NLP important?
- It enables communication between humans and computer systems.
- It helps computers understand intent and context in language.
- It supports the development of tools and techniques for better interaction with machines.
Applications of Natural Language Processing (NLP)
- Autogenerated Captions: Converts speech into text in real time, improving accessibility of videos.
  Example: YouTube captions, Google Meet subtitles
- Voice Assistants: Understand and process spoken language to perform tasks.
  Example: Google Assistant, Alexa, Siri
- Language Translation: Translates text or speech from one language to another, enabling global communication.
  Example: Google Translate
- Sentiment Analysis: Identifies whether a text expresses positive, negative, or neutral feelings.
  Use: Understanding customer opinions and feedback
- Keyword Extraction: Automatically finds important words or phrases from text.
  Use: Analyzing trends and improving customer service
- Text Classification: Categorizes text into predefined groups.
  Example: News articles classified as Sports, Food, Politics
Stages of Natural Language Processing (NLP)
The stages of Natural Language Processing (NLP) typically involve the following:
- Lexical Analysis
- Syntactic Analysis
- Semantic Analysis
- Discourse Integration
- Pragmatic Analysis
Lexical Analysis
- It’s the process of breaking down the input text into tokens such as words, sentences, or structured paragraphs.
- Lexicon stands for a collection of various words and phrases used in a language.
Syntactic Analysis
- It’s the process of checking grammar of sentences and phrases.
- It forms relations among words and eliminates logically incorrect sentences.
Semantic Analysis
- In this stage, input text is checked for meaning, and every word and phrase is checked for meaningfulness.
Discourse Integration
- It refers to how sentences in a conversation or text connect to create meaningful and clear communication.
- It ensures that each sentence is logically related to the sentences before and after it.
Pragmatic Analysis
- It focuses on understanding the intended meaning of a sentence based on context, tone, and real-world knowledge rather than just its literal meaning.
Chatbots
- One of the most common applications of Natural Language Processing (NLP) is a chatbot.
- A chatbot is a computer program designed to simulate human conversation through text or voice.
- It can interact with users, understand their queries, and respond accordingly.
- Over time, chatbots can learn and improve their interactions.
- Chatbots are widely used to answer questions, solve customer problems, provide support, generate leads, and boost sales on e-commerce platforms.
- They help businesses communicate with users quickly and efficiently.
- Some popular chatbots are: Eliza, Mitsuku, Cleverbot, Singtel
Types of Chatbots
- Script-bot
- Smart-bot
| Script Bot | Smart Bot |
| Easy to make | Flexible and powerful |
| Works around a pre-defined script programmed in them | Works on bigger databases and other resources directly |
| Mostly free and easy to integrate with messaging platforms | Learns with more data |
| No or little language processing skills | Requires coding to develop |
| Limited functionality | Wide functionality |
Text Processing
Text processing is the process in NLP where raw human language is cleaned, simplified, and prepared so that computers can understand and work with it effectively.
Steps of Text Processing:
- Text Normalisation:
The raw text is cleaned and simplified, for example by removing stop words, special characters, and punctuation and converting the text to a common case.
- Bag of Words:
After cleaning, the text is converted into numerical form by representing it as words and their frequency, so that computers can analyze it easily.
Text Normalization
In text normalization, the text is cleaned and simplified by removing unnecessary elements like punctuation, symbols, and converting text into a standard form.
In this process, we often work with a collection of text from multiple documents, which is called a corpus. All the text from these documents together forms the corpus on which text normalisation techniques are applied.
Steps in Text Normalization
- Sentence Segmentation
- Tokenization
- Removing Stop words, Special Characters, Numbers
- Converting Text to Common Case
- Stemming
- Lemmatization
Sentence Segmentation
In sentence segmentation, the entire corpus is divided into sentences.
A collection of textual data from multiple documents put together is called a corpus.
| Natural Language Processing is interesting. It helps computers understand human language. It is widely used in chatbots. |
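The segmentation step above can be sketched in Python. This is a minimal illustration using a simple punctuation rule; real NLP libraries use far more careful rules (abbreviations, decimals, etc.), so treat this as a teaching sketch only.

```python
import re

corpus = ("Natural Language Processing is interesting. "
          "It helps computers understand human language. "
          "It is widely used in chatbots.")

# Split the corpus after sentence-ending punctuation followed by whitespace
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", corpus) if s.strip()]

for s in sentences:
    print(s)
```

Running this prints the three sentences of the corpus, one per line.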
Tokenization
- After segmenting the sentences, each sentence is then further divided into tokens.
- A token is the term used for any word, number, or special character occurring in a sentence.
| Natural Language Processing is interesting. It helps computers understand human language. |
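Tokenization can likewise be sketched with a simple pattern: word characters form one token, and each punctuation mark becomes its own token. This is only an assumed minimal rule, not how production tokenizers work.

```python
import re

sentence = "Natural Language Processing is interesting."

# \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(tokens)
```

Here the full stop is kept as a separate token, so the sentence yields six tokens.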
Removing Stop words, Special Characters and Numbers
Stop words (such as articles, pronouns, and prepositions) are removed from the token list because they carry little meaningful information and are not important for analysis.
Special characters and numbers may also be removed if they are not useful, but they can be kept when they carry important information (like in email IDs).
Converting Text to a Common Case
After removing stop words, the text is converted into a single case (usually lowercase). This prevents the computer from treating the same words as different just because of capital letters.
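The two steps above (removing stop words, then lowercasing) can be combined in a few lines. The stop-word list below is a tiny illustrative set chosen for this example; real stop-word lists contain over a hundred entries.

```python
tokens = ["Natural", "Language", "Processing", "is", "interesting"]

# A tiny illustrative stop-word set; real lists are much longer
stop_words = {"is", "a", "an", "the", "and", "to", "in", "it"}

# Lowercase every token and drop the stop words
cleaned = [t.lower() for t in tokens if t.lower() not in stop_words]

print(cleaned)
```

Note that lowercasing before the membership check ensures "Is" and "is" are treated the same.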
Stemming
Stemming is the process in which the affixes of words are removed and the words are converted to their base form.
In stemming, words are reduced to their base form by removing prefixes or suffixes. However, the resulting word may not always be meaningful (e.g., studies → studi). Stemming does not check meaning; it simply removes affixes, which makes it a fast process.
| Word | Affixes | Stem |
| going | ing | go |
| goes | es | go |
| studies | es | studi |
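The table above can be reproduced with a naive suffix-stripping stemmer. This is an assumed toy rule (strip the first matching suffix, keep at least two letters), not a real algorithm like Porter's, but it shows why stemming can produce non-words such as "studi".

```python
def simple_stem(word):
    """Strip a common suffix without checking that the result is a real word."""
    for suffix in ("ing", "es", "s"):
        # Only strip if at least two letters would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

for w in ("going", "goes", "studies"):
    print(w, "->", simple_stem(w))
```

"studies" loses "es" and becomes "studi", which is not a dictionary word, exactly as in the table.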
Lemmatization
- Stemming and lemmatization are alternative processes to each other, as both are used to remove affixes.
- But in Lemmatization, the word we get after affix removal is a meaningful one.
| Word | Affixes | Lemma |
| going | ing | go |
| goes | es | go |
| studies | es | study |
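Unlike stemming, lemmatization consults a vocabulary so the result is always a real word. The dictionary below is a toy lookup table assumed just for this example; real lemmatizers (e.g. WordNet-based ones) use full vocabularies and part-of-speech information.

```python
# Toy lemma dictionary for illustration only
lemmas = {"going": "go", "goes": "go", "studies": "study"}

def lemmatize(word):
    # Fall back to the word itself when no lemma is known
    return lemmas.get(word, word)

for w in ("going", "goes", "studies"):
    print(w, "->", lemmatize(w))
```

Note that "studies" maps to "study", a meaningful word, where the stemmer produced "studi".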
Bag of Words
- Bag of Words is a Natural Language Processing model which helps in extracting features out of the text which can be helpful in machine learning algorithms.
- In the bag of words, we get the occurrences of each word and construct the vocabulary for the corpus.
- Bag of words gives us two things:
- Vocabulary of words for corpus
- Frequency of these words
Steps to implement Bag of Words
- Text Processing
- Create a Dictionary
- Create a Document Vector
- Create Document Vectors for all the documents
Step 1: Collect Data & Pre-process
We start with documents and apply text normalisation.
Original Documents:
- Document 1: Aman and Avni are stressed
- Document 2: Aman went to a therapist
- Document 3: Avni went to download a health chatbot
After Text Normalisation (Tokens):
- Document 1: [aman, and, avni, are, stressed]
- Document 2: [aman, went, to, a, therapist]
- Document 3: [avni, went, to, download, a, health, chatbot]
Step 2: Create Dictionary (Vocabulary)
List all unique words from all documents.
Dictionary:
- aman, and, avni, are, stressed
- went, to, a, therapist
- download, health, chatbot
(Each word is written only once, even if repeated in documents.)
Step 3: Create Document Vector (Example: Document 1)
For each word in the dictionary:
- Put 1 if the word is present
- Put 0 if the word is absent
| Word | aman | and | avni | are | stressed | went | to | a | therapist | download | health | chatbot |
| Document 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Step 4: Create Vectors for All Documents
| Document | aman | and | avni | are | stressed | went | to | a | therapist | download | health | chatbot |
| Document 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Document 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Document 3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
Final Output
- The table above is called the Document Vector Table
- Rows = Documents
- Columns = Words (Vocabulary)
- Values = Frequency (0 or 1 here)
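The whole Bag of Words procedure (Steps 1-4) can be sketched directly from the example documents. The vocabulary is built in first-seen order so that it matches the dictionary above, and each document vector counts how often each vocabulary word occurs.

```python
# Tokens from the three normalised documents
docs = [
    ["aman", "and", "avni", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["avni", "went", "to", "download", "a", "health", "chatbot"],
]

# Step 2: build the vocabulary, each unique word once, in first-seen order
vocab = []
for doc in docs:
    for word in doc:
        if word not in vocab:
            vocab.append(word)

# Steps 3-4: one vector per document, counting each vocabulary word
vectors = [[doc.count(word) for word in vocab] for doc in docs]

print(vocab)
for v in vectors:
    print(v)
```

The first printed vector is [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], matching the Document 1 row of the Document Vector Table.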
TFIDF
- It refers to Term Frequency and Inverse Document Frequency
- It helps us identify the value of each word.
- The purpose of TFIDF is to find those rare and valuable words which occur the least but add the most value to the corpus.
Term Frequency
- Term frequency is the frequency of a word in one document.
- Term frequency can easily be found in the document vector table.
| Aman | And | Avni | Are | Stressed | Went | To | A | Therapist | Download | Health | Chatbot |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
Inverse Document Frequency
- Document Frequency is the number of documents in which the word occurs irrespective of how many times it has occurred in those documents.
- For inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator.
Document Frequency
| Aman | And | Avni | Are | Stressed | Went | To | A | Therapist | Download | Health | Chatbot |
| 2 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 1 |
Inverse Document Frequency
| Aman | And | Avni | Are | Stressed | Went | To | A | Therapist | Download | Health | Chatbot |
| 3/2 | 3/1 | 3/2 | 3/1 | 3/1 | 3/2 | 3/2 | 3/2 | 3/1 | 3/1 | 3/1 | 3/1 |
Final TFIDF Formula
TFIDF(W) = TF(W) * log(IDF(W)), where IDF(W) = (total number of documents) / (document frequency of W)
Example: TFIDF(aman) in Document 1 = 1 * log(3/2)
| Aman | And | Avni | Are | Stressed | Went | To | A | Therapist | Download | Health | Chatbot |
| 0.176 | 0.477 | 0.176 | 0.477 | 0.477 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.176 | 0 | 0 | 0 | 0 | 0.176 | 0.176 | 0.176 | 0.477 | 0 | 0 | 0 |
| 0 | 0 | 0.176 | 0 | 0 | 0.176 | 0.176 | 0.176 | 0 | 0.477 | 0.477 | 0.477 |
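The TFIDF values in the table can be verified with a few lines of Python, using base-10 logarithms (which is what produces 0.176 for log(3/2) and 0.477 for log(3/1)).

```python
import math

docs = [
    ["aman", "and", "avni", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["avni", "went", "to", "download", "a", "health", "chatbot"],
]
n_docs = len(docs)

def tfidf(word, doc):
    tf = doc.count(word)                      # term frequency in this document
    df = sum(1 for d in docs if word in d)    # number of documents containing the word
    return tf * math.log10(n_docs / df)       # TF * log(N / DF)

print(round(tfidf("aman", docs[0]), 3))   # appears in 2 of 3 documents
print(round(tfidf("and", docs[0]), 3))    # appears in only 1 document
```

This prints 0.176 and 0.477, matching the Document 1 row: "and" scores higher because it occurs in only one document.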
TFIDF – Key Points
- Words that appear frequently in all documents get low TF-IDF values and are usually treated as stop words.
- A word gets a high TF-IDF value when it appears often in one document but rarely in others, showing it is important for that document.
- Higher TF-IDF values indicate more important words for understanding and processing the text.
Applications of TFIDF
- Document Classification: Helps in identifying the type or category of a document.
- Topic Modelling: Helps in predicting the main topic of a group of documents (corpus).
- Information Retrieval System: Helps in extracting important and relevant information from large text data.
- Stop Word Filtering: Helps in removing unnecessary and less meaningful words from text.