Python NLP in Practice: 17 Classic Tasks Implemented


1. Tokenize the following text with the nltk library: "Natural Language Processing enables computers to understand human language."

This can be done with the following code:


import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
text = "Natural Language Processing enables computers to understand human language."
tokens = word_tokenize(text)
print(tokens)

The output is:


['Natural', 'Language', 'Processing', 'enables', 'computers', 'to', 'understand', 'human', 'language', '.']

2. Use the spaCy library to extract the named entities from the text "Google was founded by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University."

Here is the code that implements this with Python and spaCy:


import spacy

# Load the spaCy model (requires: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
text = "Google was founded by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University."

# Process the text
doc = nlp(text)

# Extract the named entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Running this code produces the following output:


Google ORG
Larry Page PERSON
Sergey Brin PERSON
Stanford University ORG

3. Use the TextBlob library to analyze the sentiment of the following text: "I am extremely happy with the service provided."

Here is a Python code example that analyzes the sentiment of this text with the TextBlob library:


from textblob import TextBlob

text = "I am extremely happy with the service provided."
blob = TextBlob(text)
sentiment = blob.sentiment
print(f"Polarity: {sentiment.polarity}, Subjectivity: {sentiment.subjectivity}")

Running this code prints the text's polarity and subjectivity. A polarity closer to 1 indicates a more positive sentiment and closer to -1 a more negative one; a subjectivity closer to 1 indicates a more subjective statement and closer to 0 a more objective one.

4. Use the sumy library to summarize the following text into two sentences: "Natural Language Processing (NLP) is a fascinating field at the intersection of computer science, artificial intelligence, and linguistics. It enables machines to understand, interpret, and generate human language, opening up a world of possibilities for applications ranging from chatbots and translation services to sentiment analysis and beyond. The evolution of NLP has been driven by significant advances in machine learning and deep learning, which have enabled more sophisticated and accurate models for language understanding. This book aims to bring these cutting-edge techniques to you in an accessible and practical way, regardless of your current level of expertise."

Here is the code that produces the summary with sumy:


from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

text = """Natural Language Processing (NLP) is a fascinating field at the intersection of computer science, artificial intelligence, and linguistics. It enables machines to understand, interpret, and generate human language, opening up a world of possibilities for applications ranging from chatbots and translation services to sentiment analysis and beyond. The evolution of NLP has been driven by significant advances in machine learning and deep learning, which have enabled more sophisticated and accurate models for language understanding. This book aims to bring these cutting - edge techniques to you in an accessible and practical way, regardless of your current level of expertise. """

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, 2)

for sentence in summary:
    print(sentence)

Running this code prints the two-sentence summary.

5. Train a naive Bayes classifier with scikit-learn on the following data and predict the sentiment of the new text "This experience was fantastic.": "I love this product" (positive); "This is the worst experience" (negative); "Absolutely fantastic!" (positive); "Not good at all" (negative).

Here is the code that implements this task with Python and scikit-learn:


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ['I love this product', 'This is the worst experience', 'Absolutely fantastic!', 'Not good at all']
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

classifier = MultinomialNB()
classifier.fit(X, labels)

new_text = ['This experience was fantastic']
X_new = vectorizer.transform(new_text)

prediction = classifier.predict(X_new)
print(prediction)

In the code above, the required libraries are imported first, then the text data and corresponding labels are defined; CountVectorizer converts the texts into a count matrix, MultinomialNB trains the naive Bayes classifier, and finally the new text is classified and the prediction is printed.

6. Use the nltk library to remove the stop words from the following text: "NLP enables computers to understand human language, which is a crucial aspect of artificial intelligence."

Here is the Python code that removes the stop words with the nltk library:


import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Sample text
text = "NLP enables computers to understand human language, which is a crucial aspect of artificial intelligence."

# Split the text into tokens
tokens = text.split()

# Get the set of English stop words
stop_words = set(stopwords.words('english'))

# Remove the stop words
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

# Print the result
print(filtered_tokens)

Running this code prints the text with the stop words removed.

7. Use the nltk library to stem the following text: "Stemming helps in reducing words to their root form, which can be beneficial for text processing."

Here is the code that stems this text with Python's nltk library:


from nltk.stem import PorterStemmer

# Sample text
text = "Stemming helps in reducing words to their root form, which can be beneficial for text processing."

# Split the text into tokens
tokens = text.split()

# Initialize the stemmer
stemmer = PorterStemmer()

# Stem each token
stemmed_tokens = [stemmer.stem(word) for word in tokens]

print("Original Tokens:")
print(tokens)

print("
Stemmed Tokens:")
print(stemmed_tokens)

8. Use the nltk library to lemmatize the following text: "Lemmatization is the process of reducing words to their base or root form."


from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')

# Sample text
text = "Lemmatization is the process of reducing words to their base or root form."

# Split the text into tokens
tokens = text.split()

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize each token
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

print("Original Tokens:")
print(tokens)

print("
Lemmatized Tokens:")
print(lemmatized_tokens)

9. Use a regular expression to extract all dates in the format "YYYY-MM-DD" from the following text: "The project started on 2021-01-15 and ended on 2021-12-31."

Python's re module handles this. Example code:


import re
text = "The project started on 2021 - 01 - 15 and ended on 2021 - 12 - 31."
pattern = r"d{4}-d{2}-d{2}"
dates = re.findall(pattern, text)
print("Extracted Dates:")
print(dates)

Running this code produces the following output:


Extracted Dates:
['2021-01-15', '2021-12-31']

10. Use the nltk library to tokenize the following text into words: "Tokenization is the first step in text preprocessing."

Here is the Python code that tokenizes this text into words with the nltk library:


import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "Tokenization is the first step in text preprocessing."
tokens = word_tokenize(text)
print(tokens)

Running this code outputs the list of tokens:


['Tokenization', 'is', 'the', 'first', 'step', 'in', 'text', 'preprocessing', '.']

11. Use the nltk library to split the following text into sentences: "Tokenization is essential. It breaks down text into smaller units."

Here is the Python code that implements this:


import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

# Sample text
text = "Tokenization is essential. It breaks down text into smaller units."

# Split the text into sentences
sentences = sent_tokenize(text)
print("Sentences:")
print(sentences)

The output is:


Sentences:
['Tokenization is essential.', 'It breaks down text into smaller units.']

12. Write a Python function that tokenizes the input text "Character tokenization is useful for certain tasks." into characters.


def character_tokenization(text):
    # Perform character tokenization
    characters = list(text)
    return characters

# Sample text
text = "Character tokenization is useful for certain tasks."
# Tokenize the text into characters
characters = character_tokenization(text)
print("Character Tokens:")
print(characters)

13. Use scikit-learn's TfidfVectorizer to convert the following text corpus into a TF-IDF representation: documents = ["Natural language processing is fun.", "Language models are important in NLP.", "Machine learning and NLP are closely related."]

This can be done with the following code:


from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text corpus
documents = [
    "Natural language processing is fun.",
    "Language models are important in NLP.",
    "Machine learning and NLP are closely related."
]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
X = vectorizer.fit_transform(documents)

# Convert the result to an array
tfidf_array = X.toarray()

# Get the feature names (vocabulary)
vocab = vectorizer.get_feature_names_out()

print("Vocabulary:")
print(vocab)
print("
TF - IDF Array:")
print(tfidf_array)

Running this code converts the corpus into its TF-IDF representation and prints the vocabulary and the TF-IDF array.

14. Use the Gensim library to train a Word2Vec model on the following text corpus and obtain the vector representation of the word "NLP": "Natural language processing is fun and exciting. Language models are important in NLP. Machine learning and NLP are closely related."


from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk

nltk.download('punkt')

# Sample text corpus
text = "Natural language processing is fun and exciting. Language models are important in NLP. Machine learning and NLP are closely related."

# Split the text into sentences
sentences = sent_tokenize(text)

# Tokenize each sentence into words
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

# Train a Word2Vec model with the skip-gram method (sg=1)
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, sg=1, min_count=1)

# Get the vector representation of the word "NLP"
vector = model.wv['NLP']
print("Vector representation of 'NLP':")
print(vector)

15. Use Hugging Face's transformers library to generate BERT embeddings for the following text: text = "Transformers are powerful models for NLP tasks."

This can be done with the following code:


from transformers import BertTokenizer, BertModel
import torch

# Load the pretrained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample text
text = "Transformers are powerful models for NLP tasks."

# Tokenize the text
inputs = tokenizer(text, return_tensors='pt')

# Generate the BERT embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding of the [CLS] token (represents the whole input text)
cls_embeddings = outputs.last_hidden_state[:, 0, :]

print("BERT Embeddings for the text:")
print(cls_embeddings)

16. Generate the trigrams (3-grams) from the following text: "Natural Language Processing with Python." The resulting trigrams are listed below, followed by a sketch of how to produce them.


- Natural Language Processing
- Language Processing with
- Processing with Python
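
A minimal sketch of how these trigrams can be generated, using nltk's ngrams helper together with word_tokenize (the punctuation token is filtered out so that only word trigrams remain):


import nltk
nltk.download('punkt')
from nltk import ngrams
from nltk.tokenize import word_tokenize

text = "Natural Language Processing with Python."

# Tokenize and drop the trailing period so only word tokens remain
tokens = [t for t in word_tokenize(text) if t.isalpha()]

# Build the 3-grams and join each one back into a phrase
trigrams = [" ".join(gram) for gram in ngrams(tokens, 3)]
print(trigrams)

Printing trigrams yields the three phrases listed above.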

17. Implement a hidden Markov model (HMM) for part-of-speech tagging using the following sentences and tags: sentences = [['I', 'run', 'to', 'the', 'store'], ['She', 'jumps', 'over', 'the', 'fence']], tags = [['PRON', 'VERB', 'ADP', 'DET', 'NOUN'], ['PRON', 'VERB', 'ADP', 'DET', 'NOUN']]

Here is the implementation:


import numpy as np
from hmmlearn import hmm

# Define the states (tags) and the observations (words)
states = ['PRON', 'VERB', 'ADP', 'DET', 'NOUN']
n_states = len(states)
observations = ['I', 'run', 'to', 'the', 'store', 'She', 'jumps', 'over', 'fence']
n_observations = len(observations)

# Map states and observations to integer indices
state_to_idx = {state: idx for idx, state in enumerate(states)}
observation_to_idx = {obs: idx for idx, obs in enumerate(observations)}

# Define the sentences and their tags
sentences = [['I', 'run', 'to', 'the', 'store'], ['She', 'jumps', 'over', 'the', 'fence']]
tags = [['PRON', 'VERB', 'ADP', 'DET', 'NOUN'], ['PRON', 'VERB', 'ADP', 'DET', 'NOUN']]

# Build the training sequences (word indices and tag indices)
X = [[observation_to_idx[word] for word in sentence] for sentence in sentences]
y = [[state_to_idx[tag] for tag in tag_sequence] for tag_sequence in tags]

# Convert to numpy arrays
X = np.concatenate([np.array(x).reshape(-1, 1) for x in X])
lengths = [len(x) for x in sentences]
y = np.concatenate(y)

# Create and fit the HMM model
# Note: hmmlearn's fit() is unsupervised, so the tag indices in y are not used here;
# on recent hmmlearn versions, CategoricalHMM is the class that matches this
# integer-symbol encoding
model = hmm.MultinomialHMM(n_components=n_states, n_iter=100)
model.fit(X, lengths)

# Predict the hidden states (decoding with the Viterbi algorithm)
logprob, hidden_states = model.decode(X, algorithm='viterbi')

# Map the predicted states back to tag names
hidden_states = [states[state] for state in hidden_states]
print('Observations:', sentences[0] + sentences[1])
print('Predicted states:', hidden_states)