Introduction to Data Science, Part 9: Writing a Novel with LSTM, Part 2

Learning Data Science through AI Programming

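Do you now have a grasp of the characteristics of LSTM? Let's use LSTM to try writing a novel. This series uses Python, a language well suited to working with artificial intelligence (AI), together with Keras, an AI library. The Python version used is 3.10.12, and the Keras version is 3.5.0.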

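Before starting on the Python programming, here is the method and procedure for having an AI write a novel. When handling Japanese with an LSTM, it would probably be better to train on word units obtained through morphological analysis, but that makes things complicated, so we first try training at the character level.
To train an LSTM on text, we treat a character string as time-series data. The model learns which character comes after a string of a few characters. Then, given an unseen string, we have the LSTM output the character that follows, and in this way generate new text.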
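
As a small illustration of the idea above, the following sketch slices a short string into fixed-length windows and pairs each window with the character that follows it. The helper name `make_windows` and the sample sentence are assumptions made for this example only; the loop mirrors the one used later in the training script.

```python
# Toy illustration of the windowing idea. The function name and the sample
# text are made up for this sketch; they are not part of the original script.
def make_windows(text, maxlen, step=1):
    """Pair every window of `maxlen` characters with the next character."""
    windows, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])    # input: maxlen characters
        next_chars.append(text[i + maxlen])   # target: the character after them
    return windows, next_chars

windows, next_chars = make_windows("吾輩は猫である", maxlen=3)
for w, c in zip(windows, next_chars):
    print(w, "->", c)
```

Each printed pair is one training example: the three characters on the left are the input, and the character on the right is what the model should learn to predict.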

https://youtu.be/mb6X2IXPBeg

https://drive.google.com/file/d/1mun_MOj769r5ZZuoIGVP-oAR5djJLH9q/view?usp=drive_link

sakuhin_all.txt
https://drive.google.com/file/d/1KgJG9D1SMNSTzg49vcv7KlWgCuK4KG9I/view?usp=drive_link


!pip install mecab-python3
!pip install unidic
!python -m unidic download
!apt-get -q -y install mecab libmecab-dev file
!git clone --depth 1 https://github.com/neologd/mecab-unidic-neologd.git
!echo yes | mecab-unidic-neologd/bin/install-mecab-unidic-neologd -n


import matplotlib.pyplot as plt  # added

from tensorflow import keras
from tensorflow.keras import layers

import numpy as np
import random
import sys
import io

from google.colab import drive
drive.mount('/content/drive') 

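# cut the text in semi-redundant sequences of maxlen characters
maxlen = 3      # length of each input window, in characters
step = 1        # stride between consecutive windows
sentences = []  # input windows
gakushu = 3     # number of training epochs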

neta_path="/content/drive/MyDrive/data_science/sakuhinn_all.txt"
# Keras 3 only accepts file names ending in .keras or .h5 in model.save()
hdf_path="/content/drive/MyDrive/data_science/sakuhinn_all.h5"


with io.open(neta_path, encoding='utf-8') as f:
    text = f.read().lower()
print('corpus length:', len(text))
 
chars = sorted(list(set(text)))
print('unique characters:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

print("chars:",chars)
 
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print("sentence:",sentences)
 
print('Vectorization...')
# np.bool was removed in NumPy 1.24; the built-in bool works on every version
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
 
 
# build the model: a single LSTM
print('Defining the LSTM model...')
model = keras.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars)))
model.add(layers.Activation('softmax'))
 
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss="categorical_crossentropy", optimizer=optimizer)

model.summary()
#model.compile(loss='categorical_crossentropy')
history = model.fit(x, y, batch_size=128, epochs=gakushu)

model.save(hdf_path)

print('***** LSTM training complete! The result has been saved. *****')
print("Saved to:", hdf_path)


#print(history.history.keys()) # check the keys of the history data
#dict_keys(['val_acc', 'acc', 'val_loss', 'loss'])

import matplotlib.pyplot as plt
%matplotlib inline
 
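plt.plot(range(1, gakushu + 1), history.history['loss'], label='loss')
#plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid()
plt.legend()  # uses the label passed to plt.plot above
plt.savefig("loss.png")  # save before show(); saving afterwards writes an empty figure
plt.show()
plt.close()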
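
The script above stops once the model has been trained and saved. The generation step described at the beginning (give the model a few characters, take the predicted next character, slide the window forward) might look roughly like the sketch below. Everything in it is an assumption made for illustration: `sample_index`, `generate`, and `fake_predict` are hypothetical names, and `fake_predict` is a stand-in that cycles through three characters, not a trained LSTM.

```python
import random

def sample_index(probs, temperature=1.0, rng=random):
    """Pick an index from a list of probabilities.
    Low temperature favors likely entries; high temperature flattens them."""
    # p ** (1/T) is equivalent to exp(log(p) / T) and avoids log(0)
    weights = [p ** (1.0 / temperature) for p in probs]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

def generate(seed, length, predict_probs, indices_char, maxlen, temperature=1.0):
    """Predict one character at a time, sliding the window by one each step."""
    generated = seed
    window = seed[-maxlen:]
    for _ in range(length):
        probs = predict_probs(window)  # hypothetical: wraps the trained model
        next_char = indices_char[sample_index(probs, temperature)]
        generated += next_char
        window = (window + next_char)[-maxlen:]
    return generated

# Stand-in "model" for demonstration only: it always predicts the character
# that follows the last one in a fixed cycle. A trained LSTM would replace it.
demo_chars = ["あ", "い", "う"]
demo_indices_char = dict(enumerate(demo_chars))

def fake_predict(window):
    nxt = (demo_chars.index(window[-1]) + 1) % len(demo_chars)
    return [1.0 if i == nxt else 0.0 for i in range(len(demo_chars))]

result = generate("あいう", 6, fake_predict, demo_indices_char, maxlen=3)
print(result)
```

With the trained model, `predict_probs` would presumably one-hot encode the window in the same way as `x` in the script and return the first row of `model.predict(...)`. That wiring is not part of the original script and is left as an assumption here.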

Upload the notebook to Google Colaboratory and you can check that it runs right away. A sample of the execution results is included.
https://drive.google.com/file/d/14geNl2d8ASRXohV25DX6KfUWeS8j56bA/view?usp=drive_link

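The published video and explanatory PDF summarize, using Google NotebookLM, the content of the series of the same title that ran in Denshi Kosaku Magazine (電子工作マガジン), published by Dempa Shimbunsha (電波新聞社).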