Saturday, June 1, 2019

Reserved Indices in Keras Reuters Data Set

While working through "Classifying newswires: a multi-class classification example", I ran into a question about the offset applied to the word index of the Reuters newswires data set when decoding newswire text.
According to the author of "Deep Learning with Python", indices 0, 1 and 2 are reserved for "padding", "start of sequence", and "unknown".

If you load the Reuters data set along with its word index,

from tensorflow import keras

reuters = keras.datasets.reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data()
word_index = reuters.get_word_index()

And build the reverse mapping, index:word, writing it out to a file,

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
with open('reserve_word_index.txt', 'w') as re_word_index_file:
    for key, val in sorted(reverse_word_index.items()):
        re_word_index_file.write(str(key) + ':' + str(val) + '\n')


Opening the output file produced by the code above, I get something like this,
1:the
2:of
3:to
4:in
5:said
...
The first thing I noticed in the result: there is no index 0, index 1 is "the" rather than "start of sequence", and index 2 is "of" rather than "unknown". So the reserved indices are not in the word index itself.
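One way to see how the reserved indices fit around the word index is to build a decoder dict by hand. This is just a sketch on a toy vocabulary (the marker names `<pad>`, `<start>`, `<unk>` are my own labels, not anything shipped with Keras):

```python
# Toy stand-in for reverse_word_index (index -> word);
# in the real data set 1 is "the", 2 is "of", and so on.
reverse_word_index = {1: 'the', 2: 'of', 3: 'to', 4: 'in', 5: 'said'}

# Reserve 0, 1, 2 for padding / start of sequence / unknown,
# and shift every real word index up by 3.
decoded_index = {0: '<pad>', 1: '<start>', 2: '<unk>'}
decoded_index.update({i + 3: w for i, w in reverse_word_index.items()})

print(decoded_index[1])  # '<start>' -- a reserved marker, not a word
print(decoded_index[4])  # 'the' -- word index 1 shifted up by 3
```

With this mapping, the stored indices in train_data can be looked up directly, without subtracting anything at decode time.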

I printed out the first newswire in the train data to see what's inside,
print(train_data[0])
[1, 27595, 28842, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]

Then I decoded train_data[0] without the -3 offset,
decoded_newswire = ' '.join([reverse_word_index.get(i, '?') for i in train_data[0]])
print(decoded_newswire) 
"the wattie nondiscriminatory mln loss for plc said at only ended said commonwealth could 1 traders now april 0 a after said from 1985 and from foreign 000 april 0 prices its account year a but in this mln home an states earlier and rise and revs vs 000 its 16 vs 000 a but 3 psbr oils several and shareholders and dividend vs 000 its all 4 vs 000 1 mln agreed largely april 0 are 2 states will billion total and against 000 pct dlrs"


Now, decode the newswire with the -3 offset, as the author of Deep Learning with Python does,
decoded_newswire = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])
print(decoded_newswire)
"? mcgrath rentcorp said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3"

So my conclusion is that every word index in the data set is shifted to make room for the "padding", "start of sequence", and "unknown" markers; in other words, 3 is added to each word's index when a newswire is encoded. For example, "dramatically", which is 6893 in the word index, is actually stored (encoded) as 6896 in the data set. When you decode a newswire and map it back through the word index dictionary, you need to offset each stored index by 3 to get the actual word.
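The conclusion can be checked with a small round trip on a toy vocabulary. The dictionaries below are made up for illustration (in the real data set the shift is applied when load_data builds the sequences):

```python
# Toy word_index (word -> rank), mimicking reuters.get_word_index()
word_index = {'the': 1, 'of': 2, 'dramatically': 6893}

# Encoding as stored in train_data: rank + 3, leaving 0-2 reserved
def encode(word):
    return word_index[word] + 3

# Decoding: subtract 3 before looking up the rank
reverse_word_index = {v: k for k, v in word_index.items()}
def decode(idx):
    return reverse_word_index.get(idx - 3, '?')

print(encode('dramatically'))  # 6896 -- rank 6893 shifted up by 3
print(decode(6896))            # 'dramatically'
print(decode(1))               # '?' -- reserved index, no word behind it
```

This is why the decoded text with the -3 offset reads as real English, while the unshifted decode maps every index onto the wrong word.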


