Hi! First of all, thanks for the reply. You're right, I worded the question badly. With this code I can split the text and find the episodes in which a character appears, but I can't do the same for the scenes. I would also like, if possible, to drop the superfluous text and keep only the characters. Thanks!
This is an example of a script:
Jorah Mormont: You need to drink, child... And eat.
Daenerys Targaryen: Isn't there anything else?
Jorah Mormont: The Dothraki have two things in abundance: grass and horses. People can't live on grass... In the Shadow Lands beyond Asshai, they say there are fields of ghost grass with stalks as pale as milk that glow in the night. It murders all other grass. The Dothraki believe that one day it will cover everything. That's the way the world will end… It'll get easier.
Doreah: Khaleesi!
Irri: Your hands.
Jorah Mormont: We're still not far from Pentos, your Grace. Magister Illyrio has extended his hospitality. You'd be more comfortable there.
Viserys Targaryen: I have no interest in hospitality or comfort. I'll stay with Drogo until he fulfils his end of the bargain and I have my crown.
Jorah Mormont: As you wish, your Grace.
-----------------------------------------
Joffrey Baratheon: Better-looking bitches than you're used to, Uncle. My mother's been looking for you. We ride for King's Landing today.
Tyrion Lannister: Before you go, you will call on Lord and Lady Stark and offer your sympathies.
Joffrey Baratheon: What good will my sympathies do them?
Tyrion Lannister: None. But it is expected of you. Your absence has already been noted.
punteggiatura = '!"#$%&\'()*+,./-:;<=>?@[\\]^_`{|}~'
personaggi_in_scene = dict()
for nome_file in files:
    if nome_file.endswith('.script'):
        with open('data/' + nome_file) as f:
            for line in f:
                riga = line.strip()
                if riga.startswith('----'):
                    scena = 0
                    scena = riga.find('-----------------------------------------')
                    scena = scena + 1
                else:
                    for personaggio in riga.split(':'):
                        for c in punteggiatura:
                            riga = riga.replace(c,' ')
                        if personaggio in personaggi_in_scene:
                            personaggi_in_scene[personaggio].add(nome_file[9:10])
                        else:
                            personaggi_in_scene[personaggio] = {scena}
For now the result is this:
{'Jorah Mormont': {1, '2', '3', '4'},
' You need to drink, child... And eat.': {1},
'Daenerys Targaryen': {1, '2', '3', '4'},
" Isn't there anything else?": {1},
' The Dothraki have two things in abundance': {1},
" grass and horses. People can't live on grass... In the Shadow Lands beyond Asshai, they say there are fields of ghost grass with stalks as pale as milk that glow in the night. It murders all other grass. The Dothraki believe that one day it will cover everything. That's the way the world will end… It'll get easier.": {1},
'Doreah': {1, '2', '4'},
' Khaleesi!': {1},
'Irri': {1, '2', '3', '4'},
The first number should be the scene, but it is wrong because it always comes out as 1. The result should instead look like this, where the first number is the episode and the second is the scene:
{
"Jorah Mormont": [
"3_17",
"4_3",
"2_1",
"3_14",
"4_16"
],
"Daenerys Targaryen": [
"3_17",
"4_3",
"2_1",
"3_14",
"2_12",
"4_16",
"3_19",
"4_14",
"2_14"
],
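To make the goal concrete, here is a minimal sketch of the logic I think is needed. It is not a tested solution for the real files. It assumes that a line starting with `----` separates two scenes, that scenes are numbered from 1 within each episode, and that the speaker is everything before the first colon. The function name `personaggi_per_scena` and the `episodio` parameter are hypothetical. In my code the episode would come from `nome_file[9:10]`. Unlike my attempt, the scene counter is incremented at each separator instead of being recomputed with `find`, which returns the position of the match (0) and so always gives 1.

```python
from collections import defaultdict


def personaggi_per_scena(righe, episodio, risultato=None):
    """Map each speaker to the 'episode_scene' labels in which they speak.

    `righe` is any iterable of text lines (for example an open file),
    `episodio` is the episode label to put before the underscore.
    Passing the same `risultato` again accumulates several episodes.
    """
    if risultato is None:
        risultato = defaultdict(list)
    scena = 0  # becomes 1 at the first separator or the first dialogue line
    for line in righe:
        riga = line.strip()
        if not riga:
            continue  # skip empty lines
        if riga.startswith('----'):
            scena += 1  # assumption: each separator line opens a new scene
            continue
        # Keep only the speaker; the dialogue after the first colon is dropped.
        nome, due_punti, _ = riga.partition(':')
        if not due_punti:
            continue  # no colon, so not a "Name: text" line
        scena = max(scena, 1)  # file that does not start with a separator
        etichetta = f'{episodio}_{scena}'
        nome = nome.strip()
        if etichetta not in risultato[nome]:
            risultato[nome].append(etichetta)
    return risultato


# Small check on a few lines taken from the sample script above.
esempio = [
    "Jorah Mormont: You need to drink, child... And eat.",
    "Daenerys Targaryen: Isn't there anything else?",
    "-----------------------------------------",
    "Tyrion Lannister: None. But it is expected of you.",
]
risultato = personaggi_per_scena(esempio, episodio='2')
print(dict(risultato))
```

Using `partition(':')` instead of `split(':')` means that a colon inside the dialogue (as in "two things in abundance: grass and horses") no longer creates extra keys, and the punctuation table is not needed at all because the dialogue text is never stored.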