Then, taking its overall skeleton into account as well, we can infer that it converts text into WAV audio.