Photoacoustic tomography (PAT) images contain inherent distortions caused by the imaging system and by heterogeneous tissue properties, and improving image quality requires removing these distortions. Although model-based and data-driven techniques have been proposed for PAT image restoration, accurate and robust image recovery remains challenging. Deep-learning-based image deconvolution has recently shown promise, but PAT imaging poses unique challenges, including spatially varying resolution and the absence of ground truth data, which call for a learning strategy tailored to PAT. Herein, we propose a configurable network model named Deep hybrid Image-PSF Prior (DIPP) that builds upon the physical image degradation model of PAT. DIPP is an unsupervised deep network that aims to recover the ideal PAT image from complex system degradation. It captures the degradation information solely from the acquired PAT image, without relying on ground truth or labeled data for training. Optionally, experimentally measured point spread functions (PSFs) of the specific PAT system can be incorporated as a reference to further enhance performance. To evaluate the algorithm’s effectiveness in addressing multiple degradations in PAT, we conduct extensive experiments on simulated images, publicly available datasets, phantom images, and in vivo small-animal imaging data. Comparisons with classical analytical methods and state-of-the-art deep learning models show that DIPP achieves significantly improved restoration of image details and contrast.
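
The abstract does not spell out the degradation model, so the following is only a minimal sketch of the generic convolutional model that image deconvolution methods commonly assume: an observed image y equals the ideal image x convolved with a PSF k, plus noise. It uses a single spatially invariant Gaussian PSF and circular boundary conditions, both of which are simplifying assumptions (the abstract notes that PAT resolution varies spatially). The names `gaussian_psf`, `degrade`, and `data_fidelity` are hypothetical and are not taken from DIPP.

```python
import numpy as np


def gaussian_psf(size, sigma):
    """Return a normalized 2-D Gaussian PSF (assumed shape, for illustration only)."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    psf = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return psf / psf.sum()


def degrade(image, psf, noise_std=0.0, rng=None):
    """Apply y = x (*) k + n using circular convolution via the FFT."""
    h, w = image.shape
    ph, pw = psf.shape
    # Zero-pad the PSF to the image size and shift its center to the origin
    # so that the blurred image is not translated.
    padded = np.zeros((h, w))
    padded[:ph, :pw] = psf
    padded = np.roll(padded, (-(ph // 2), -(pw // 2)), axis=(0, 1))
    blurred = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))
    if noise_std > 0:
        rng = np.random.default_rng() if rng is None else rng
        blurred = blurred + rng.normal(0.0, noise_std, size=image.shape)
    return blurred


def data_fidelity(observed, image_estimate, psf_estimate):
    """Mean squared residual between the re-degraded estimate and the observation."""
    residual = degrade(image_estimate, psf_estimate) - observed
    return float(np.mean(residual ** 2))
```

An unsupervised scheme of this general kind needs no ground truth because a term such as `data_fidelity` compares the re-degraded estimate only against the acquired image. If a measured PSF is available, it could be passed as `psf_estimate` instead of being estimated jointly.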