old-svevijesti/pyth/__pycache__/scrapingsingle.cpython-310.pyc

59 lines
6.5 KiB
Plaintext

2024-01-02 15:00:07 +01:00
o
2024-01-06 08:17:05 +01:00
<00>F<EFBFBD>e<EFBFBD>&<00> @s<>ddlmZddlZddlmZddlmZmZddlZddl m
Z
ddl m Z m Z mZmZmZmZmZddlZddlmZddlZe<15>e<11>e<08>d<08>Ze<06>Ze
<EFBFBD>Zgd <09>Zd
d iZd4d edefdd<10>Zdd<12>Z dd<14>Z!dd<16>Z"e#<23>Z$e#<23>Z%dd<18>Z&e#<23>Z'eD]Z(e&e(e'<27>Z)e)r<>e$<24>*e)<29>q<>dd<1A>e$D<00>Z+e#e<10><00>Z,e+e,Z-e-Z+e#e+<2B>Z+e"e+<2B>Z+e.dk<02>r_e+D]<5D>Z/e<02>0e/e<1C>Z1ee1j2d<1C>Z3e3<65>4gd<1D><01>Z5d<1E>6dd <20>e5D<00><01>Z7e3<65>4d!g<01>Z8d<1E>6d"d <20>e8D<00><01>Z9e9Z9e7Z7e!e7<65>Z7e:d#ee9d <0C><02><00><02>e e9<65>Z9e!ee9<65><01>Z9zIej;j<j=d d$d%d&<26>d'd(e7<65>d)e9<65>d*<2A>d&<26>gd+<2B>Z>e>j?dj@jAZBeBZBe<13>CeB<65>ZDeDd,ZEeDd-Z2e<1A>FeB<65>ZGe eEe2e/eGd.d/<2F><05>sCd0ZHe eEe2e/eGeH<65>Wq<>eI<65>y^ZJz e:d1eJ<65><00><02>WYdZJ[Jq<4A>dZJ[Jwwd2d3<64>ZKdS)5<>)<01> BeautifulSoupN)<01>urljoin)<02>OpenAI<41>APIError)<01>OpenAIEmbeddings)<07> insert_data<74>is_similar_data<74> get_similar<61>get_specific_data<74> get_all_links<6B> cleansing<6E>modify_similar_data)<01> load_dotenv<6E>OPENAI_API_KEY)zhttps://klix.bazhttps://srpskainfo.comzhttps://bljesak.infoz
User-Agentz<74>Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36<EFBFBD> gpt-3.5-turbo<62>string<6E>returncCst<00>|<01>}t|<02>|<00><01>S)N)<04>tiktoken<65>encoding_for_model<65>len<65>encode)r<00>model<65>encoding<6E>r<00>>/home/asabani/Desktop/svevijesti-master/pyth/scrapingsingle.py<70>num_tokens_from_strings
rcCsHd}d}t<00>|<01>}|<03>|<00>}t|<04>|kr|gS|d|<02>}|<03>|<05>}|S)Nr<00><>)rrrr<00>decode)<07>text<78> encoding_name<6D>
max_tokensr<00>tokens<6E> sliced_tokens<6E> sliced_textrrr<00>slice_text_at_2k_tokenss

  
r$cs d<01>d<02><00>fdd<04>|D<00><01>}|S)NuYABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzČčĆćDždžĐ𩹮ž0123456789 <20>c3s <00>|] }|<01>vr
|ndVqdS)<02> Nr)<02>.0<EFBFBD>char<61><01> allowed_charsrr<00> <genexpr>0s<02>z&replace_with_spaces.<locals>.<genexpr>)<01>join)r<00> cleaned_textrr)r<00>replace_with_spaces.sr.cCs>t<00>}|D]}d|vr|<02>dd<03>}|<01>|<03>q|<01>|<02>q|S)N<>wwwzwww.r%)<03>set<65>replace<63>add)<04> links_set<65>modified_links<6B>link<6E> modified_linkrrr<00> fix_links4s   r7c
Cs<>t<00>|t<02>}|jdkr@t|jd<02>}|<03>d<03>}g}|D]#}|jddd<06>}|D]}t||d<00>} | |vr<|<05>| <09>|<01> | <09>q%q|SdS)N<><4E><00> html.parser<65>article<6C>aT)<01>hrefr<)
<EFBFBD>requests<74>get<65>headers<72> status_coderr<00>find_allr<00>appendr2)
<EFBFBD>url<72>already_checked<65>response<73>soup<75>articles<65>
link_storer:<00>linksr5<00>
link_valuerrr<00>get_article_linksDs 
 


<02><02><04>rKcCsh|]}|r|<01>qSrr)r'<00>itemrrr<00> <setcomp>]srM<00>__main__r9)<03>h2<68>h1<68>h3r&cC<00>g|]}|jdd<01><01>qS<00>T)<01>strip<69><01>get_text)r'<00>titlerrr<00>
<listcomp>m<00>rX<00>pcCrRrSrU)r'rrrrrXprYzTokens usage: <20>system<65>+Data analytic, Journalist and News reporter<65><02>role<6C>content<6E>userz>Extract relevant information from the following input: Title: z, Text: z<>. Remove any non-news element related to the current text and title, and provide the cleaned data as a JSON object with 'title' and 'content' fields.<2E>r<00>messagesrWr_<00>\<5C><><EFBFBD>(\<5C>?<3F><01> threshold<6C>NO<4E>Error in completion: c$ Cs<>td<01>t<01>}i}|D<00>]<5D>}|<00>r<>t|d<00>}t|d<00>}|dd}|dd}|dd}|dd}|dd} |dd}
| |vrO|| <00>||f<02>n||fg|| <|
|vrd||
<00>||f<02>n||fg||
<|<01><04>D]<5D>\} } t| <0C>dkr<>| d\} }| d\}}| d\}}t| <0A>}t|<0F>}t|<11>}|||f}|dkr<>| <0A>|<0F>|<11><00>}t|<17>}d|<17>d<08>}||kr<>||kr<>||kr<>|<0E>d |<10>d |<12><00>}qo|}qod
| <0A>d |<0F>d |<11>d <0C>}||kr<>||kr<>||kr<>|<0E>d |<10>d |<12><00>}qo|}qot|<07>}t|<08>}||}|dk<00>r|<07>|<08><00>}d|<17>d <0A>}||k<03>r|<05>d |<06><00>}qo|}qod|<07>d |<08>d <0C>}||k<03>r5|<05>d |<06><00>}qo|}qoztj j
j dddd<12>d|d<12>gd<14>}|j dj j}|<00>r<>| |
k<02>rlt| <09>t|d<15>|<00>|<02>td<16>n&td| <09><00><02>td|
<EFBFBD><00><02>t|d<15>t|d<15>|<00>|<02>td<16>ntd<19>t<11>|<1E>}| } |d} t<13>|<1E>}!t| | ||!dd<1C><05>s<>d}"t| | ||!|"<22>Wq t<17>y<>}#z td|#<23><00><02>WYd}#~#q d}#~#wwq dS)NzChecking similarr<00><00><00>rz Here is text zg, combined from 3 sources, filter text, and make news content, return as JSON only with 'content' fieldr&zHere are 3 texts z and z<>, combine the following texts into a cohesive news remove any non-news related to both texts and provide the cleaned data as a JSON only with 'content' field.zg, combined from 2 sources, filter text, and make news content, return as JSON only with 'content' fieldzHere are 2 texts rr[r\r]r`ra<00>SOURCE<43>ModifiedzFirst: zSecond: zSimilar list is emptyr_rcrdrfrg)<18>printr r
rB<00>itemsrrr$<00>client<6E>chat<61> completions<6E>create<74>choices<65>messager_r <00>remove<76>json<6F>loads<64>
embeddings<EFBFBD> embed_queryrr<00> Exception)$<24>similar_article<6C> grouped_data<74>sa<73>first_t<5F>second_t<5F>link_f<5F>link_s<5F>f_text<78>s_text<78>f_title<6C>s_titlerW<00>tuples<65>text1<74>link1<6B>text2<74>link2<6B>text3<74>link3<6B>t1check<63>t2check<63>t3check<63> slice_if_more<72> combined_text<78> user_messager5<00>ftcheck<63>stcheck<63>fscomb<6D>
completion<EFBFBD>generated_text<78> response_datar<00>vector<6F> similar_d<5F>errr<00> comb_similar<61>s<>
            
 
  

<02><06>









<04> <08><02><02><04>r<EFBFBD>)r)L<>bs4rr=<00> urllib.parser<00>openairr<00>os<6F>langchain.embeddingsr<00>vectDatarrr r
r r r rv<00>dotenvrr<00>getenvrrorx<00>dlinksr?<00>str<74>intrr$r.r7r0<00> total_links<6B>collected_newsrKrD<00>dlink<6E>
temp_links<EFBFBD>update<74> final_links<6B>db_links<6B> new_links<6B>__name__r5r>rErrFrA<00>titlesr,<00>
title_text<EFBFBD>texts<74> text_textrmrprqrrr<>rsrtr_r<>rwr<>rWryr<>r<>rzr<>r<>rrrr<00><module>s<>   $ 
 

<02>

    <02><06>

<04> <08><02>