    Summary of methods for getting child nodes with the lxml library

    Updated: 2021-06-21 14:56   Source: 乐鱼电竞


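    lxml is a Python library (built on the C libraries libxml2 and libxslt) used mainly to parse HTML and XML and to extract data from them. It is feature-rich and easy to use, and it supports XPath syntax for quickly locating specific elements or nodes.

    Most of the library's functionality lives in the lxml.etree module, which is commonly imported as follows: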

    from lxml import etree

    The main building blocks of the library are:
    (1) Element class: represents a single XML node.
    (2) ElementTree class: represents a complete XML document tree.
    (3) ElementPath: an XPath-like path language used to search for and locate nodes.
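
    To make the relationship between these pieces concrete, here is a minimal sketch. The node names 'root' and 'child' are illustrative assumptions, not part of the original article.

```python
from lxml import etree

# Build a single node (Element) and give it one child node.
root = etree.Element('root')
child = etree.SubElement(root, 'child')

# Wrap the node in a document object (ElementTree).
tree = etree.ElementTree(root)

# The tree hands back the root node it was built from.
root_tag = tree.getroot().tag
print(root_tag)                  # root

# Any node can report which document it belongs to.
owner_tag = child.getroottree().getroot().tag
print(owner_tag)                 # root

# ElementPath-style lookup, relative to the root of the tree.
found_tag = tree.find('child').tag
print(found_tag)                 # child
```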

    1.Introduction to the Element class

    The Element class is the core of XML processing. It maps directly onto an XML node, and most node handling revolves around it. To create a node object, call etree.Element() directly. For example:

    root=etree.Element('root')

    In the example above, the argument 'root' is the name (tag) of the node.

    Operations on the Element class fall into three groups: operations on nodes, on node attributes, and on the text inside nodes. Each is covered in turn below.

    (1) Node operations: to get the name of a node, read its tag attribute. For example:

    print(root.tag)
    # Output
    root
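
    Node operations also include adding and reading child nodes. The following is a small sketch of the common calls; the child names child1, child2 and child3 are made up for illustration.

```python
from lxml import etree

root = etree.Element('root')

# SubElement() creates a child and attaches it to the parent in one step.
first = etree.SubElement(root, 'child1')
second = etree.SubElement(root, 'child2')

# append() attaches an element that was created separately.
root.append(etree.Element('child3'))

# An element behaves like a list of its direct children.
child_tags = [child.tag for child in root]
print(len(root))                 # 3
print(root[0].tag)               # child1
print(root[-1].tag)              # child3
print(child_tags)                # ['child1', 'child2', 'child3']

# Navigation back to the parent and across to the next sibling.
print(second.getparent().tag)    # root
print(first.getnext().tag)       # child2
```

    Indexing, len() and iteration only see direct children; deeper descendants need iter() or a path expression.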


    (2) Attribute operations: attributes can be added at the moment a node is created. Attributes are stored as key-value pairs, much like a dictionary. When creating a node, pass each attribute as a keyword argument: the argument name becomes the attribute name and the argument value becomes the attribute value. For example:

    # Create the root node and give it an attribute
    root=etree.Element('root', interesting='totally')
    print(etree.tostring(root))
    # Output
    b'<root interesting="totally"/>'

    An attribute can also be added to an existing node with the set() method, which takes two arguments: the attribute name and the attribute value. For example:

    # Add an age attribute to the root node
    root.set('age', '30')
    print(etree.tostring(root))
    # Output
    b'<root interesting="totally" age="30"/>'
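
    Reading attributes back works in a similarly dictionary-like way. A short sketch, reusing the attribute names from the examples above (the 'missing' key and the 'n/a' default are illustrative):

```python
from lxml import etree

root = etree.Element('root', interesting='totally')
root.set('age', '30')

# get() returns None (or a supplied default) when the attribute is absent.
print(root.get('interesting'))       # totally
print(root.get('missing'))           # None
print(root.get('missing', 'n/a'))    # n/a

# keys() lists attribute names; attrib is a dict-like view of the attributes.
names = sorted(root.keys())
print(names)                         # ['age', 'interesting']
snapshot = dict(root.attrib)

# attrib supports item assignment and deletion.
root.attrib['age'] = '31'
del root.attrib['interesting']
serialized = etree.tostring(root)
print(serialized)                    # b'<root age="31"/>'
```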

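    Both examples use the tostring() function, which serializes an element to an encoded string representation of its XML tree (a bytes object by default).

    (3) Text operations: text content is usually accessed through the text and tail attributes or the xpath() method. An example using the text attribute: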

    root=etree.Element('root')    # create the root node
    root.text='Hello, World!'    # add text to the root node
    print(root.text)
    print(etree.tostring(root))
    # Output
    Hello, World!
    b'<root>Hello, World!</root>'
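
    The tail attribute and the xpath() route mentioned above are not demonstrated in the original, so here is a minimal sketch. The markup is an assumption chosen to show text that follows a child element.

```python
from lxml import etree

html = etree.XML('<html><body>Hello<br/>World</body></html>')
body = html[0]
br = body[0]

# text is the text before the first child; tail is the text after an element.
print(body.text)     # Hello
print(br.text)       # None
print(br.tail)       # World

# tostring() includes the tail unless told otherwise.
with_tail = etree.tostring(br)
without_tail = etree.tostring(br, with_tail=False)
print(with_tail)     # b'<br/>World'
print(without_tail)  # b'<br/>'

# XPath can collect every text node, or the concatenated string value.
texts = html.xpath('//text()')
joined = html.xpath('string()')
print(texts)         # ['Hello', 'World']
print(joined)        # HelloWorld
```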

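    2.Parsing XML from a string or a file

    To parse XML into a tree structure, the etree module provides the following three functions:
    (1) fromstring(): parses an XML document or fragment from a string and returns the root node (or the result returned by a parser target).
    (2) XML(): parses an XML document or fragment from a string constant and returns the root node (or the result returned by a parser target).
    (3) HTML(): parses an HTML document or fragment from a string constant and returns the root node (or the result returned by a parser target).

    XML() behaves like fromstring() and is typically used to write XML literals directly into source code. HTML() automatically fills in missing <html> and <body> tags. The three functions are used as follows: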

    xml_data='<root>data</root>'
    # fromstring()
    root_one=etree.fromstring(xml_data)
    print(root_one.tag)
    print(etree.tostring(root_one))
    # XML(): essentially the same as fromstring()
    root_two=etree.XML(xml_data)
    print(root_two.tag)
    print(etree.tostring(root_two))
    # HTML(): adds <html> and <body> tags automatically if they are missing
    root_three=etree.HTML(xml_data)
    print(root_three.tag)
    print(etree.tostring(root_three))

    Program output:

    root
    b'<root>data</root>'
    root
    b'<root>data</root>'
    html
    b'<html><body><root>data</root></body></html>'
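
    One practical difference between these functions is how they treat malformed input. The sketch below uses a deliberately broken string (an assumption made for illustration) to show that the XML parser raises an error while HTML() still returns a tree.

```python
from lxml import etree

broken = '<root><item>data</root>'   # <item> is never closed

# The XML parser is strict: malformed input raises XMLSyntaxError.
try:
    etree.fromstring(broken)
    parsed_as_xml = True
except etree.XMLSyntaxError as error:
    parsed_as_xml = False
    print('XML parser rejected the input:', error)

# The HTML parser is tolerant and still builds a tree around the input.
recovered = etree.HTML(broken)
print(parsed_as_xml)     # False
print(recovered.tag)     # html
```

    For input that is supposed to be XML, the strict behavior is usually what you want, since it surfaces broken data early.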

    Besides these three functions, parse() parses an XML file directly. If no parser is supplied, the default parser is used, and the function returns an ElementTree object. For example:

    html=etree.parse('./hello.html')
    result=etree.tostring(html, pretty_print=True)
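
    A runnable variant of the parse() call, sketched under the assumption that a small sample file is written to a temporary directory first. The file name sample.html and its content are hypothetical.

```python
import os
import tempfile

from lxml import etree

markup = '<div><ul><li>first</li><li>second</li></ul></div>'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample.html')   # hypothetical file name
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(markup)

    # Without a parser argument, the default XML parser is used.
    xml_tree = etree.parse(path)
    # Passing HTMLParser() switches to the tolerant HTML parser.
    html_tree = etree.parse(path, etree.HTMLParser())

# parse() returns a tree object; getroot() gives its root element.
xml_root_tag = xml_tree.getroot().tag
html_root_tag = html_tree.getroot().tag
print(xml_root_tag)      # div
print(html_root_tag)     # html

# tostring() returns bytes; decode() turns them into text for printing.
print(etree.tostring(xml_tree, pretty_print=True).decode())
```

    The default parser only works on well-formed markup. Real-world HTML files often are not well-formed, which is when the HTMLParser variant is the safer choice.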

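    3.Introduction to ElementPath

    ElementTree comes with ElementPath, a simple path language similar to XPath. The ElementTree and Element classes provide three commonly used methods that cover most search and query needs. Each takes a path expression written in ElementPath, which supports a subset of XPath syntax:
    (1) find(): returns the first matching child element.
    (2) findall(): returns all matching child elements as a list.
    (3) iterfind(): returns an iterator over all matching elements.

    The search starts from the root node of the document tree and looks for nodes that match the expression. For example: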

    # Parse XML from a string; returns the root node
    root=etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
    # Search from the root node and print the tag of the matching node
    print(root.find("a").tag)
    # Search from the root node and print the tag of the first match
    print(root.findall(".//a[@x]")[0].tag)

    Program output:

    a
    a
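
    The three methods differ mainly in what they return. A short sketch on the same XML string, including the no-match case and the related findtext() helper:

```python
from lxml import etree

root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")

# find() returns the first match, or None when nothing matches.
print(root.find('a/b').tag)      # b
print(root.find('b'))            # None, <b> is not a direct child of <root>

# findall() returns a list of every match.
all_b = root.findall('.//b')
print(len(all_b))                # 2

# iterfind() yields matches lazily, one at a time.
child_tags = [node.tag for node in root.iterfind('a/*')]
print(child_tags)                # ['b', 'c', 'b']

# findtext() returns the text of the first match.
a_text = root.findtext('a')
print(a_text)                    # aText
```

    Because find() can return None, check the result before reading .tag or .text from it.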

    The xpath() method can also be called to evaluate a full XPath expression with the element as the context node.
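
    A minimal sketch of what "context node" means in practice, again on the XML string used earlier. Note that xpath() may return a list, a string, or a number depending on the expression.

```python
from lxml import etree

root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
a = root.find('a')

# Relative expressions are evaluated from the element itself.
b_children = a.xpath('b')
print(len(b_children))           # 2
print(a.xpath('@x'))             # ['123']
print(a.xpath('text()'))         # ['aText']

# XPath functions return plain values rather than lists of nodes.
child_count = a.xpath('count(*)')
parent_name = a.xpath('name(..)')
print(child_count)               # 3.0
print(parent_name)               # root

# './/' stays below the context node; '//' restarts at the document root.
c = a.find('c')
below_c = c.xpath('.//b')
anywhere = c.xpath('//b')
print(len(below_c))              # 0
print(len(anywhere))             # 2
```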

    Basic use of the lxml library

    An example HTML file serves as the material for introducing basic use of lxml. The file is named hello.html and contains the following:

    <!-- hello.html -->
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li> 
            <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>

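    Next, based on this HTML document, the xpath() method is called with lxml path expressions to select matching nodes, as follows:

    Getting li nodes at any position

    Use "//" to select li nodes from anywhere in the document. The path expression is:

    //li

    The xpath() method of the lxml.etree module returns the list of nodes in hello.html that match this path expression, which is then printed. The code is as follows: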

    from lxml import etree
    html=etree.parse('hello.html')
    # Find all li nodes
    result=html.xpath('//li')
    # Print the list of <li> elements
    print(result)
    # Print the number of <li> elements
    print(len(result))
    # Print the type of the result
    print(type(result))
    # Print the type of the first element
    print(type(result[0]))

    Program output:

    [<Element li at 0x2cc9a48>, <Element li at 0x2cc99c8>, <Element li at 0x2cc9a88>, <Element li at 0x2cc9ac8>, <Element li at 0x2cc9b08>]
    5
    <class 'list'>
    <class 'lxml.etree._Element'>
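
    Since each result is an ordinary element, its child nodes can be reached directly. The sketch below inlines the content of hello.html as a string (the name HELLO_HTML is an assumption) so that it runs without the file.

```python
from lxml import etree

# Same markup as hello.html, inlined so the example is self-contained.
HELLO_HTML = '''<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>'''

html = etree.fromstring(HELLO_HTML)

links = []
for li in html.xpath('//li'):
    a = li[0]                          # the only child node of each <li>
    # string() concatenates all text below <a>, including nested <span> text.
    label = str(a.xpath('string()'))
    links.append((a.get('href'), label))

print(len(links))    # 5
print(links[0])      # ('link1.html', 'first item')
print(links[2])      # ('link3.html', 'third item')
```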

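    Getting the class attribute of the <li> tags

    Append "/" to the previous expression to step down one level, and use @ to select the class attribute node. The expression is:

    //li/@class

    Example code that gets the class attribute of the <li> tags: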

    from lxml import etree
    html=etree.parse('hello.html')
    # Find the class attribute of the li tags
    result=html.xpath('//li/@class')
    print(result)

    Program output:

    ['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
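
    Attributes can also be used to filter nodes, not just to read values. The following sketch again inlines the hello.html markup (HELLO_HTML is an assumed name); the predicate expressions are illustrations that go slightly beyond the original article.

```python
from lxml import etree

# Same markup as hello.html, inlined so the example is self-contained.
HELLO_HTML = '''<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>'''

html = etree.fromstring(HELLO_HTML)

# [@class="..."] keeps only the <li> nodes whose class matches exactly.
item1_links = html.xpath('//li[@class="item-1"]/a/@href')
print(item1_links)       # ['link2.html', 'link4.html']

item0_texts = html.xpath('//li[@class="item-0"]/a/text()')
print(item0_texts)       # ['first item', 'fifth item']

# contains() matches on part of the attribute value.
inactive = html.xpath('//li[contains(@class, "inactive")]/@class')
print(inactive)          # ['item-inactive']

# Predicates and attribute selection can be combined along one path.
span_class = html.xpath('//a[@href="link3.html"]/span/@class')
print(span_class)        # ['bold']
```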

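    Getting the content of the second-to-last element

    Select the second-to-last <li> tag from anywhere in the document, then step down to its <a> tag. To get the text of that tag, use either of the following expressions:

    //li[last()-1]/a

    or

    //li[last()-1]/a/text()

    The difference is that the first expression returns the element, so its text attribute must be read to get the text, whereas the second expression returns the text directly. An example using the first path expression: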

    from lxml import etree
    html=etree.parse('hello.html')
    # Get the content of the second-to-last element
    result=html.xpath('//li[last()-1]/a')
    print(result[0].text)

    Program output:

    fourth item
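
    For comparison, here is a sketch of the second expression together with a few other position predicates. The markup is the hello.html content inlined under the assumed name HELLO_HTML.

```python
from lxml import etree

# Same markup as hello.html, inlined so the example is self-contained.
HELLO_HTML = '''<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>'''

html = etree.fromstring(HELLO_HTML)

# text() returns the text nodes directly, as a list of strings.
second_last = html.xpath('//li[last()-1]/a/text()')
print(second_last)       # ['fourth item']

# XPath positions start at 1; last() is the final sibling.
first = html.xpath('//li[1]/a/text()')
last = html.xpath('//li[last()]/a/text()')
print(first)             # ['first item']
print(last)              # ['fifth item']

# The third <a> holds its text inside a <span>, so a/text() finds nothing.
third_direct = html.xpath('//li[3]/a/text()')
third_any = html.xpath('//li[3]//text()')
print(third_direct)      # []
print(third_any)         # ['third item']
```

    The last two lines show why //text() (or string()) is often the safer choice when the text may sit inside nested tags.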

